EFForg / badger-sett

Automated training for Privacy Badger. Badger Sett automates browsers to visit websites to produce fresh Privacy Badger tracker data.
https://www.eff.org/badger-pretraining
MIT License

In crawler.py, NUM_SITES = 2000. Is this how many sites Privacy Badger currently trains on? #77

Closed jawz101 closed 9 months ago

jawz101 commented 9 months ago

Can you all raise it to 100,000, a million, or as many as possible? It seems awfully low if users are not training each of their extensions against their own browsing habits.

The value of a tracker's tracking depends on each person's unique browsing habits. It makes sense to train on as many sites as possible rather than limiting the data set to the 2000 most commonly visited sites.

I can think of car dealerships, smaller storefronts, travel sites, and local news sites with loads of trackers that are not likely in the top 2000 sites in the world.

I just opened up all of my bookmarks and Privacy Badger learned a new tracking domain. I'd argue that simply visiting a website's main page is not enough. If I owned an online store, my trackers might not be on the main page. I would want to track each product that was clicked on, which may mean my trackers are on each product page. If the test emulated clicking on a few product pages, you might find a different set of trackers on each one.
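To make the idea concrete, here is a rough sketch of picking a few same-site links from a page so a crawler could visit them after the main page. This is not how Badger Sett works; `LinkCollector`, `pick_internal_links`, and the `limit` parameter are hypothetical names for illustration, and it only looks at plain `<a href>` links in static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pick_internal_links(base_url, html, limit=3):
    """Return up to `limit` distinct links on the same host as `base_url`."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    picked = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        url = urljoin(base_url, href).split("#")[0]
        if urlparse(url).netloc != base_host:
            continue  # skip third-party hosts, mailto:, javascript:, etc.
        if url == base_url or url in picked:
            continue  # skip the page itself and duplicates
        picked.append(url)
        if len(picked) == limit:
            break
    return picked
```

A real crawl would still need some way to guess which of those links are product pages, which this sketch does not attempt.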

ghostwords commented 9 months ago

Hello!

Previous discussion.

We attempt to visit 30K sites at this point. The plan is to continue to gradually increase the scope.

If we increase our training scope too quickly, we will have to deal with too many false positives and too many broken sites.

Check out our recent blog post about the evolution of Privacy Badger's learning.

ghostwords commented 9 months ago

We do already attempt to identify and visit news article links. Good idea about doing the same for product pages.

jawz101 commented 9 months ago

Is it still looking at only the top 2000 sites? Most of these sites don't necessarily use the variety of trackers out there.

I did a cursory search, looking at about 8 local car dealership sites and viewing the trackers on them. Several of them had trackers that were 'green', ones that I would expect to be on many car and shopping sites. Dealerinspire.com, carfax.com, and pushowl.com were 3 I recall offhand.

Other various sites I might visit, with their rankings:

travelocity.com: 5054
orbitz.com: 10442
soylent.com: 159821
feetures.com: 292337
petco.com: 5472
petsmart.com: 4830
salomon.com: 224504

A few of my local car dealerships are ranked somewhere in the 1-3 millions.
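As a quick sanity check on coverage, here is a small sketch that compares the ranks listed above against a top-N cutoff. It assumes the crawler simply takes the top N entries of a ranked list and that these ranks come from a comparable list; neither is confirmed in this thread. `SITE_RANKS` and `covered_sites` are made-up names, and the cutoffs 2000 and 30000 are the two figures mentioned here.

```python
# Ranks as quoted in this comment; the ranking source is not stated.
SITE_RANKS = {
    "travelocity.com": 5054,
    "orbitz.com": 10442,
    "soylent.com": 159821,
    "feetures.com": 292337,
    "petco.com": 5472,
    "petsmart.com": 4830,
    "salomon.com": 224504,
}


def covered_sites(ranks, cutoff):
    """Return sites ranked within the top `cutoff`, best rank first."""
    inside = [site for site, rank in ranks.items() if rank <= cutoff]
    return sorted(inside, key=ranks.get)


if __name__ == "__main__":
    for cutoff in (2000, 30000):
        inside = covered_sites(SITE_RANKS, cutoff)
        print(f"top {cutoff}: {len(inside)} of {len(SITE_RANKS)} covered: {inside}")
```

Under those assumptions, none of these sites fall inside a top-2000 list, and the three ranked above 150,000 fall outside a top-30,000 list as well.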

The top 2000 most popular sites are going to be things like bing.com, google.com, facebook.com, amazon.com, netflix.com, foxnews.com, espn.com, cnn.com... which only tell a tracker pretty generic things about someone: "they like news, some sort of sport, and watch things on a streaming service."

ghostwords commented 9 months ago

Is it still looking at only the top 2000 sites?

Did you see my earlier response? We're up to 30,000 and climbing.