EFForg / badger-sett

Automated training for Privacy Badger. Badger Sett automates browsers to visit websites to produce fresh Privacy Badger tracker data.
https://www.eff.org/badger-pretraining
MIT License

Recommendation on range and some filters #61

Closed jawz101 closed 3 years ago

jawz101 commented 3 years ago

Of the top 2000 domains on the current Tranco list, .org, .gov, and .edu account for approximately 275 entries, or about 13.75% of the domains being tested.

These are domains for organizations, government, and military use - at least in the United States - and are likely to have significantly less potential for commercial tracking. And even among the rest, the top 2000 sites are not going to be full of trackers (e.g. google, youtube, facebook, microsoft, wikipedia, twitter, pinterest, amazon, netflix, vimeo, wordpress, github, windowsupdate, etc.). How many 3rd-party domains do these sites have? Not many.

I suggest training further down the list. Say, ranks 4000 through 10000 (or more), excluding any domains matching `\.org$`, `\.edu$`, or `\.gov\.|\.gov$` (the third pattern also excludes government sites from other countries, e.g. `.gov.uk`).
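The slice-and-exclude idea above could be sketched roughly like this in Python. This is only an illustration: the function name, the rank range defaults, and the exact exclusion regex are my assumptions, not anything badger-sett actually ships.

```python
import re

# Hypothetical exclusion pattern: .org, .edu, and .gov (including
# country variants like .gov.uk). Not badger-sett's actual logic.
EXCLUDE = re.compile(r"\.(org|edu)$|\.gov(\.[a-z]{2,3})?$")

def filter_tranco(rows, start=4000, end=10000):
    """Yield domains whose Tranco rank falls in [start, end] and
    which don't match the exclusion pattern.

    rows: iterable of (rank, domain) pairs, as in the Tranco CSV.
    """
    for rank, domain in rows:
        if start <= int(rank) <= end and not EXCLUDE.search(domain):
            yield domain
```

Fed a parsed Tranco CSV, this would drop `example.org`, `nasa.gov`, and `service.gov.uk` while keeping commercial domains in the chosen rank window.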

Or, better yet, use a list of sites in certain categories (health/beauty/medical, crafts and hobbies, food, music, movies, entertainment, blogs, news, etc.)

A resource that does categorize sites is the popular Shalla list: http://www.shallalist.de/categories.html

The default Privacy Badger data is pretty small and should pick up your local newspapers and game sites... in short, the top sites do the tracking on the bottom sites to get you to visit the top sites.

Check this:

https://webcookies.org/number-of-cookies

google.com: 2 cookies

Compare that to any of these sites:

- www.ogaracoach.com
- http://www.10greatlines.com/
- https://newsinfo.inquirer.net
- https://www.favecrafts.com
- www.ibtimes.co.uk

ghostwords commented 3 years ago

Hello!

Daily scans at this point visit the top 5,000 sites. Scans prior to recent Privacy Badger updates have been run on 10,000 sites. To further improve pre-trained coverage prior to releasing a new Privacy Badger update, we started combining several recent scans into a single "synthetic" data set. We're now looking into automatically running crawls in parallel and combining the results. This should let us cover many more sites, eventually as part of daily scans.

We already exclude certain TLDs from scans by specifying `--exclude` at runtime.
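The thread doesn't show how `--exclude` is interpreted, but a simple suffix-based exclusion check, taking a comma-separated list of TLD suffixes, might look like this. The function name and the comma-separated argument format are assumptions for illustration; badger-sett's actual parsing may differ.

```python
def excluded(domain: str, exclude_arg: str) -> bool:
    """Return True if the domain ends with any suffix from a
    comma-separated exclusion list, e.g. ".gov,.mil".

    Illustrative sketch only; not badger-sett's actual --exclude logic.
    """
    suffixes = [s.strip() for s in exclude_arg.split(",") if s.strip()]
    return any(domain.endswith(s) for s in suffixes)
```

With an exclusion list like `".gov,.mil"`, `nasa.gov` would be skipped while `example.com` would still be scanned.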

jawz101 commented 3 years ago

Thanks. Have you thought of maybe trying certain categories of sites? I've just been thinking about where the tracking happens.

ghostwords commented 3 years ago

Not yet. It would indeed be nice to be able to break down the Tranco list by categories and geographies.

ghostwords commented 3 years ago

Tangentially related: To increase coverage, we find and click a random internal link on each website. To avoid slowing down scans too much, we use a simple heuristic to limit clicking to "news" links.
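The "news" heuristic isn't spelled out in the thread. One hypothetical version, written as a sketch rather than badger-sett's real implementation, would match URL paths that look like dated articles or news sections:

```python
import re

# Hypothetical "news-like link" pattern: date segments (/2021/05/)
# or common article path words. Not badger-sett's actual heuristic.
NEWS_PATH = re.compile(r"/\d{4}/\d{2}/|/(news|article|story|post)s?/", re.I)

def looks_like_news_link(href: str) -> bool:
    """Return True if the link's URL path looks like a news article."""
    return bool(NEWS_PATH.search(href))
```

A crawler could apply such a filter to each page's internal links and click one random match, keeping the extra page load cheap relative to scanning every link.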

jawz101 commented 3 years ago

That's really interesting, thank you for the info. I'll go ahead and close this. Thanks again!