Closed jawz101 closed 3 years ago
Hello!
Daily scans at this point visit the top 5,000 sites. Scans prior to recent Privacy Badger updates have been run on 10,000 sites. To further improve pre-trained coverage prior to releasing a new Privacy Badger update, we started combining several recent scans into a single "synthetic" data set. We're now looking into automatically running crawls in parallel and combining the results. This should let us cover many more sites, eventually as part of daily scans.
We already exclude certain TLDs from scans by specifying --exclude at runtime.
Thanks. Have you thought of maybe trying certain categories of sites? I've just been thinking about where the tracking happens.
Not yet. It would indeed be nice to be able to break down the Tranco list by categories and geographies.
Tangentially related: To increase coverage, we find and click a random internal link on each website. To avoid slowing down scans too much, we use a simple heuristic to limit clicking to "news" links.
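For readers following along, here is a rough sketch of what "pick a random internal link" could look like. It is not the scanner's actual code, and the heuristic that limits clicks to "news" links is not described in this thread. The function name `pick_internal_link` and the `keyword` substring check are hypothetical stand-ins.

```python
import random
from urllib.parse import urljoin, urlparse

def pick_internal_link(page_url, hrefs, keyword=None, rng=random):
    """Pick a random same-host link from hrefs, or None if there is none.

    keyword is a hypothetical stand-in for a "news"-style heuristic:
    when given, only URLs containing that substring are considered.
    """
    host = urlparse(page_url).netloc.lower()
    candidates = []
    for href in hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc.lower() != host:
            continue  # skip links that leave the site
        if keyword and keyword not in absolute.lower():
            continue  # assumed heuristic: cheap substring filter
        candidates.append(absolute)
    return rng.choice(candidates) if candidates else None
```

Restricting the candidate set before choosing keeps the extra page load to one per site, which is the part that matters for scan time.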
That's really interesting. Thank you for the info. I'll go ahead and close this. Thanks again.
Of the top 2000 domains on the current Tranco list, .org, .gov, and .edu account for approximately 275, or about 13.75% of the domains being tested.
These are domains for organizations, government, and educational use, at least in the United States, and are likely to have significantly less potential for commercial tracking. In fact, even the top 2000 sites that do track are not going to be full of trackers (e.g. google, youtube, facebook, microsoft, wikipedia, twitter, pinterest, amazon, netflix, vimeo, wordpress, github, windowsupdate, etc.). How many 3rd-party domains do these sites have? Not many.
I suggest training further down the list. Say, 4000 through 10000 (or more), excluding any domains matching \.org$|\.edu$|\.gov\.|\.gov$ (the 3rd rule excludes government sites from other countries, such as those under .gov.uk).
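A minimal sketch of that filter, assuming the Tranco list has already been loaded into a Python list ordered by rank. The function name and the escaped form of the pattern are illustrative choices, not anything the scanner currently does.

```python
import re

# Exclusion pattern from the suggestion above, with the dots escaped.
# The \.gov\. alternative catches names such as service.gov.uk.
EXCLUDED = re.compile(r"\.org$|\.edu$|\.gov\.|\.gov$")

def select_training_domains(ranked_domains, start=4000, end=10000):
    """Return domains ranked start..end (1-based, inclusive) that do not match EXCLUDED."""
    window = ranked_domains[start - 1:end]
    return [d for d in window if not EXCLUDED.search(d.lower())]
```

For example, `select_training_domains(domains)` would keep ranks 4000 through 10000 and drop anything ending in .org, .edu, or .gov, plus anything containing .gov. in the middle.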
Or, better yet, use a list of sites in certain categories (health/beauty/medical, crafts and hobbies, food, music, movies, entertainment, blogs, news, etc.)
A popular resource that categorizes sites is http://www.shallalist.de/categories.html
The default Privacy Badger is pretty small and should pick up your local newspapers and game sites. In short, the top sites do the tracking on the bottom sites to get you to visit the top sites.
check this
https://webcookies.org/number-of-cookies
google.com 2 cookies
compare that to any of these sites:
www.ogaracoach.com
http://www.10greatlines.com/
https://newsinfo.inquirer.net
https://www.favecrafts.com
www.ibtimes.co.uk