EFForg / badger-sett

Automated training for Privacy Badger. Badger Sett automates browsers to visit websites to produce fresh Privacy Badger tracker data.
https://www.eff.org/badger-pretraining
MIT License

Ideas for increasing scan efficiency (decreasing the error rate) #79

Closed jawz101 closed 7 months ago

jawz101 commented 9 months ago

https://github.com/EFForg/badger-sett/commit/9bbeb458d49e55e8173d3e4a61b8c9a5e121edce

Of the 6000 sites tested, the log files show that the scan errors on 24% of the websites it tried to test. A summary of the types of errored sites:

UnexpectedAlertPresentException: 3
Reached error page: 1043
Reached Cloudflare security page: 65
Encountered unhandled user prompt dialog: 3
Timed out loading...: 311
NoSuchWindowException: 1
Likely bug: 1
InvalidSessionIdException: 1
InsecureCertificateException: 22
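As a quick check on that summary, here is a minimal sketch that totals the counts and derives the error rate. The counts and the 6000-site total are copied from this comment; the variable and function names are only illustrative and do not come from Badger Sett.

```python
# Counts copied from the summary above, sorted from most to least frequent.
error_counts = {
    "Reached error page": 1043,
    "Timed out loading...": 311,
    "Reached Cloudflare security page": 65,
    "InsecureCertificateException": 22,
    "UnexpectedAlertPresentException": 3,
    "Encountered unhandled user prompt dialog": 3,
    "NoSuchWindowException": 1,
    "Likely bug": 1,
    "InvalidSessionIdException": 1,
}
SITES_TESTED = 6000  # number of sites in the scan, as stated above

total_errors = sum(error_counts.values())
error_rate = total_errors / SITES_TESTED


def share_of_errors(kind: str) -> float:
    """Fraction of all errors that belong to one category."""
    return error_counts[kind] / total_errors


if __name__ == "__main__":
    print(f"{total_errors} errors, {error_rate:.1%} of sites")
    for kind in error_counts:
        print(f"{share_of_errors(kind):6.1%}  {kind}")
```

The counts add up to 1450 errors, which matches the 24% figure. "Reached error page" and "Timed out loading..." together account for roughly 93% of them.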

1) One thing I noticed is that the scan excludes the .mil and .gov suffixes. I'd also suggest excluding .edu domains. While some have trackers, it's probably not worth training on them, as those trackers are likely to show up on other sites already. Sites with a .org suffix might be of some benefit to test, but less than a typical .com site.

2) Other countries may use .gov, .edu, or .mil earlier in their suffix. For example: epfindia.gov.in, sbv.gov.vn, conicet.gov.ar, nsw.gov.au.
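Points 1 and 2 could be combined into one check, sketched below. This is a heuristic under stated assumptions, not Badger Sett's actual exclusion logic: `EXCLUDED_LABELS` and `is_excluded` are hypothetical names, including "edu" reflects the suggestion above rather than current behavior, and a more robust version would consult the Public Suffix List instead of counting labels.

```python
# Hypothetical label set: "gov" and "mil" per the existing exclusion described
# above, "edu" per the suggestion in point 1.
EXCLUDED_LABELS = {"gov", "mil", "edu"}


def is_excluded(domain: str, suffix_depth: int = 2) -> bool:
    """Return True if an excluded label appears in the last `suffix_depth` labels.

    The leftmost label is never inspected, since it is the site's own name,
    so "edu.example.com" is not excluded while "example.edu.au" is.
    """
    labels = domain.lower().rstrip(".").split(".")
    suffix_candidates = labels[1:][-suffix_depth:]
    return any(label in EXCLUDED_LABELS for label in suffix_candidates)


if __name__ == "__main__":
    for d in ["epfindia.gov.in", "sbv.gov.vn", "conicet.gov.ar", "nsw.gov.au",
              "example.com", "edu.example.com"]:
        print(d, is_excluded(d))
```

Looking only at the last two labels catches both "example.gov" and "example.gov.in" without matching a label that merely appears deeper in a hostname.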

3) I still think more sites could be tested, as long as a scan can finish before the next one starts. And they don't necessarily need to be the top 6000 sites. Even a randomized list of 20,000 of the top million sites, for example, might be more effective. Imagine that google.com and its country sites are all in the top 6000 websites visited. Google isn't a tracker-heavy site, but my local car dealership is probably in the top million and stuffed with trackers.
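The randomized list in point 3 could be drawn as in the sketch below. It assumes the ranked list is already loaded as a Python list of domains; `sample_sites` is a hypothetical helper, not something Badger Sett provides.

```python
import random


def sample_sites(ranked_domains, sample_size, seed=None):
    """Pick `sample_size` distinct domains uniformly at random.

    The result keeps the original rank order, so higher-ranked sites are
    still visited first. Passing a seed makes the draw reproducible.
    """
    if sample_size >= len(ranked_domains):
        return list(ranked_domains)
    rng = random.Random(seed)
    picked_indexes = rng.sample(range(len(ranked_domains)), sample_size)
    return [ranked_domains[i] for i in sorted(picked_indexes)]


if __name__ == "__main__":
    # Placeholder domains standing in for a top-million list.
    top_million = [f"site{i}.example" for i in range(1_000_000)]
    todays_sites = sample_sites(top_million, 20_000, seed=2024)
    print(len(todays_sites), todays_sites[:3])
```

A uniform draw gives every site in the list the same chance. Whether that finds more trackers than the top-ranked sites is the open question raised here, not something the sketch settles.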

ghostwords commented 9 months ago

Thanks for the suggestions!

We don't need to optimize everything at once; we should focus on the biggest problems and go from there. The greatest source of errors is domains that don't have a website on them. This is an input data issue: the Tranco list is full of such domains. One way we could improve this is by filtering the Tranco list through the Chrome User Experience Report (CrUX), which contains only actual website domains. The main challenge there is setting up programmatic access to CrUX data. Our scripts should be able to grab the latest report just as they grab the latest Tranco list.
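Once both data sets are available locally, the filtering step itself is small. The sketch below assumes the Tranco list has been read into a list of domains in rank order and the CrUX data into an iterable of origin URLs such as "https://www.example.com". Both function names are hypothetical, and fetching CrUX data, the hard part mentioned above, is not covered.

```python
from urllib.parse import urlsplit


def crux_known_domains(origins):
    """Collect every hostname seen in CrUX plus each of its parent domains.

    "https://www.example.com" contributes "www.example.com" and "example.com",
    so a ranked entry for "example.com" still matches. Bare TLDs are not added.
    Entries without a scheme have no parseable hostname and are skipped.
    """
    known = set()
    for origin in origins:
        host = urlsplit(origin).hostname
        if not host:
            continue
        labels = host.lower().split(".")
        for i in range(len(labels) - 1):
            known.add(".".join(labels[i:]))
    return known


def filter_ranked_domains(ranked_domains, origins):
    """Keep only ranked domains that CrUX has seen, preserving rank order."""
    known = crux_known_domains(origins)
    return [d for d in ranked_domains if d.lower() in known]


if __name__ == "__main__":
    ranked = ["example.com", "parked-domain.test", "news.example.org"]
    origins = ["https://www.example.com", "https://news.example.org"]
    print(filter_ranked_domains(ranked, origins))
```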

As for how many websites we visit, my answer is still the same: https://github.com/EFForg/badger-sett/issues/77#issuecomment-1849029248

ghostwords commented 9 months ago

Tranco's custom list generation UI includes an option to "Only include domains included in the Chrome User Experience Report of December 2023", but I don't see this option in their API. I previously requested this capability in https://github.com/DistriNet/tranco-list/issues/18, but unfortunately I haven't heard back yet.

jawz101 commented 9 months ago

Thanks @ghostwords. And I'm sorry you had to correct my numbers thing... again. This machine you all built is such a fascinating concept. I wish I could program, because Privacy Badger is the sort of little machine that replaces a lot of labor-intensive work.

Would there be any merit to a temporary skip file based on previous scans? Say, if a site had no trackers, no new trackers, or errored out, skip testing it for 30 days.

Essentially, it would be an incentive for sites that do not track, and it would let Privacy Badger move further down the list.
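To make the skip-file idea concrete, here is a minimal sketch. It assumes a JSON file mapping each domain to the ISO date until which it should be skipped. The file format, function names, and `SKIP_DAYS` constant are all hypothetical, and nothing like this is confirmed to exist in Badger Sett.

```python
import json
from datetime import date, timedelta

SKIP_DAYS = 30  # the 30-day window suggested above


def load_skips(path):
    """Read the skip file, or start empty if it does not exist yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_skips(path, skips):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(skips, f, indent=2, sort_keys=True)


def should_skip(skips, domain, today):
    """True while the domain's skip window has not yet expired."""
    skip_until = skips.get(domain)
    return skip_until is not None and today < date.fromisoformat(skip_until)


def record_result(skips, domain, today, new_trackers, errored):
    """Skip a site for SKIP_DAYS if it errored or yielded no new trackers.

    A site with no trackers at all also has zero new trackers, so one
    check covers both of the "nothing learned" cases.
    """
    if errored or new_trackers == 0:
        skips[domain] = (today + timedelta(days=SKIP_DAYS)).isoformat()
    else:
        skips.pop(domain, None)
```

A scan loop would call `should_skip` before each visit and `record_result` after it, then write the file once at the end with `save_skips`.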

In my mind, I wonder what would happen if I scan every site in the world? How many reds and greens were recorded? At what number does scanning additional sites have a diminishing return? What made the greens not red?

ghostwords commented 9 months ago

> Would there be any merit to a temporary skip file based on previous scans? Say, if a site had no trackers, no new trackers, or errored out, skip testing it for 30 days.

That's a cool idea, I'll open another issue.

It appears that the distribution of tracking domains follows some kind of logarithmic curve. The majority of trackers are discovered on the top sites, and then there is a long tail. A few things complicate this picture, such as region-specific tracking domains.
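One way to see where the returns diminish is to count distinct tracking domains after each visited site, taken in rank order. The sketch below assumes per-site scan results are available as collections of tracker domains. `discovery_curve` is a hypothetical helper and the sample data is made up purely for illustration.

```python
def discovery_curve(trackers_per_site):
    """Cumulative count of distinct tracker domains after each visited site.

    `trackers_per_site` is an iterable, in visit order, of the tracker
    domains observed on each site.
    """
    seen = set()
    curve = []
    for trackers in trackers_per_site:
        seen.update(trackers)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    # Made-up results for five sites, in rank order.
    results = [
        {"tracker-a.example", "tracker-b.example"},
        {"tracker-a.example", "tracker-c.example"},
        set(),
        {"tracker-b.example"},
        {"tracker-d.example"},
    ]
    print(discovery_curve(results))
```

A curve that flattens early and then creeps upward would match the long-tail description above. Region-specific trackers would show up as late jumps.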

ghostwords commented 7 months ago

Here is what this project's impact looks like in terms of blocked trackers and site errors:

[Image: two charts spanning January 15th 2024 to April 15th 2024. One plots the number of blocked trackers, the other the percentage of sites where Badger Sett experienced some kind of error. The average number of blocked trackers goes up from under 600 to over 650, while the error rate drops from over 20% to under 10%.]

The lines are seven-day moving averages. This is across all browsers; Firefox alone has higher blocked tracker averages, as we can visit more websites successfully in a 24-hour period with Firefox. Conversely, Edge brings down blocked tracker averages.
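For reference, a seven-day moving average can be computed as in the sketch below. It assumes one value per day and a trailing window, where each point averages that day and the six before it. Whether the charts use a trailing or a centered window is not stated, and `moving_average` is an illustrative name.

```python
def moving_average(values, window=7):
    """Trailing moving average; returns len(values) - window + 1 points."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


if __name__ == "__main__":
    daily_error_rates = [22.0, 21.5, 23.0, 20.0, 19.5, 18.0, 17.5, 16.0]  # made-up values
    print(moving_average(daily_error_rates))
```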