EFForg / badger-sett

Automated training for Privacy Badger. Badger Sett automates browsers to visit websites to produce fresh Privacy Badger tracker data.
https://www.eff.org/badger-pretraining
MIT License

Crawl beyond the home page #25

Closed: bcyphers closed this issue 3 years ago

bcyphers commented 6 years ago

Currently, we only visit the home page of each domain in the scan. Many sites might have different kinds of trackers on different pages. We could modify the crawler to randomly "click" around on the different first-party links on each site it visits.
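The random-clicking idea could be sketched roughly like this: collect the `href`s on the page, keep only the ones that stay on the first-party host, and pick one at random to visit next. This is a minimal illustration, not the crawler's actual code; the function names and the "subdomains count as first-party" rule are assumptions.

```python
import random
from urllib.parse import urljoin, urlparse

def first_party_links(page_url, hrefs):
    """Return absolute URLs from hrefs that stay on page_url's host.

    Hypothetical helper: treats exact host matches and subdomains as
    first-party, which is a simplification (no public-suffix handling).
    """
    base_host = urlparse(page_url).hostname
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        host = urlparse(absolute).hostname
        # Skip mailto:, javascript:, etc. (no hostname) and third-party hosts.
        if host and (host == base_host or host.endswith("." + base_host)):
            links.append(absolute)
    return links

def pick_random_link(page_url, hrefs, rng=random):
    """Choose one first-party link to 'click' next, or None if there are none."""
    candidates = first_party_links(page_url, hrefs)
    return rng.choice(candidates) if candidates else None
```

In a real Selenium-based crawler, `hrefs` would come from something like gathering the `href` attributes of the page's anchor elements before choosing where to navigate.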

pipboy96 commented 5 years ago

@bcyphers Random clicking would make the results non-reproducible.

bcyphers commented 5 years ago

That's true, but results are already non-reproducible due to the nature of online advertising. Check out the history of results.json: every day a different set of trackers is discovered on the same set of sites.

The goal isn't to have a perfectly reproducible measurement; it's to get a representative sample of the trackers that a normal user would encounter, and I think clicking around would contribute to that goal.

pipboy96 commented 5 years ago

@bcyphers Instead of random clicking, look for the most visited areas of the website.

jawz101 commented 5 years ago

Rather than random clicking on the same top X sites, I would pick a random set of sites each run. Say, if you're looking at the Majestic Million, the top entries will always be the same and they usually don't have many third parties anyway (Wikipedia, Facebook, Google.com, etc.). And the first 10,000 or 100,000 may not give much coverage outside of the USA, Europe, China, and a few other major markets.

Randomness might be best, excluding .mil, .gov, .edu, and .org, just because they won't have nearly as many trackers as people trying to make a buck.
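The sample-random-sites-with-exclusions idea could look something like the sketch below. This is hypothetical (function name, seed parameter, and the exact exclusion list are illustrative, not part of the crawler); the seed is included only to make a given run repeatable if desired.

```python
import random

# TLDs the comment above suggests skipping, since they rarely monetize with trackers.
EXCLUDED_TLDS = (".mil", ".gov", ".edu", ".org")

def sample_sites(domains, n, seed=None):
    """Randomly sample up to n domains, skipping the excluded TLDs.

    Hypothetical helper: `domains` would come from a ranking list such as
    the Majestic Million; returns fewer than n if not enough are eligible.
    """
    eligible = [d for d in domains if not d.lower().endswith(EXCLUDED_TLDS)]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))
```

A fresh sample each run trades per-site consistency for broader coverage of the long tail, which fits the "representative sample" goal discussed earlier in the thread.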