@bcyphers Random clicking would make the results non-reproducible.
That's true, but results are already non-reproducible due to the nature of online advertising. Check out the history of results.json: every day a different set of trackers is discovered on the same set of sites.
The goal isn't to have a perfectly reproducible measurement; it's to get a representative sample of the trackers that a normal user would encounter, and I think clicking around would contribute to that goal.
@bcyphers Instead of random clicking, look for the most visited areas of the website.
Rather than randomly clicking around on the same top X sites, I would pick a random set of sites each run. Say, if you're looking at the Majestic Million, the top entries will always be the same, and they usually don't have many third parties anyway (Wikipedia, Facebook, Google.com, etc.). And the first 10,000 or 100,000 may not give much coverage outside of the USA, Europe, China, and a few other major markets.
Randomness might be best, but I would exclude .mil, .gov, .edu, and .org, just because they won't have nearly as many trackers as sites trying to make a buck.
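A minimal sketch of that sampling idea, assuming a local copy of the Majestic Million CSV; the file name, the domain column index, and the excluded-TLD list here are assumptions, not part of the existing scan:

```python
import csv
import random

# TLDs this commenter suggests skipping (assumption, tweak as needed).
EXCLUDED_TLDS = (".mil", ".gov", ".edu", ".org")


def sample_sites(csv_path, n=1000, pool_size=100_000):
    """Pick n random domains from the top pool_size entries of the list,
    skipping TLDs that rarely carry trackers."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        # Majestic Million puts the domain in the third column (assumption).
        domains = [row[2] for _, row in zip(range(pool_size), reader)]
    candidates = [d for d in domains if not d.endswith(EXCLUDED_TLDS)]
    return random.sample(candidates, min(n, len(candidates)))


if __name__ == "__main__":
    for site in sample_sites("majestic_million.csv", n=20):
        print(site)
```

Sampling from a larger pool on each run trades reproducibility for coverage, which fits the "representative sample" goal discussed above.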
Currently, we only visit the home page of each domain in the scan. Many sites might have different kinds of trackers on different pages. We could modify the crawler to randomly "click" around on the different first-party links on each site it visits.
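A rough sketch of what that could look like, assuming the scan already drives a Selenium WebDriver; the helper names and the number of clicks are illustrative, not existing code:

```python
import random
from urllib.parse import urlparse

from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


def first_party_links(driver, origin_host):
    """Collect hrefs on the current page that stay on the first-party host."""
    links = []
    for el in driver.find_elements(By.TAG_NAME, "a"):
        try:
            href = el.get_attribute("href")
        except WebDriverException:
            continue  # stale or otherwise unreadable element
        if not href:
            continue
        host = urlparse(href).hostname or ""
        if host == origin_host or host.endswith("." + origin_host):
            links.append(href)
    return links


def random_walk(driver, start_url, clicks=3):
    """Visit start_url, then follow a few random first-party links."""
    driver.get(start_url)
    origin_host = urlparse(start_url).hostname
    for _ in range(clicks):
        links = first_party_links(driver, origin_host)
        if not links:
            break
        driver.get(random.choice(links))
```

Navigating directly to a randomly chosen first-party URL sidesteps the flakiness of simulating real clicks, while still exposing the crawler to trackers that only load on interior pages.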