HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0

Avoid hitting anti-bot pages #202

Closed rviscomi closed 4 years ago

rviscomi commented 4 years ago

Per https://discuss.httparchive.org/t/chapter-6-fonts/1761/5?u=rviscomi:

There’s another issue with the “most popular typefaces” btw., possibly affecting other sections, too. Some requests attributed to Open Sans are not actually requests related to the actual page they had been attributed to, but due to Cloudflare’s anti-bot protection.

For example: http://www.madereros.com/ doesn’t use any web font at all, even though according to the Almanac data it does, but instead your crawler saw Cloudflare’s anti-bot page here (which uses Open Sans).

  1. How often do HA tests get blocked by Cloudflare? We should be able to detect this.
  2. Is it possible to get safelisted by Cloudflare? (@pmeenan @paulcalvano do either of you know?)
  3. Are we hitting any other anti-bot protections?

pmeenan commented 4 years ago

If you have a few test IDs for pages that were intercepted, I can see about adding detection. We should treat those as failures and retry (and then exclude them if they continue to fail).

If we weren't also testing from GCE, it might be possible to get our IPs whitelisted, but as long as we are testing from the cloud, any whitelist would also open the door for actual bots.
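
For illustration, the detect, retry, then exclude flow described above could look roughly like the sketch below. This is not the HTTP Archive or WebPageTest implementation. The `Response` type, the function names, and the `fetch` callable are hypothetical, and the detection markers (a `Server: cloudflare` header, a 403 or 503 status, and an interstitial-style page title) are assumptions about what a Cloudflare challenge page tends to look like; they would need checking against real intercepted tests.

```python
from typing import Callable, Mapping, NamedTuple, Optional


class Response(NamedTuple):
    """Hypothetical minimal view of a crawled page."""
    status: int
    headers: Mapping[str, str]
    body: str


# Assumed title prefixes of anti-bot interstitials (not verified against real tests).
CHALLENGE_TITLES = ("<title>just a moment", "<title>attention required")


def looks_like_cloudflare_challenge(resp: Response) -> bool:
    """Heuristic: Cloudflare-served, blocked status, interstitial title."""
    headers = {k.lower(): v.lower() for k, v in resp.headers.items()}
    served_by_cloudflare = "cloudflare" in headers.get("server", "")
    blocked_status = resp.status in (403, 503)
    body = resp.body.lower()
    has_challenge_title = any(title in body for title in CHALLENGE_TITLES)
    return served_by_cloudflare and blocked_status and has_challenge_title


def crawl_with_retry(
    url: str,
    fetch: Callable[[str], Response],
    max_attempts: int = 3,
) -> Optional[Response]:
    """Treat an intercepted page as a failure and retry.

    Returns None when every attempt was intercepted, which signals that
    the page should be excluded from the dataset.
    """
    for _ in range(max_attempts):
        resp = fetch(url)
        if not looks_like_cloudflare_challenge(resp):
            return resp
    return None
```

Requiring all three signals together keeps ordinary sites that merely sit behind Cloudflare from being flagged, at the cost of missing challenge pages that use a different status or title.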

rviscomi commented 4 years ago

Unable to reproduce this issue.