alaz / legitbot

🤔 Is this Web request from a real search engine🕷 or from an impersonating agent 🕵️‍♀️?

Fetch Googlebot IP ranges from their published JSON resource #142

Closed alaz closed 3 months ago

alaz commented 4 months ago

Google publishes the current IP ranges for Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot#automatic

Of course, Legitbot could fetch them with fetch:url, similar to how it works for Ahrefs:

https://github.com/alaz/legitbot/blob/e5c8923cc9c00459b426a1c4a1f89da87875b5d3/lib/legitbot/ahrefs.rb#L6-L7

But we don't know how often this list changes, and fetch:url updates the Legitbot sources themselves. Even with automatic detection in place, a change would have to wait until the next release.

To fetch Googlebot IP ranges dynamically from the published JSON, an ip_ranges block can be used, similar to how it works for Facebook:

https://github.com/alaz/legitbot/blob/e5c8923cc9c00459b426a1c4a1f89da87875b5d3/lib/legitbot/facebook.rb#L10-L19
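For illustration, here is a minimal sketch of what such a block might do internally, written in plain Ruby and independent of Legitbot's own API. The helper names (`parse_googlebot_ranges`, `fetch_googlebot_ranges`, `in_ranges?`) are hypothetical, the default URL is an assumption that should be checked against the Google page linked above, and the sample JSON is made-up data that only mirrors the `prefixes` / `ipv4Prefix` / `ipv6Prefix` layout of the published file.

```ruby
require 'json'
require 'ipaddr'
require 'net/http'

# Turn the JSON document into an array of IPAddr ranges.
# Each entry in "prefixes" carries either an ipv4Prefix or an ipv6Prefix key.
def parse_googlebot_ranges(json_text)
  JSON.parse(json_text).fetch('prefixes').map do |prefix|
    IPAddr.new(prefix['ipv4Prefix'] || prefix['ipv6Prefix'])
  end
end

# Download and parse the list. The default URL is an assumption, not taken
# from the Legitbot sources; verify it against Google's documentation.
def fetch_googlebot_ranges(url = 'https://developers.google.com/static/search/apis/ipranges/googlebot.json')
  parse_googlebot_ranges(Net::HTTP.get(URI(url)))
end

# True when the address falls inside any of the ranges.
def in_ranges?(ip, ranges)
  addr = IPAddr.new(ip)
  ranges.any? { |range| range.include?(addr) }
end

# Offline demonstration with made-up sample data in the published layout.
sample = <<~JSON
  {
    "creationTime": "2024-01-01T00:00:00.000000",
    "prefixes": [
      { "ipv6Prefix": "2001:4860:4801:10::/64" },
      { "ipv4Prefix": "66.249.64.0/27" }
    ]
  }
JSON

ranges = parse_googlebot_ranges(sample)
puts in_ranges?('66.249.64.5', ranges)  # => true
puts in_ranges?('203.0.113.7', ranges)  # => false
```

Inside an ip_ranges block, the result of something like `fetch_googlebot_ranges` would presumably be returned as the list of CIDR strings or ranges; the exact return type the block expects should be checked against facebook.rb.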

We probably need fetch:url factored out of the Rubocop cop sources, though, so that it is easily accessible.

alaz commented 4 months ago

Though I have to add that I am against making pre-fetching of the IP range list the default behaviour.

The currently implemented DNS-based approach is superior because it relies on DNS caching (including eviction). Only the first request may be slow; all subsequent requests will use the cache. The somewhat increased latency of that first request is not a big deal for web crawlers, and it does not affect human visitors.
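For readers unfamiliar with the DNS-based approach, the sketch below shows the general reverse-then-forward lookup technique using Ruby's standard `Resolv` library. It is not Legitbot's actual implementation: the method name `dns_verified?`, the default domain list, and the injectable `resolver:` parameter are assumptions made for illustration. Caching is left to the system or upstream resolver, which is the point made above.

```ruby
require 'resolv'
require 'ipaddr'

# Reverse-then-forward DNS check (a sketch, not Legitbot's code).
# 1. Reverse lookup: the IP must resolve to a host name under a trusted domain.
# 2. Forward lookup: that host name must resolve back to the same IP.
# `resolver` is any object responding to getname and getaddresses;
# the Resolv module itself satisfies this.
def dns_verified?(ip, domains: %w[googlebot.com google.com], resolver: Resolv)
  name = resolver.getname(ip)
  return false unless domains.any? { |domain| name.end_with?(".#{domain}") }

  # Compare as IPAddr objects so differently formatted IPv6 strings still match.
  target = IPAddr.new(ip)
  resolver.getaddresses(name).any? { |address| IPAddr.new(address) == target }
rescue Resolv::ResolvError
  # No PTR record (or lookup failure): treat the address as unverified.
  false
end

# Usage (performs real DNS queries, so it is left commented out here):
#   dns_verified?('66.249.66.1')
```

Only the first lookup for a given address pays the DNS round trip; repeated checks are answered from whatever cache sits in front of the resolver, and entries expire according to their TTL.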

In contrast, if someone wants to fetch IP ranges from an external resource, they would also be responsible for refreshing the list regularly using reload_ips.
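To make that responsibility concrete, here is a rough sketch of a time-based refresh wrapper. The `RefreshingRanges` class is hypothetical and does not exist in Legitbot; in a real application the loader block would be replaced by whatever performs the reload (for example a call to reload_ips on a schedule). The injectable clock exists only to make the behaviour easy to test.

```ruby
# Hypothetical TTL wrapper: reloads the list when it is older than `ttl` seconds.
class RefreshingRanges
  def initialize(ttl:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }, &loader)
    @ttl = ttl
    @clock = clock
    @loader = loader
    @mutex = Mutex.new
    @ranges = nil
    @loaded_at = nil
  end

  # Returns the cached list, reloading it first if it has expired.
  def ranges
    @mutex.synchronize do
      reload! if stale?
      @ranges
    end
  end

  private

  # Calls the loader block and records when the list was loaded.
  def reload!
    @ranges = @loader.call
    @loaded_at = @clock.call
  end

  # True before the first load and once the TTL has elapsed.
  def stale?
    @loaded_at.nil? || @clock.call - @loaded_at >= @ttl
  end
end

# Example: refresh at most once a day; the block stands in for the real fetch.
cache = RefreshingRanges.new(ttl: 24 * 60 * 60) { ['66.249.64.0/27'] }
puts cache.ranges.inspect
```

Note that a failed reload raises out of `ranges` in this sketch; a production version would probably keep serving the previous list and retry later.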