Though I have to add that I am against making pre-fetching the IP ranges list the default behaviour.
The currently implemented DNS-based approach is superior because it relies on DNS caching (including eviction). Only the first request may be slow; all subsequent requests utilise the cache. The somewhat increased latency of the first request is not a big deal for web crawlers, and it does not affect human visitors.
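
To illustrate, here is a simplified sketch of that kind of check (not the gem's actual code; the helper name and accepted suffixes are mine): reverse-resolve the visitor's IP, verify the hostname belongs to Google, then forward-resolve it to confirm it maps back to the same IP. The caching happens in the local resolver and DNS infrastructure.

```ruby
require 'resolv'

# Simplified illustration of DNS-based bot verification; not Legitbot's API.
def verified_googlebot?(ip)
  name = Resolv.getname(ip) # reverse (PTR) lookup; caching happens in the DNS layer
  return false unless name.end_with?('.googlebot.com', '.google.com')

  # Forward-confirm: the claimed hostname must resolve back to the same IP.
  Resolv.getaddresses(name).include?(ip)
rescue Resolv::ResolvError
  false
end
```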
On the contrary, if someone wants to fetch IP ranges from an external resource, they would also be responsible for refreshing this list regularly using `reload_ips`.
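
For example, a minimal refresh sketch; only `reload_ips` itself comes from this discussion, while the `Legitbot::Google` receiver and the daily cadence are illustrative assumptions, and a real application would likely use its own scheduler (cron, clockwork, etc.) instead of a bare thread:

```ruby
# Hypothetical refresh loop: receiver and cadence are assumptions
# for illustration, not part of the gem's documented API.
Thread.new do
  loop do
    sleep 24 * 60 * 60 # refresh once a day
    Legitbot::Google.reload_ips
  end
end
```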
Google publishes the current IP ranges for Googlebot: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot#automatic

Of course, Legitbot could fetch them with `fetch:url`, similarly to how it works for Ahrefs: https://github.com/alaz/legitbot/blob/e5c8923cc9c00459b426a1c4a1f89da87875b5d3/lib/legitbot/ahrefs.rb#L6-L7
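
For reference, the Googlebot list published at the URL above is JSON of roughly this shape (abridged; field names as documented by Google, values illustrative):

```json
{
  "creationTime": "2023-11-01T23:00:00.000000",
  "prefixes": [
    { "ipv4Prefix": "66.249.64.0/27" },
    { "ipv6Prefix": "2001:4860:4801:10::/64" }
  ]
}
```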
But we don't know the cadence of changes to this list, and `fetch:url` updates the Legitbot sources; even with the automatic detection in place, the change would have to wait until the next release.

In order to dynamically fetch Googlebot IP ranges from their published JSON, the `ip_ranges` block can be used, similarly to how it works for Facebook: https://github.com/alaz/legitbot/blob/e5c8923cc9c00459b426a1c4a1f89da87875b5d3/lib/legitbot/facebook.rb#L10-L19
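
A hedged sketch of what that could look like for Googlebot; the class name is hypothetical, the block form mirrors the Facebook matcher linked above, and the block is assumed to return an array of CIDR strings:

```ruby
require 'json'
require 'open-uri'

module Legitbot
  # Hypothetical matcher, not the gem's shipped Google class: the
  # ip_ranges block fetches the published JSON and extracts prefixes.
  class GoogleDynamic < BotMatch
    ip_ranges do
      url = 'https://developers.google.com/search/apis/ipranges/googlebot.json'
      JSON.parse(URI.parse(url).open.read)['prefixes']
          .flat_map { |prefix| prefix.values_at('ipv4Prefix', 'ipv6Prefix') }
          .compact
    end
  end
end
```

Combined with `reload_ips`, such a list could then be refreshed on the schedule the application chooses, as discussed above.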
We probably need `fetch:url` factored out from the Rubocop cop sources though, so that it is easily accessible.