alaz / legitbot

🤔 Is this Web request from a real search engine🕷 or from an impersonating agent 🕵️‍♀️?

Add missing Google crawlers #85

Closed alaz closed 1 year ago

alaz commented 2 years ago

List of the crawlers

inspire22 commented 1 year ago

I'm seeing Googlebot blocked quite a bit in my rack-attack logs when using legitbot. Is it probably because some IPs are missing? For example: 95.216.227.158, 95.216.33.117

Is it possible to automate the process of adding new IPs using the host command like they suggest here? https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot

I actually had more IPs in there, but after calling 'host' on them I realized they were fake; the first ones I tested just happened to be real Googlebot.

alaz commented 1 year ago

@inspire22 Legitbot follows the exact verification procedure you linked to, only programmatically. Did you try to follow the steps? These IPs do not pass for me:

$ host 95.216.227.158
158.227.216.95.in-addr.arpa domain name pointer crawl-95-216-227-158.googlebot.com.
$ host crawl-95-216-227-158.googlebot.com
Host crawl-95-216-227-158.googlebot.com not found: 3(NXDOMAIN)

$ host 95.216.33.117
117.33.216.95.in-addr.arpa domain name pointer crawl-95-216-33-117.googlebot.com.
$ host crawl-95-216-33-117.googlebot.com
Host crawl-95-216-33-117.googlebot.com not found: 3(NXDOMAIN)

inspire22 commented 1 year ago

Oops, you're right, thanks! Strange they would match the first step and not the second.

I'd mistaken your TODO to add crawlers for adding crawler IPs, which is why I jumped on here. My bad and apologies :)

alaz commented 1 year ago

By the way, I don't think these IPs belong to Google. Both of them are owned by Hetzner (a well-known European hosting provider):

$ whois 95.216.227.158
…
route:          95.216.0.0/16
org:            ORG-HOA1-RIPE
descr:          HETZNER-DC
…

$ whois 95.216.33.117
…
route:          95.216.0.0/16
org:            ORG-HOA1-RIPE
descr:          HETZNER-DC

> Strange they would match the first step and not the second.

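Someone managed to convince Hetzner to create these reverse DNS records (I am surprised). Reverse (PTR) records are set by whoever controls the IP block, which is why that step can be spoofed. Faking the corresponding forward records is close to impossible, as Google itself controls the googlebot.com zone.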
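
The two-step check discussed in this thread (reverse lookup, then forward confirmation) can be sketched roughly as follows. This is a minimal illustration in Python, not legitbot's actual code: the function names, the `GOOGLE_SUFFIXES` tuple, and the injectable `reverse`/`forward` parameters are all assumptions made for the example, and the suffix list should be checked against Google's current documentation.

```python
import ipaddress
import socket

# Assumed list of domains for Google crawler hostnames (verify against Google's docs).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def system_reverse(ip):
    """Reverse (PTR) lookup via the system resolver, like `host <ip>`."""
    return socket.gethostbyaddr(ip)[0]


def system_forward(hostname):
    """Forward (A/AAAA) lookup via the system resolver, like `host <name>`."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_verified_crawler(ip, suffixes=GOOGLE_SUFFIXES,
                        reverse=system_reverse, forward=system_forward):
    """Return True only if the PTR name is in an allowed domain AND
    that name resolves back to the same IP (forward confirmation)."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror / socket.gaierror: no PTR record
        return False
    if not hostname.endswith(tuple(suffixes)):
        return False
    try:
        addresses = forward(hostname)
    except OSError:  # e.g. NXDOMAIN, as in the transcripts above
        return False
    wanted = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(a) == wanted for a in addresses)


if __name__ == "__main__":
    # Stand-in resolvers reproducing the outcome seen in this thread:
    # the PTR record looks like Googlebot, but the forward lookup fails.
    def fake_reverse(ip):
        return "crawl-95-216-227-158.googlebot.com."

    def fake_forward(hostname):
        raise socket.gaierror("NXDOMAIN")

    print(is_verified_crawler("95.216.227.158",
                              reverse=fake_reverse, forward=fake_forward))  # False
```

The resolvers are passed in as parameters only so the logic can be exercised without network access; in real use the defaults call the system resolver. The important point is that the second step fails for the Hetzner addresses, because only the owner of the googlebot.com zone can publish a matching forward record.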