arcjet / well-known-bots

List of well-known bots and user-agent patterns to detect them
MIT License
Analyze traffic to see if Google ever sends Google-Extended #13

Open davidmytton opened 2 months ago

davidmytton commented 2 months ago

Google's AI crawler token is Google-Extended. From Google's documentation:

Google-Extended is a standalone product token that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs, including future generations of models that power those products. Google-Extended does not impact a site's inclusion or ranking in Google Search.

User agent token Google-Extended

See the full list at https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers

blaine-arcjet commented 2 months ago

It looks like ai.robots.txt includes this, so it should also be resolved by #3.

blaine-arcjet commented 2 months ago

The Google docs state:

Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.

This means it can only be controlled through robots.txt, similar to Applebot-Extended; it can't be detected from the request's user agent string.

@davidmytton Does this mean we should categorize all of the Google crawler products as AI?
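To illustrate the distinction, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the helper names are assumptions made up for this example, and the Googlebot string is only a representative user agent. It is not how this repo detects bots; it only shows that the token matters to a robots.txt parser while never appearing in a request header.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google-Extended, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

# Representative Googlebot request header (assumed for illustration).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def allowed_by_robots(token: str, url: str = "https://example.com/page") -> bool:
    """Return True if robots.txt lets the given product token fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(token, url)


def ua_contains_token(user_agent: str, token: str) -> bool:
    """Naive case-insensitive substring check against a request's user agent."""
    return token.lower() in user_agent.lower()


# The token is honored as a robots.txt control...
google_extended_allowed = allowed_by_robots("Google-Extended")
googlebot_allowed = allowed_by_robots("Googlebot")

# ...but a pattern for it never matches an actual crawl request.
token_seen_in_request = ua_contains_token(GOOGLEBOT_UA, "Google-Extended")

print(google_extended_allowed, googlebot_allowed, token_seen_in_request)
```

Under these assumptions, a user-agent pattern for Google-Extended in this list would never match real traffic, which is why the categorization question below comes up.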

davidmytton commented 2 months ago

> @davidmytton Does this mean we should categorize all of the Google crawler products as AI?

Started a discussion: https://github.com/arcjet/well-known-bots/pull/16#discussion_r1762097854