Sec-CH-Bot for self declared 'good bots' that want to avoid analytics pollution

colinbendell commented 4 years ago

Many organizations run headless browsers for periodic tests in production. This is done for many purposes, including:

availability testing
integration tests & canary tests
synthetic performance monitoring (e.g. Page Speed Insights, SpeedCurve, and HTTPArchive)

Since these services typically use WebDriver or Puppeteer to instantiate the headless browser and crawl web pages, any beacons or analytics in production will be polluted by these crawls. Currently, Google Analytics, Marketo, and other marketing analytics engines create sessions for all of these synthetic runs. For big companies this traffic can easily be ignored as a rounding error. However, for many smaller websites, it can disproportionately skew the analytics.

For these known 'good' bots there should be a client hint signalling that the request does come from a 'bot', so that analytics and business metrics can ignore it or classify it separately. Additionally, it should be sent by default in these situations and shouldn't require any feature policy to incrementally reveal this attribute.

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
            HeadlessChrome/71.1.2222.33 Safari/537.36
Sec-CH-UA: "Chrome"; v="71"
Sec-CH-Mobile: ?0
Sec-CH-Bot: ?1
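
As a rough sketch of how an analytics collector might consume such a header: the snippet below assumes the proposed Sec-CH-Bot header carries a structured-field boolean (?1 or ?0), as in the example above. The header is only a proposal in this issue, and the function names and the 'synthetic'/'organic' labels are made up for illustration.

```python
def parse_sf_boolean(value):
    """Parse a structured-field boolean ("?1" / "?0").

    Returns None when the header is absent or malformed. Parameters
    (e.g. "?1;foo=bar") are not handled in this simplified sketch.
    """
    if value is None:
        return None
    token = value.strip()
    if token == "?1":
        return True
    if token == "?0":
        return False
    return None


def classify_hit(headers):
    """Return 'synthetic' for self-declared bots, 'organic' otherwise."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # "Sec-CH-Bot" is the header proposed in this issue, not a shipped hint.
    is_bot = parse_sf_boolean(lowered.get("sec-ch-bot"))
    return "synthetic" if is_bot else "organic"
```

This is best effort only: a missing or malformed header falls back to 'organic', and nothing here detects bots that choose not to declare themselves.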

amtunlimited commented 4 years ago

Would definitely be an interesting use-case for client hints, although I'm not sure if a new hint is a good way to go. This set of hints is really just about having feature parity with the User-Agent header.

One problem is that you can't really trust these "I'm a good bot" type headers, because the bad bots will just lie, the same way browsers lie about the User-Agent header now.

What would probably happen is that "HeadlessChrome" would show up in the Sec-CH-UA list, so it might look like this:

Sec-CH-UA: "Chromium"; v="99", "HeadlessChrome"; v="99"
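
If that is how it plays out, a consumer could look for a headless brand in the list. A minimal sketch, assuming the brand list has the "brand"; v="version" shape shown above; the regular expression is a simplification rather than a full structured-field parser, and the function names are hypothetical.

```python
import re

# Matches one `"brand"; v="version"` member of the Sec-CH-UA list.
# Simplified: escaped characters are kept as-is, other parameters are ignored.
_BRAND_RE = re.compile(r'"((?:[^"\\]|\\.)*)"\s*;\s*v="((?:[^"\\]|\\.)*)"')


def parse_brands(sec_ch_ua):
    """Return a {brand: version} dict from a Sec-CH-UA header value."""
    return {brand: version for brand, version in _BRAND_RE.findall(sec_ch_ua)}


def looks_headless(sec_ch_ua):
    """True if any advertised brand name contains 'headless'."""
    return any("headless" in brand.lower() for brand in parse_brands(sec_ch_ua))
```

Note that this still relies on matching brand names, so it would need updating as new headless tools appear.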

quasicomputational commented 4 years ago

Good bots also have some kind of contact information in the UA string, typically a mail address or a HTTP URL (or both). It'd also be nice to be able to encode that in a principled way, since Client Hints are structured data anyway.

amtunlimited commented 4 years ago

Good bots also have some kind of contact information in the UA string, typically a mail address or a HTTP URL (or both).

Huh, TIL. Do you have an example?

colinbendell commented 4 years ago

The principle here is to have a structured method for good-bots to follow rather than expecting an analytics service to enumerate all the ever-evolving possibilities. Yesterday you would key off of googlebot+phantomjs; now we add headless-chrome and wpt. What about tomorrow? This is, of course, all best effort and not a solution for 'bad bot' detection.

On Mon, Aug 17, 2020 at 9:21 AM Aaron Tagliaboschi notifications@github.com wrote:

Good bots also have some kind of contact information in the UA string, typically a mail address or a HTTP URL (or both).

Huh, TIL. Do you have an example?


quasicomputational commented 4 years ago

Huh, TIL. Do you have an example?

Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Bingbot: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

It's a politeness thing, so that you can reach the operators if the bot's misbehaving or you have queries. Like @colinbendell said, this would be best effort and not intended for detection or anything like that; obviously, bad bots can lie.
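
For illustration, this is roughly what pulling the contact URL out of such a UA string looks like today. It is only a sketch: the "+URL" form is a convention rather than a standard, and the function name is made up.

```python
import re

# By convention, polite crawlers put "+<url>" inside the UA comment.
_CONTACT_RE = re.compile(r"\+(https?://[^\s;)]+)")


def bot_contact_url(user_agent):
    """Return the operator URL advertised in a UA string, or None."""
    match = _CONTACT_RE.search(user_agent)
    return match.group(1) if match else None
```

Having to scrape it out with a regular expression like this is the reason a structured field would be nicer.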

amtunlimited commented 4 years ago

Interesting. The hard part is that, in order for a hint to be sent, it has to be requested, which means a site/server would have to know to ask for the contact info.
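
To make the opt-in concrete: a server asks for hints by listing them in an Accept-CH response header. The sketch below builds that header value; Accept-CH is the real opt-in header, but "Sec-CH-Bot-Contact" is a purely hypothetical hint name used for illustration, as is the helper function.

```python
def accept_ch_header(hints):
    """Build a (name, value) pair for the Accept-CH response header."""
    # De-duplicate case-insensitively while preserving request order.
    seen = set()
    unique = []
    for hint in hints:
        key = hint.lower()
        if key not in seen:
            seen.add(key)
            unique.append(hint)
    return ("Accept-CH", ", ".join(unique))


# "Sec-CH-Bot-Contact" does not exist; it stands in for a contact-info hint.
name, value = accept_ch_header(["Sec-CH-UA-Platform", "Sec-CH-Bot-Contact"])
```

A site that never sends this header would never receive the contact info, which is the difficulty described above.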

amtunlimited commented 4 years ago

While I don't disagree that this would be super helpful in a post-dump-everything-in-the-user-agent-header world, I don't know that Client Hints are the right mechanism. This would seem like its own header.

WICG / ua-client-hints

Sec-CH-Bot for self declared 'good bots' that want to avoid analytics pollution #119