WICG / ua-client-hints

Wouldn't it be nice if `User-Agent` was a (set of) client hints?
https://wicg.github.io/ua-client-hints/

Sec-CH-UA randomization vs. Sec-CH-UA-Engine #52

Closed scottlow closed 4 years ago

scottlow commented 4 years ago

@yoavweiss and I had an offline discussion today where we agreed that we needed additional community feedback on how browser equivalence classes are defined. During this conversation, we identified two potential paths forward, each with their own trade-offs:

Sec-CH-UA randomization

This is the way that the spec is authored currently and involves GREASE-ing the Sec-CH-UA set to ensure that sites cannot create block lists for unknown tokens in the set. By itself, however, it does not address the commonly seen case where sites create allow lists of known per-browser tokens to enable certain features. In order to combat this, the current proposal is to have browsers pretend to be other browsers in their equivalence-set by sending other browsers' Sec-CH-UA sets in place of their own for a small number of navigations.
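As a rough illustration of what GREASE-ing asks of sites, the sketch below parses a Sec-CH-UA value into a set of brands and tests for membership instead of comparing the whole string. The regex is a simplification rather than a full structured-header parser, and the `"Not A Brand"` entry is a made-up placeholder for an unknown token.

```python
import re

# One list member, e.g. "Chromium"; v="82" (simplified: no escape handling).
_MEMBER = re.compile(r'"([^"]*)"\s*;\s*v="([^"]*)"')

def parse_brands(header: str) -> dict[str, str]:
    """Return {brand: version} for every member of a Sec-CH-UA value."""
    return {m.group(1): m.group(2) for m in _MEMBER.finditer(header)}

def has_brand(header: str, brand: str) -> bool:
    """Membership test: unknown or GREASE entries are simply ignored."""
    return brand in parse_brands(header)

# Hypothetical header value with one unknown (GREASE) member.
greased = '"Not A Brand"; v="99", "Chromium"; v="82", "Chrome"; v="82"'
```

A site written this way keeps working when unknown tokens appear or the order changes, but nothing stops it from building an allow list on top of `has_brand`, which is the gap described above.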

Advantages

Disadvantages

Sec-CH-UA-Engine

This is a proposal that has been mentioned in various issues (#4, #7, #21, #29) that involves creating a new Sec-CH-UA-Engine hint that would describe a browser's underlying engine and would be sent in place of the Sec-CH-UA hint by default. The idea is that this would allow developers to target browser equivalence classes by default, while still allowing them to target individual browsers (perhaps with some penalization due to Privacy Budget) by using the Accept-CH header to request a per-browser token using the Sec-CH-UA hint.
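A minimal sketch of the flow this proposal describes, seen from the browser side. Sec-CH-UA-Engine is only a proposed hint, and the function names and header values here are illustrative assumptions, not spec text.

```python
def parse_accept_ch(value: str) -> set[str]:
    """Split an Accept-CH response header into the set of requested hint tokens."""
    return {token.strip() for token in value.split(",") if token.strip()}

def request_headers(opted_in_hints: set[str]) -> dict[str, str]:
    """Headers a hypothetical Chromium-based browser would send under the proposal."""
    # The engine (equivalence class) is sent by default on every request.
    headers = {"Sec-CH-UA-Engine": '"Chromium"; v="82"'}
    # The per-browser brand is sent only after the site opted in via Accept-CH: UA.
    if "UA" in opted_in_hints:
        headers["Sec-CH-UA"] = '"Chrome"; v="82"'
    return headers

first_visit = request_headers(set())
after_opt_in = request_headers(parse_accept_ch("UA"))
```

On the first request the site sees only the engine; the brand shows up on later requests once the site has asked for it.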

Advantages

Disadvantages

Unknowns

In short, we'd appreciate community feedback on this issue to help drive the best outcome for the web. If you have feedback, data, or other suggestions that could help shape the future of this feature, please feel free to join the discussion!

jridgewell commented 4 years ago

Another disadvantage to randomness is the increased Vary variance. Low variance is desperately needed for caching differentially served responses at edge networks.
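To make the caching concern concrete, here is a minimal sketch of how a cache that honors `Vary: Sec-CH-UA` would key responses. The `cache_key` helper, the URL, and the brand values are hypothetical; real edge caches normalize keys in their own ways.

```python
from itertools import permutations

def cache_key(url: str, vary: list[str], request_headers: dict[str, str]) -> tuple:
    """Build a cache key from the URL plus the value of every header named in Vary."""
    return (url,) + tuple(request_headers.get(name, "") for name in vary)

# Hypothetical brand members; a randomizing browser may send them in any order.
members = ['"Chrome"; v="82"', '"Chromium"; v="82"', '"Foo"; v="1"']

keys = {
    cache_key("/app.js", ["Sec-CH-UA"], {"Sec-CH-UA": ", ".join(order)})
    for order in permutations(members)
}

# One URL and one browser version, but one cache entry per ordering.
distinct_entries = len(keys)
```

With three members, unconstrained ordering alone yields 3! = 6 entries for a single resource.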

Sending Engine by default would be excellent. But I imagine this will allow devs to target specific browsers, and possibly force other browsers to start lying to get better compatibility.

Would Engine include version information as well? Or is it limited to just Blink/Gecko and v8/SpiderMonkey?

yoavweiss commented 4 years ago

> Another disadvantage to randomness is the increased Vary variance. Low variance is desperately needed for caching differentially served responses at edge networks.

As you suggested on that issue, limiting the randomness to be fixed per browser version (so that each version changes the string, but all in all, we have 1 or 2 buckets per browser version) would solve/mitigate that issue.
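One way to read this suggestion is to derive the ordering from the browser version alone, so every client on a given version sends the identical string. The sketch below assumes that reading; `brand_list_for_version` and the hash-based seed are illustrative choices, not anything the spec defines.

```python
import hashlib
import random

def brand_list_for_version(brands: list[str], major_version: str) -> str:
    """Return a Sec-CH-UA style value whose member order depends only on the version."""
    # Seed from the version only, so all clients on this version agree on the order.
    digest = hashlib.sha256(major_version.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    ordered = sorted(brands)               # canonical starting order
    random.Random(seed).shuffle(ordered)   # deterministic shuffle for this version
    return ", ".join(f'"{brand}"; v="{major_version}"' for brand in ordered)

value_a = brand_list_for_version(["Chrome", "Chromium", "Foo"], "82")
value_b = brand_list_for_version(["Foo", "Chromium", "Chrome"], "82")
```

Because the seed ignores everything except the version, `value_a` and `value_b` are the same string, which gives a single cache bucket per browser version.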

> Sending Engine by default would be excellent. But I imagine this will allow devs to target specific browsers, and possibly force other browsers to start lying to get better compatibility.

That's indeed what I'm concerned with. Making the engine more prominent can lead to engine block/allow lists, which is arguably not better than browser block/allow lists.

> Would Engine include version information as well? Or is it limited to just Blink/Gecko and v8/SpiderMonkey?

That's what I envisioned. "Chromium"; v="82", etc.

yoavweiss commented 4 years ago

@miketaylr @foolip - I'd love to hear your opinions on the above

yoavweiss commented 4 years ago

Also @slightlyoff, who I hear has opinions on things

scottlow commented 4 years ago

I should also mention that there may be a third option here (as explained in #54) that discourages allow/block lists from a more technical perspective.

Steve51D commented 4 years ago

The randomization solution seems very strange to me. What is Sec-CH-UA actually for? If it's for telling the website what browser is being used, then partially randomizing it defeats its purpose.

> By itself, however, it does not address the commonly seen case where sites create allow lists of known per-browser tokens to enable certain features.

> • Can encourage creation of engine based allow/block lists and potentially favor more popular browser engines if sites start creating allow/block lists that restrict to certain browser equivalence classes

I'm not sold on these allow/block lists being a problem in the first place. If I'm a hobbyist or running a small business and decide that I want to create a new website but I'm only going to support the most popular browser (or engine) as a time/cost-saving measure then why is that anyone else's concern?

Yes, it goes somewhat against the open principles of the web but I don't see why trying to enforce that openness is the responsibility of the browser platform.

It seems to me that Sec-CH-UA should simply tell the site what the browser is. Whether you have:

Sec-CH-UA - "Chrome"; v="70", "Chromium"; v="70"

or

Sec-CH-UA - "Chrome"; v="70"
Sec-CH-UA-Engine - "Chromium"; v="70"

Seems like a moot point to me. If anything, I would say it makes more sense to reverse things to expose a smaller passive fingerprinting surface:

Sec-CH-UA - "Chromium"; v="70"
Sec-CH-UA-Browser - "Chrome"; v="70"

I.e. on the first request, only the engine is sent as this may well be all the site wants or needs to know. If the site wants to request the actual browser then it can do so.

foolip commented 4 years ago

Sec-CH-UA-Engine is an interesting proposal, and a benefit is that at least initially there would be no need for one engine to pretend to be another.

However, I think the dynamics in the longer term are going to be the same as for Sec-CH-UA. If this mechanism were in place and WebKit were launched today, it would likely pretend to be both KHTML and Gecko, just as it does in the UA string. Similarly, EdgeHTML would pretend to be Chromium.

Looking forward, any fork of Chromium would certainly claim to be Chromium. If a wholly new engine comes along, it would likely also have to pretend to be one of the existing engines to get off the ground.

In other words, I don't see this as avoiding the need to present a set of tokens and throwing in random tokens.

othermaciej commented 4 years ago

Safari even today has site-specific UA string quirks where we pretend to be Chrome or Firefox (because some sites have a UA string lockout or conditional feature but work fine with a fictional UA string). Under this new model, on those sites we'd probably need to claim to be Gecko or Chromium respectively in addition to Firefox or Chrome.

I think the randomized token list (as an incentive to search for inclusion of a tag) might help a little. Note, though, that it only helps if it's required, not totally optional as currently written (as I suggested in #60).

However sites might have a priority list of UA tokens to look for. For example, once they have decided a browser is Safari because that tag is present, they won't believe it's Firefox. In which case we'd have to (still) send completely fictional values, instead of half-true values.
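The failure mode described here can be sketched with a first-match-wins check. The `PRIORITY` order and `classify` function are hypothetical stand-ins for whatever a real site does.

```python
# Hypothetical site-side detection order: the first token found wins.
PRIORITY = ["Safari", "Chrome", "Firefox"]

def classify(brands: list[str]) -> str:
    """Return the first brand from PRIORITY that appears in the client's list."""
    for name in PRIORITY:
        if name in brands:
            return name
    return "unknown"

half_true = classify(["Safari", "Firefox"])  # honest brand plus a compatibility claim
fictional = classify(["Firefox"])            # honest brand dropped entirely
```

Here `half_true` comes out as `"Safari"`, so the added Firefox claim is never consulted; only the fully fictional list gets the browser treated as Firefox.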

Overall I am not sure there's any solution that would let browsers make compatibility claims successfully, while also always honestly reporting their actual brand.

Steve51D commented 4 years ago

@othermaciej That is interesting. I'm surprised there are sites both significant enough and lax enough to require that kind of work-around from a major browser vendor.

Ideally, I would think that such issues are more a problem for the site to resolve than for the browser. However, I can also see the problem from the other side. If you have users of your browser saying 'major site x doesn't work on this browser' then that's a problem for you as well.

I see this as a separate issue to the spec of the Sec-CH-UA header though. That header is either for telling the website what the browser is or it's not. If it's not (or it's randomised to the point of being useless for that) then what is its purpose?

othermaciej commented 4 years ago

@Steve51D there have been times when even some Google web properties require such a workaround (because there's a site or feature lockout but the site actually works fine with a different UA string).

It's even worse for WebKit-based browsers on other platforms, for example Epiphany.

I think if Sec-CH-UA exists in any form, it's almost certain that browsers will continue to need to send entirely fictional contents to some sites. I don't think Sec-CH-UA-Engine helps; we'll just need to lie about that too. I think randomization maybe helps a little, since it might enable only partial lying on some sites.

scottlow commented 4 years ago

> It's even worse for WebKit-based browsers on other platforms, for example Epiphany.

@othermaciej Are these types of issues mostly caused by the fact that Epiphany has its own unique UA token? If so, this is what I was thinking Sec-CH-UA-Engine could help solve. For example, if all that UA client hints exposed by default was a way for Epiphany to say "Hey, I'm just another WebKit-based browser!", my hypothesis is that they would encounter fewer compatibility issues than if we instead exposed by default a specific browser brand that sites could recognize/code against.

Of course, neither solution is perfect. As you say, lying would likely exist in both cases (unless we pursued a way to technically discourage the use of UA tokens in allow/deny lists such as #54), but it seems that exposing equivalence class targeting by default could help smaller browsers based on larger engines avoid compatibility pitfalls caused by exposing a per-browser brand identifier by default.

othermaciej commented 4 years ago

In the existing UA string, there are both Safari tokens and WebKit tokens. Many sites seem to check for the Safari token, not the WebKit one. I don't think this would change if the same info was refactored into two separate header fields.

scottlow commented 4 years ago

@othermaciej that makes sense since most browsers expose WebKit in their UA today. If the default value exposed through UA client hints was an equivalence class instead of a per-browser identifier, however, I do wonder if we'd see more developers use the default (i.e. target all WebKit-like browsers) instead of compiling these equivalence classes themselves by using Accept-CH: UA to get specific brand info and building allow lists from that.

jyasskin commented 4 years ago

I think there's some comprehensibility benefit here even if, for example, WebKit has to send Sec-CH-UA-Engine: Chromium, WebKit for compatibility, and sites can keep their by-browser block lists by opting into Accept-CH: UA.

Specifically, Sec-CH-UA is specified to be "unlike most Client Hints" right now, sending one thing by default and another after an Accept-CH: UA. Adding Sec-CH-UA-Engine would let us say that that's sent by default, and Sec-CH-UA is only sent (with the detailed browser information) after an Accept-CH: UA.

othermaciej commented 4 years ago

Exposing engine by default and exact browser on opt-in would probably be an improvement. Would this be sufficient for the stats-gathering purpose of browser knowledge?

(I think WebKit would continue to say WebKit by default and would say Chromium only for sites where it's needed to bypass a UA lock.)

othermaciej commented 4 years ago

Oh, here's a complication. For bug workarounds, sites often need a version, so Engine field might need versioning. But WebKit has only a frozen version in the current UA string. I guess we could duplicate the Safari version as the WebKit version, but this would be weird for WebKit clients that are not Safari, particularly on non-Apple ports.

jyasskin commented 4 years ago

I'm not an expert on what's needed for stats gathering, but my guess is that my suggestion is not sufficient for statistics.

I liked the suggestion in https://github.com/w3ctag/design-reviews/issues/467#issuecomment-583562415 to send detailed information on X% of requests. I'm not sure exactly what constraints we'd want on the choice of requests to minimize fingerprinting information: maybe roll the die each time a top-level page load starts with no storage?
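A rough sketch of the "roll the die" idea, under the assumption that the roll happens whenever a top-level load starts with no storage and that the outcome is then reused while storage exists. `DetailPolicy`, its fields, and the `percent` parameter (standing in for the unspecified X) are all hypothetical.

```python
import random

class DetailPolicy:
    """Decides per site whether detailed UA information is sent."""

    def __init__(self, percent: float, rng: random.Random):
        self.percent = percent                 # stand-in for the unspecified X%
        self.rng = rng
        self.decisions: dict[str, bool] = {}   # remembered outcome per site

    def send_detailed(self, site: str, has_storage: bool) -> bool:
        # Roll again whenever the top-level load starts with no storage for the site.
        if not has_storage or site not in self.decisions:
            self.decisions[site] = self.rng.random() * 100 < self.percent
        return self.decisions[site]

policy = DetailPolicy(percent=50, rng=random.Random(7))
first = policy.send_detailed("a.example", has_storage=False)
later = [policy.send_detailed("a.example", has_storage=True) for _ in range(5)]
```

Tying the outcome to storage means a site cannot simply re-request until it gets the detailed value, since repeated loads with storage present reuse the earlier roll.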

Sites like gs.statcounter.com could also (ask their embedder to let them) just send Accept-CH: UA and take the hit in measurements of active fingerprinting.

scottlow commented 4 years ago

> For bug workarounds, sites often need a version, so Engine field might need versioning

I believe #53 is tracking the splitting of version into its own CH. Would that opt-in work for sites that require versioning?

> Sites like gs.statcounter.com could also (ask their embedder to let them) just send Accept-CH: UA and take the hit in measurements of active fingerprinting.

That was my thinking as well.

Steve51D commented 4 years ago

> Would this be sufficient for the stats-gathering purpose of browser knowledge?

It probably depends who you ask.

Overall, it seems like a good compromise between the competing interests.

yoavweiss commented 4 years ago

I talked to @torgo yesterday, and he raised some good points which seem relevant to this thread. The analytics use case is an extremely important use-case for browser vendors, especially for what you may call "minority browsers". Being able to prove their own market share can influence developers to care enough to test their sites on those browsers, as well as directly impact various aspects of their business.

Beyond that, I'm concerned that over-indexing on "engine" would limit the future forkability of rendering engines, and enforce undesired conformity between different browsers that all use the same engine. As it stands, it's possible and likely for such different browsers to differ in the features they enable or disable, and they are also free to apply their own patches on the engine in the versions they ship. All that would be harder if server-side differential serving would assume different browsers with the same engine are all identical.

An approach where we have Sec-CH-UA represent a set which includes both the browser brand and its engine seems significantly safer from that respect.

scottlow commented 4 years ago

> The analytics use case is an extremely important use-case for browser vendors

I completely agree, which is why I have some reservations about browsers pretending to be other browsers some fixed percentage of the time. It seems like this could cause share measurement to become difficult, as it would be less clear how much share came from a particular browser itself versus another browser with more share pretending to be that browser.

If we were to pursue the Sec-CH-UA-Engine approach, I'd fully expect analytics companies and sites that wanted to track individual browser traffic to ask for the Sec-CH-UA header per @jyasskin's comment above.

> Beyond that, I'm concerned that over-indexing on "engine" would limit the future forkability of rendering engines, and enforce undesired conformity between different browsers that all use the same engine.

I'm not sure I follow the concern here. Given that there's no technical limitation preventing developers from creating allow/block lists, I expect that we will eventually arrive at a future where any new or forked engine will have to include the name of a more popular engine in its Sec-CH-UA-Engine hint for compatibility reasons. While unfortunate, such browsers could still differentiate themselves (thus providing sites with a signal to differentiate them based on the engine patches they apply, for example) by including a unique brand in their opt-in Sec-CH-UA hint.

The code that sites would need to write to detect browsers based on such engines wouldn't change; they'd still need to parse the unique brand token in either case. The only difference is that in the Sec-CH-UA-Engine world, there would be an additional hurdle for developers to cross before receiving a per-browser identifier, something that I believe would reduce the number of compatibility issues caused by allow lists created purely from per-browser identifiers.

> All that would be harder if server-side differential serving would assume different browsers with the same engine are all identical.

Is this assuming that a second round trip will be required on first navigate before sites will receive the client hints they've opted into? If so, then yes, I agree that the Sec-CH-UA-Engine approach causes some problems as first load scenarios would not be able to receive the unique brand information described above. That being said, I've seen a number of folks on various threads who are motivated to solve this limitation as it will have a substantial impact on scenarios beyond this one; it'd be great to explore solutions in this space!

> An approach where we have Sec-CH-UA represent a set which includes both the browser brand and its engine seems significantly safer from that respect.

TL;DR, my main concern with exposing both brand and engine in a single hint is that it does not move the needle very far from where we are today. While UA client hints in general will transform the UA string from a passive fingerprinting surface to an active one (something I am super supportive of), exposing both fields in a single hint by default doesn't seem like it will inspire developer change.

We could certainly provide guidance encouraging developers to detect the engine field in Sec-CH-UA by default; however, that feels like a repeat of providing per-browser identifiers in the UA string but encouraging developers to leverage feature detection as a best practice. Adding a hurdle (in the form of requiring developers to opt in to Sec-CH-UA) between sites and per-browser identifiers seems like it would encourage browser detection mechanisms with fewer compatibility risks while still allowing important use cases such as share tracking to function deterministically on the web.

yoavweiss commented 4 years ago

Thanks all for the ongoing discussion. After talking to folks and thinking about this some more, I think the best approach would be something along those lines:

yoavweiss commented 4 years ago

Closing as I didn't hear any objections to my conclusions. Please let me know if there's something more to discuss here and I'll reopen.