WICG / ua-client-hints

Wouldn't it be nice if `User-Agent` was a (set of) client hints?
https://wicg.github.io/ua-client-hints/

Missing evidence for core problem addressed by specification #215

Open ronancremin opened 3 years ago

ronancremin commented 3 years ago

The UA Client Hints proposal appears to be founded on the premise that passive fingerprinting is widespread and harmful, and thus is worth solving:

"Version numbers, platform details, model information, etc. are all broadcast along with every request, and form the basis for fingerprinting schemes of all sorts." (https://wicg.github.io/ua-client-hints/, March 17th 2021)

The proposal suggests that "Rather than broadcasting this data to everyone, all the time, user agents can make reasonable decisions about how to respond to given sites' requests for more granular data, reducing the passive fingerprinting surface area exposed to the network."

"Form the basis" is a strong statement and implies an accepted fact. What is absent from the proposal is any evidence for how widespread or harmful this practice is. The EFF's Panopticlick is often cited as evidence, but it is mostly a demonstration of what is possible by combining passive and active fingerprinting; it says nothing about how widespread the practice is. The W3C's Mitigating Browser Fingerprinting in Web Specifications document cites numerous academic studies of fingerprinting on the web, but there is scarcely a mention of passive fingerprinting, nor any mention of how widespread it is.

Thus this proposal may be solving a theoretical rather than a real problem. Set against this is the fact that the proposal constitutes a significant change to web standards that have been in place for over two decades, and to the ecosystem that evolved on top of them. Furthermore, there has been significant disagreement over whether Client Hints in general constitute a worsening of user privacy on the web.

Browser makers are well placed to understand their users' needs and already have the ability to make reasonable decisions about what level of entropy to expose to websites via their User-Agent strings, as evidenced by Firefox, Brave and others.

jyasskin commented 3 years ago

I suggested in https://github.com/WICG/ua-client-hints/issues/214#issuecomment-804220234 that we have this spec cite https://w3c.github.io/fingerprinting-guidance/#avoid-passive-increases because it explains that "Passive fingerprinting allows for easier and widely-available identification, without opportunities for external detection, …". I think that explains why there's not much public data on how passive fingerprinting is used: by being passive, it evades such detection. That doesn't mean we should do nothing until a whistleblower comes forward, as you're arguing. We should work to improve the detectability of the available fingerprinting methods, which is what UA client hints do.

This is a distinct purpose from @pes10k's issues, especially https://github.com/httpwg/http-extensions/issues/767. He was objecting to moving active fingerprinting mechanisms from JavaScript to HTTP headers. The specific use of client hints to (eventually) replace the UA string instead moves a passive fingerprinting mechanism carried in an HTTP header to an active mechanism, also in a header.

jwrosewell commented 3 years ago

For a harm to occur there has to be a victim. If there were victims they would come forward and there would be "public data" to use as evidence to justify the proposal.

Unless there are victims that have been harmed, and that harm can be tied to the User-Agent header, the problem is at best theoretical.

The proposal should be amended to remove references to passive fingerprinting.

ronancremin commented 3 years ago

It seems odd to base a proposal on an assumption, all the more so given the dominance of HTTPS connections where only parties designated by the destination page gain access to headers. What other harm should we assume is happening?

And if passive fingerprinting really is a problem, rather than doing nothing I'm suggesting that there is already a perfectly good mechanism available to browser makers for deciding how much entropy to reveal, and that this mechanism is already quite widely used. The User-Agent header has been a SHOULD-level requirement since HTTP/1.0, so browser makers are free to decide what to put in it in order to serve their users best.

jwrosewell commented 3 years ago

@miketaylr, can you advise when we can expect the requested evidence to be provided? If you are not minded to provide any evidence, can you please provide positive confirmation?

I've observed, in my analysis of FLoC for this comment, an absence of evidence to justify Privacy Sandbox proposals. I'm keen to avoid investing even more of my company's money and time in UACH; that time could be spent on more beneficial improvements and features.

miketaylr commented 3 years ago

> @miketaylr, can you advise when we can expect the requested evidence to be provided? If you are not minded to provide any evidence, can you please provide positive confirmation?

I'm wary to estimate timelines for any spec issue -- there are competing demands for time and attention, and sometimes standards work is slower than we all would like, unfortunately. But I would like to come to a resolution to this issue, one way or the other, before we progress the spec along the standards track.

miketaylr commented 3 years ago

> And if passive fingerprinting really is a problem, rather than doing nothing I'm suggesting that there is already a perfectly good mechanism available to browser makers for deciding how much entropy to reveal, and that this mechanism is already quite widely used. The User-Agent header has been a SHOULD-level requirement since HTTP/1.0, so browser makers are free to decide what to put in it in order to serve their users best.

@ronancremin I agree with you here -- browsers can reduce the granularity of information available in the User-Agent header by default. But for use cases that require more entropy, UA-CH can provide an (active) opt-in mechanism for sites to request it from the user agent.
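To make the opt-in mechanism concrete: per the spec, a site requests additional hints by sending an `Accept-CH` response header (e.g. `Accept-CH: Sec-CH-UA-Platform-Version`), and the browser may then include the requested `Sec-CH-*` headers on subsequent requests. The low-entropy `Sec-CH-UA` header sent by default uses the HTTP Structured Fields list syntax. A minimal Python sketch of parsing such a value (the brand strings below are invented examples, and this is a simplified illustration rather than a full Structured Fields parser):

```python
import re

def parse_sec_ch_ua(value: str) -> list[tuple[str, str]]:
    """Return (brand, major_version) pairs from a Sec-CH-UA header value.

    Simplified for illustration: matches quoted brand strings paired with
    a quoted v="..." parameter, as in:
        "Chromium";v="112", "Not:A-Brand";v="99"
    """
    pattern = re.compile(r'"([^"]*)";v="([^"]*)"')
    return pattern.findall(value)

# Example header value (made up for illustration):
header = '"Chromium";v="112", "Not:A-Brand";v="99"'
print(parse_sec_ch_ua(header))  # [('Chromium', '112'), ('Not:A-Brand', '99')]
```

The point of the design is that everything beyond this coarse default is only sent after an observable server request, which is what makes the mechanism "active" rather than "passive".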

miketaylr commented 3 years ago

Here's an interesting article that talks about certain companies pitching passive HTTP-based fingerprinting, as a workaround for privacy policies: https://digiday.com/media/the-elephant-in-the-room-companies-persist-with-fingerprinting-as-a-workaround-to-apples-new-privacy-rules/.

> “Vendors won’t talk about this as fingerprinting, but you can pick apart what they mean by the language they use,” said the head of data partnerships at a global media agency who was not authorized to speak to Digiday.

> It starts by asking about how the data is collected, said the exec. “Sometimes the vendor might say they have a series of HTTP information about a person’s device — that HTTP mention is the red flag that what that company is doing is server-side fingerprinting,” the exec said.

In this case, it's not academic research about the possibility of covert tracking; it speaks to companies trying to offer this as a service today.

ronancremin commented 3 years ago

Thanks for the article.

It's difficult to be 100% sure about the exact data flow described, but it seems pretty clear that it's describing a situation where an advertising company's SDK communicates with that company's back-end servers, and those servers in turn choose to pass the received information on to a third party. This would happen outside of Apple's purview, and the user would be oblivious.

While I agree that this is fingerprinting, it certainly isn't passive—it is a deliberate choice of the advertising company.

miketaylr commented 3 years ago

I think maybe we're not aligned on the terminology of passive vs active (and admittedly they could be seen as overloaded terms). Passive, in the context of fingerprinting, means that it can happen without the user agent's knowledge -- a browser, or user for that matter, has no idea what a company might do with its HTTP logs to create a fingerprint, for example. Active means there is some code that is run to probe characteristics of a device or browser to create a fingerprint. User agents do have the capability to know what APIs are being called in this instance.

(which is more or less what https://w3c.github.io/fingerprinting-guidance/#passive and https://w3c.github.io/fingerprinting-guidance/#active say).
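The distinction above can be sketched in code. Passive fingerprinting requires no client-side code at all: a server can derive a stable identifier purely from headers it already receives on every request, which is why the user agent cannot observe it happening. A hypothetical illustration in Python (header names are real request headers; the values are invented for the example):

```python
import hashlib

def passive_fingerprint(headers: dict[str, str]) -> str:
    """Hash a fixed set of passively received request headers into a
    stable identifier. No code runs on the client, so the browser has
    no way to detect that this is being done with its HTTP logs."""
    keys = ("User-Agent", "Accept-Language", "Accept-Encoding")
    material = "|".join(headers.get(k, "") for k in keys)
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# Invented example values for illustration:
request_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Accept-Language": "en-GB,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
fp = passive_fingerprint(request_headers)
print(fp)  # the same headers yield the same identifier on every request
```

Active techniques, by contrast, must execute code in the page (for example reading `navigator.userAgent` or calling `getHighEntropyValues()`), which the user agent can observe and potentially intervene on.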

jwrosewell commented 3 years ago

I see the following issues with this article and the approach advocated.

  1. The article is dated 12th April 2021. I had assumed there was evidence predating the proposal that the proposer would have available, given the amount of effort that Google has expended and the ask they are making of other participants in the web ecosystem.

  2. The article does not acknowledge the lawful basis for probabilistic identifiers (aka fingerprints, using words that do not make the reader think of crimes). The Information Commissioner's Office (ICO) has published guidance which makes it clear that such identifiers might be personal data. As such, they are perfectly usable so long as they are explained in the privacy policy and, if another legal basis is not available, consented to. It is a shame the unnamed executive did not ask the sales representative for the privacy-policy wording that would be required to use their service and, if not compliant, report the company to the relevant data protection authority. It is also a shame the journalist did not ask their source why they didn't take this action. We must work together to eliminate bad actors and a baseless narrative.

  3. The ICO are about to consult on amending their guidance on anonymisation. This review will be relevant to this proposal.

  4. We can conclude it’s perfectly legal for companies to offer probabilistic identifiers today. Fraud detection is a good example of a use case.

  5. The article does not provide evidence of specific harms which support the proposal.

In relation to the draft fingerprinting document edited by Nick Doty referenced above: that document, and the ones it references, do not provide evidence of specific harms, only theoretical harms, without acknowledging the legal basis and benefits of probabilistic identifiers.

I agree with Michael Zacharski in this article.

> “I think as a whole the industry needs to do more to educate consumers both about their rights under current privacy laws as well as about the tech that powers the internet which we all benefit from,” said Michael Zacharski, CEO at Engine Media Exchange. “The trade-offs that happen are necessary to keep the economy of the open internet working — and the monetization for content creators needs to be made part of the privacy conversation.”

jwrosewell commented 3 years ago

It is also worth noting that the UK's ICO and CMA are now working together concerning the justification for privacy based changes to the web. See this joint ICO and CMA announcement issued on 19th May 2021.

@miketaylr - I assume you are representing Google rather than yourself in this forum, as @yoavweiss, the original active Google engineer and proposer, seems to have become less active of late. Correct?

If so, will Google be presenting any information to this forum to justify this proposal/specification? I assume such information would have existed within Google prior to expending the time and energies of their elite engineers.

jwrosewell commented 2 years ago

@miketaylr Can you advise when and where the proposed discussion will take place? Tagging @ronancremin and @jonarnes who will likely also be interested.

I observe that the current draft of the document still contains the following in the introduction.

> In practice, however, this header’s value exposes far more information about the user’s device than seems appropriate as a default...

Focusing purely on the singular issue of the many tens of millions of developer hours that will be required to upgrade data models, the ecosystem deserves a more thorough and evidence-backed justification.

Further, given the facts contained in the Google Digital Advertising Antitrust Litigation filing, and the relationship between this proposal and Privacy Sandbox, it is essential to demonstrate that this is a net improvement for the web and not just for browser vendors. At the moment the document reads as if the justification is self-evident, which it is not.

jwrosewell commented 2 years ago

It's a shame this issue and other related issues were not raised in the TAG review of User-Agent Client Hints & UA Reduction. @miketaylr, can you advise when the community will be able to review the evidence used? When will the proposed discussion be scheduled?

jwrosewell commented 1 year ago

Issues #314 and #315 advise that there is no fingerprinting impact associated with the UA-CH changes. Google cited these issues in their October 2022 quarterly report to the CMA, and repeated them on their web pages relating to User-Agent reduction. I therefore believe Google consider this evidence credible.

The abstract of the UA-CH proposal states that a goal of the document and the associated changes is “avoiding the historical baggage and passive fingerprinting surface exposed by the venerable User-Agent header”. If true, then the evidence in #314 and #315 confirms that the proposal is not fit for purpose, as so-called "fingerprinting" is unaffected by it.

Further, the Information Commissioner's Office engaged Plum Consulting to review the literature associated with data protection harms. No literature associated with fingerprinting harms was identified. The report is available here. Further research is recommended.

It now appears that the disruption and competition harms associated with the deployment of UA-CH and the associated User-Agent Reduction are no longer justified: the deployment does not meet the stated goal of the proposal, and neither does it have any other justification, as previously commented on against this issue and evidenced in the ICO/Plum review.

Please can Google (tagging @miketaylr, @cwilso, @yoavweiss) provide substantive commentary in the January 2023 quarterly report provided to the CMA and the industry under the commitments which I believe Google employees at W3C have now been trained in.