WICG / lang-client-hint

Wouldn't it be nice if `Accept-Language` was a client hint?

Thoughts and feedback #4

Open pierrefar opened 5 years ago

pierrefar commented 5 years ago

Hi,

This is the follow-up of the Twitter chat I had with Mike.

First of all, I'm very happy something is happening on this front. I've been working on different aspects of this problem for years. This proposal is a major step forward and I think we can make a major improvement here.

My approach is to explicitly break the problem down into two parts (the current draft mentions both) so we can reason about them:

  1. User experience
  2. Fingerprintability

For UX, two aspects worth dwelling on are the first navigation experience and the availability of the preferred language in JavaScript.

For first nav, the proposal currently does not let the client give the server any hint about its preference, forcing the server to fall back on its default. The Accept-Language (A-L) header does exactly that (with privacy implications), allowing the server to make a decision on each navigation, including the first request. To take one high-profile example, the homepage of the airline Ryanair responds to the A-L header; compare:

curl -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36' -H "Accept-Language: es;q=0.9" "https://www.ryanair.com/"

curl -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36' -H "Accept-Language: en;q=0.9" "https://www.ryanair.com/"

For availability in JS, it's very useful to have. For example, for an A-L-responsive webpage like Ryanair, some implementations use the client-side language preference to offer the user an 'escape hatch' if they happen to land on the 'wrong' language page; in the Ryanair example, if I ended up on the Spanish page, the website may prompt me in English to go to the equivalent en-GB URL they have. Another use case is cookie consent platforms that send the consent message in all languages as one big JS array, from which the client-side code picks the right one to show.

Fingerprintability is much more interesting. Again, we need to break this down as to why fingerprinting is possible:

  1. The content of the header
  2. The format of the header

The content is the language preference list, with weights. The format is everything about the syntax variations of a string that conveys the same information. There is absolutely no difference between these headers:

`Accept-Language: es-ES,es;q=0.9, en`

`Accept-Language: es-ES, es;q=0.9, en`

`Accept-Language: es-ES, es;q=0.9,en`

`Accept-Language: es-es, es;q=0.9,en`

but of course they all betray whichever client generated them. This is something the Tor project and Mozilla have been thinking about for a long time; example.
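To make the content/format split concrete, here is a minimal sketch that parses the four variants above into a canonical preference list. The function name `parse_accept_language` is my own, and the parser is deliberately simplified (it does not handle wildcards or malformed input); it only illustrates that the four distinct strings carry identical content.

```python
def parse_accept_language(header_value):
    """Parse an Accept-Language value into a canonical list of (tag, q) pairs."""
    prefs = []
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # the weight defaults to 1 when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q" and value.strip():
                q = float(value.strip())
        # Language tags are case-insensitive, so fold case before comparing.
        prefs.append((tag.strip().lower(), q))
    # Stable sort: highest weight first, ties keep their header order.
    return sorted(prefs, key=lambda pair: -pair[1])


variants = [
    "es-ES,es;q=0.9, en",
    "es-ES, es;q=0.9, en",
    "es-ES, es;q=0.9,en",
    "es-es, es;q=0.9,en",
]
parsed = [parse_accept_language(v) for v in variants]

# Four different byte strings on the wire, one preference list once parsed.
same_content = all(p == parsed[0] for p in parsed)
distinct_formats = len(set(variants))
```

Here `same_content` is true while `distinct_formats` is 4: the extra bits a fingerprinter gets come purely from the formatting, not from the user's preference.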

So what to do? I'd like to propose the following:

  1. On first nav, the client sends its most preferred language (en-GB) or the generic version of that language (en) to the server. This carries a low risk of fingerprinting.

  2. If the server has alternates: a. if it supports the user's preferred language, it can reply as is (the default worked after all), or b. it sends the Accept-CH: Lang response header, but with a list of the languages it supports.

  3. The client parses the server response and checks whether any of the languages in the list are supported by the client. In Mike's example in the current draft, the most preferred language is en-US but the browser also supports German. If the server does not have English content and responds in, say, French because that is its default, it could also send an Accept-CH: Lang, de response header. The browser could then ask the user whether they want to view the content in German. Bad UX, but possible; the browser could then make this decision automatically on subsequent fetches.
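The three steps above can be sketched roughly as follows. This is only an illustration of the flow I have in mind, not a spec: the function names (`first_nav_hint`, `server_respond`, `client_pick`) are hypothetical, the language matching is a simplified exact-or-base-language comparison, and the advertised list stands in for whatever the Accept-CH: Lang response header would carry.

```python
def first_nav_hint(client_prefs):
    """Step 1: reveal only the single most preferred language."""
    return client_prefs[0]


def server_respond(hint, supported, default):
    """Step 2: serve the hinted language if possible, else advertise alternates.

    Returns (language_served, advertised_list_or_None).
    """
    by_lower = {lang.lower(): lang for lang in supported}
    # Try the exact tag first, then its generic base language (en-US -> en).
    for candidate in (hint.lower(), hint.split("-")[0].lower()):
        if candidate in by_lower:
            return by_lower[candidate], None
    # No match: answer in the default and list what is available.
    return default, list(supported)


def client_pick(client_prefs, advertised):
    """Step 3: first client preference that the server advertised, if any."""
    if not advertised:
        return None
    advertised_lower = {lang.lower() for lang in advertised}
    for pref in client_prefs:
        if pref.lower() in advertised_lower:
            return pref
    return None


# Mike's example: the browser prefers en-US but also supports German.
client_prefs = ["en-US", "de"]
hint = first_nav_hint(client_prefs)
served, advertised = server_respond(hint, supported=["fr", "de"], default="fr")
choice = client_pick(client_prefs, advertised)
```

With a French-default server that also has German, the first response is served in `fr`, the alternates are advertised, and the client settles on `de` for subsequent fetches. Note the server only ever learns the top preference plus whichever advertised language the client ends up choosing.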

A somewhat controversial suggestion: If we're going down the hinting path that servers need to support, I want to speculate that we can push a server requirement too, and have a shared dictionary mapping fixed keys to language strings; something like 0x0 => 'en'. On my Linux machine right now, the /usr/share/i18n/locales directory has 341 files, so we're not talking large numbers at all. Assuming we allow that, we can get rid of the fingerprintability completely in binary streams: we can envisage a stream format that does not need to include delimiters, and capitalization is irrelevant because we're sending numeric keys.
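A rough sketch of what such a keyed encoding could look like is below. Everything in it is an assumption for illustration: the contents and ordering of `LANG_TABLE` are made up (a real table would have to be fixed by the spec), I picked 16-bit keys because a few hundred locales do not fit in one byte, and preference is conveyed by order alone rather than by weights.

```python
import struct

# Hypothetical shared dictionary: list index is the key, so 0x0 => 'en'.
LANG_TABLE = ["en", "en-GB", "en-US", "es", "es-ES", "de", "fr"]
LANG_TO_KEY = {tag.lower(): key for key, tag in enumerate(LANG_TABLE)}


def encode_prefs(prefs):
    """Pack an ordered preference list as big-endian 16-bit keys, no delimiters."""
    keys = [LANG_TO_KEY[tag.lower()] for tag in prefs]
    return struct.pack(">%dH" % len(keys), *keys)


def decode_prefs(data):
    """Unpack a byte string produced by encode_prefs back into language tags."""
    count = len(data) // 2
    keys = struct.unpack(">%dH" % count, data)
    return [LANG_TABLE[key] for key in keys]


# Differences in spelling or capitalization vanish once mapped to keys.
wire_a = encode_prefs(["es-ES", "es", "en"])
wire_b = encode_prefs(["es-es", "ES", "En"])
round_trip = decode_prefs(wire_a)
```

Both inputs produce the same six bytes, so there is exactly one wire representation for a given preference list and nothing left in the format for a fingerprinter to latch onto.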