explainers-by-googlers / reduce-accept-language

This repository hosts the explainer for reducing passive fingerprinting in the Accept-Language header.
Creative Commons Attribution 4.0 International

I18N objections to reducing accept-language #10

Open aphillips opened 1 year ago

aphillips commented 1 year ago

The W3C Internationalization Working Group discussed this proposal in our 2023-03-09 teleconference and I was actioned with creating this response.

The I18N WG is concerned that reducing the accept-language header to just the first entry, while perhaps helpful in reducing fingerprinting, will have a potentially negative impact on multilingual users. We are especially concerned with the potential impact on speakers of minority languages. A speaker of a minority language may find that many sites do not support their language, and thus they may want to specify a second language (generally a more common one) to ensure the best match for their preferences.

For example, a speaker who prefers Breton (br) [Breton is a regional language found mainly in France] is likely to also speak French (fr, or perhaps fr-FR). They would thus like to have an A-L header something like:

Accept-Language: br,fr-FR;q=0.8

If the A-L header is reduced to a single entry, they would have to choose either French or Breton for their requests. If they chose Breton, they might find some sites defaulting them to another language (such as English), even when French is available.
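The difference between the two headers can be illustrated with a small server-side matcher. The sketch below is only an illustration: the function names (`parse_accept_language`, `pick_language`), the prefix-truncation fallback, and the English default are assumptions, not something any particular site is known to use.

```python
def parse_accept_language(header):
    """Return (language_range, q) pairs, highest q first (header order kept for ties)."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # a missing weight means q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparsable weight as "not acceptable"
        entries.append((tag.strip().lower(), q))
    # sorted() is stable, so equal weights keep their header order
    return sorted(entries, key=lambda entry: -entry[1])


def pick_language(header, supported, default="en"):
    """Pick the first supported language, trying each range and then its shorter prefixes."""
    supported_by_lower = {lang.lower(): lang for lang in supported}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        parts = tag.split("-")
        while parts:  # fr-fr -> fr
            candidate = "-".join(parts)
            if candidate in supported_by_lower:
                return supported_by_lower[candidate]
            parts.pop()
    return default  # hypothetical site default


print(pick_language("br,fr-FR;q=0.8", ["en", "fr"]))
print(pick_language("br", ["en", "fr"]))
```

For a hypothetical site that supports only en and fr, the full header selects fr, while the reduced header `br` falls through to the site's default.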

We also note that for most users of most browsers the A-L header is usually set to a single entry matching the system runtime locale. Most users do not tailor this configuration. The users who do use the browser's user experience to modify their preferences are taking a specific positive action to assert what they want their browser to send on their behalf. Additional privacy warnings could be provided to users there, but users ought to be allowed to control what their browser emits on their behalf in order to receive the best possible Web experience.

Thanks!

Tanych commented 1 year ago

Thanks for the information. We are not reducing the accept-language header to just the first entry. What we are doing is reducing accept-language to the user's most preferred representation: only the browser knows the user's full preference list, and the browser will do the language negotiation and send the most preferred language to the site. In your example, if a site doesn't care about the Accept-Language header, the header will be reduced to the first entry, br. If a site does care about the user's accept-language header, it needs to send a variants response header specifying which languages it supports, e.g. variants: accept-language=(en, fr). This means the site supports en and fr. In that case, the browser will send a request with accept-language fr-FR to get the best representation of the page for the user.
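A rough sketch of the browser-side negotiation described here, under stated assumptions: the site's advertised languages are modelled as a plain list, matching is done by prefix truncation, and `reduce_accept_language` is a hypothetical name. The actual negotiation in the proposal may differ.

```python
def reduce_accept_language(user_preferences, site_languages=None):
    """Choose the single language the browser would send.

    user_preferences: the user's full list, most preferred first.
    site_languages: languages the site advertised, or None if it advertised nothing.
    """
    if not user_preferences:
        return None
    if not site_languages:
        # The site did not advertise anything: send the first entry only.
        return user_preferences[0]
    advertised = {lang.lower() for lang in site_languages}
    for preferred in user_preferences:
        parts = preferred.lower().split("-")
        while parts:  # fr-fr -> fr
            if "-".join(parts) in advertised:
                return preferred
            parts.pop()
    # Nothing matched: fall back to the most preferred language.
    return user_preferences[0]


print(reduce_accept_language(["br", "fr-FR"]))
print(reduce_accept_language(["br", "fr-FR"], ["en", "fr"]))
```

With the Breton example, the first call yields br and the second yields fr-FR, matching the two cases described in the comment.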

hanguokai commented 1 year ago

The negotiation requires backend support. Servers all over the world would need to change for this. That is a million times more difficult than changing a single browser.

And websites that rely only on client-side I18N (where the server doesn't support this) can't do negotiation.

Tanych commented 1 year ago

The majority of multilingual sites don't care about what is actually set in the accept-language header. Most sites rely on geolocation or are UI-driven to redirect users to the correct language page. We know this requires backend changes for sites that care about the accept-language header. Can you elaborate on why sites that rely on client-side i18n can't do negotiation? Thanks.

hanguokai commented 1 year ago

Most sites rely on geolocation or are UI-driven to redirect users to the correct language page.

It is useful to provide language selection in the UI. But this proposal mainly affects the default display language of the site. If you need to manually adjust the language every time you visit a new website, it is too troublesome. By the way, not many websites change the language based on the user's geographical location.

We know this requires backend changes for sites that care about the accept-language header. Can you elaborate on why sites that rely on client-side i18n can't do negotiation? Thanks.

Static generated sites, e.g. created by Jekyll or Hugo, are just static files. They can be deployed on any web server. They usually use a pure client-side approach to achieve multi-language support, without relying on the server.

Tanych commented 1 year ago

I mean the change only impacts sites that rely on the accept-language header to send the right language page to users; it does not affect sites that implement their multilingual feature through the UI.

Static generated sites, e.g. created by Jekyll or Hugo, are just static files. They can be deployed on any web server. They usually use a pure client-side approach to achieve multi-language support, without relying on the server.

Do those tools rely on the accept-language header in requests, or do they use a JS interface to get the user's language preferences?

hanguokai commented 1 year ago

What I understand about "ui-driven to redirect users to the correct language page" is a language select list on the site's UI.

Do those tools rely on the accept-language header in requests, or do they use a JS interface to get the user's language preferences?

They don't rely on the accept-language header. They use navigator.languages on the client side to pick the best language as the default, and also provide a language select list in the UI.

Tanych commented 1 year ago

We are first considering reducing the accept-language header. The JS interface probably needs to be handled differently, as most of the feedback and concerns relate to it. We will double-check and review the impact once we actually roll out the change.

hanguokai commented 1 year ago

I think there are many websites who care about the accept-language header. It may be a very long process for them to support the negotiation as I said before.

aphillips commented 1 year ago

The majority of multilingual sites don't care about what is actually set in the accept-language header. Most sites rely on geolocation or are UI-driven to redirect users to the correct language page.

Some sites, including some quite large sites, depend on A-L as at least a hint to providing the right user experience for new users, even with geo-location or other UI. That's not the same as solely relying on the header's contents. Language negotiation and management for users is not entirely simple.

We are not reducing the accept-language header to just the first entry. What we are doing is reducing accept-language to the user's most preferred representation: only the browser knows the user's full preference list, and the browser will do the language negotiation and send the most preferred language to the site.

This is helpful. How will the browser know how to reduce the requested set?

I think there are many websites who care about the accept-language header. It may be a very long process for them to support the negotiation as I said before.

+1 ... but change has to start somewhere. Could sites support legacy headers while also implementing a negotiation scheme?

Tanych commented 1 year ago

Yeah, I don't think we will land any change in the short term. We need to collect enough feedback and roll out slowly.

Could sites support legacy headers while also implementing a negotiation

Sites can support both the reduced and the full accept-language header, if they care about the language preference in the header. See the demo sites: https://developer.chrome.com/blog/origin-trial-for-accept-language-reduction/#demo

Tanych commented 5 months ago

@aphillips I'm trying to bring back the discussion with the i18n group. I have some questions:

miketaylr commented 4 months ago

@aphillips any comments on how Safari is handled? I haven't been able to discover a lot of i18n bugs reported against WebKit given their current behavior.

Static generated sites, e.g. created by Jekyll or Hugo, are just static files. They can be deployed on any web server. They usually use a pure client-side approach to achieve multi-language support, without relying on the server.

@hanguokai if we exposed a <meta> element to also set available languages, this could be handled easily by a static site generator.

sffc commented 3 months ago

Two things I want to point out here.

First, web sites definitely do use Accept-Language information, including fallback languages. It is quite easy to see how this is done. For example, in Chrome:

  1. Create a new clean profile.
  2. In Settings, choose your languages. I set mine to the following: Māori, Spanish, English (in that order).

I hit several sites with my local IP, which is currently somewhere in California, USA. Among the sites that obeyed my Accept-Language:

  1. wikipedia.org gave me a Māori UI and suggested Spanish and then English among other subdomains I could then visit.
  2. youtube.com gave me a Spanish UI. Other Google properties also give me Spanish UIs.
  3. facebook.com gives me an English UI, which I assume comes from my IP. Other Meta properties also give English.
  4. twitter.com gives me Spanish
  5. Many E-commerce sites give me English; perhaps they infer a stronger correlation with GeoIP for sales purposes. However, there are numerous counter-examples such as booking.com that gave me Spanish.

It is satisfying as an i18n engineer to see web sites do the right thing here.

Second, fallback language can be useful for inferring other locale extensions. For example, a common Accept-Language may be something like eu-ES,en;q=0.9: you speak Basque (Spain) but with English fallback. With this header, even if web sites don't support Basque, they can infer that you are more likely to prefer ES regional preferences for hour cycle, measurement units, calendar system, etc., than US regional preferences (default for en). This inference is far from perfect, of course, which is why we have a separate proposal to allow users to customize these preferences. However, eu-ES,en;q=0.9 is objectively more useful than either eu-ES or en when deriving locale-related preferences.
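As a hedged illustration of this kind of inference, the sketch below pulls a region out of an Accept-Language value written in standard q-value syntax (eu-ES,en;q=0.9). The function name `infer_region` and the simple "first range that carries a region wins" rule are assumptions for illustration; real sites may use richer heuristics or likely-subtags data.

```python
def infer_region(header):
    """Return the region subtag of the first language range that has one, else None."""
    for item in header.split(","):
        tag = item.split(";")[0].strip()
        for subtag in tag.split("-")[1:]:
            if len(subtag) == 1:
                # A singleton (e.g. "u" or "x") starts an extension; stop looking.
                break
            is_alpha_region = len(subtag) == 2 and subtag.isalpha()   # ISO 3166-1, e.g. ES
            is_numeric_region = len(subtag) == 3 and subtag.isdigit()  # UN M.49, e.g. 419
            if is_alpha_region or is_numeric_region:
                return subtag.upper()
    return None


print(infer_region("eu-ES,en;q=0.9"))
print(infer_region("en"))
```

With the fallback list, a site that lacks Basque content can still learn that ES conventions are likely preferred; with en alone, no region is available and the site has to guess.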

I haven't been able to discover a lot of i18n bugs reported against WebKit given their current behavior.

I can't speak to Safari's decisions regarding Accept-Language, except to point out that it's well established that users don't often report i18n-related issues; they assume it's their fault that something isn't working. If a user's true Accept-Language is something like eu,es;q=0.9,en;q=0.5, they might set their browser's locale to es in order to get the best experience based on what they know how to do, losing out on content on the open internet that might be tailored in their actual preferred language.

I have pointed @Constellation to this discussion to weigh in from WebKit's point of view.

FrankYFTang commented 3 months ago

  • how does the i18n works on safari since it only contains one language?

I think the statement "safari since it only contains one language" is false. According to my research, Safari outputs more than one language. For example, the following are all output from Safari:

Accept-Language: hi-IN, hi;q=0.9
Accept-Language: zh-TW, zh-Hant;q=0.9
Accept-Language: zh-HK, zh-Hant;q=0.9 
Accept-Language: zu,en-US;q=0.9,en;q=0.8
Accept-Language: ga,zh-HK;q=0.9,zh-Hant;q=0.8
Accept-Language: gu-IN,gu;q=0.9,hi-IN;q=0.8,hi;q=0.7

From all I can see, Safari always includes one or more fallbacks in the Accept-Language header. Some headers have a second entry with a q value as a fallback; others have a third entry with a q value as an additional fallback. For a user who sets Gujarati as their primary language and Hindi as their secondary language in the System settings, Safari will even send a fourth entry with a q value as the third fallback ("gu-IN,gu;q=0.9,hi-IN;q=0.8,hi;q=0.7").

Below are results from Safari "Version 17.4.1 (19618.1.15.11.14)" on my Mac Air, macOS 14.4.1 (23E224)

Here is what I see in the HTTP header if you change the UI to Hindi, English, and Traditional Chinese:

Accept-Language: hi-IN, hi;q=0.9
Accept-Language: en-US, en;q=0.9
Accept-Language: zh-TW, zh-Hant;q=0.9

As you can see from the above headers, every Accept-Language value sent by Safari version 17.4.1 includes two (not one) languages.

The first line includes hi-IN (Hindi as used in India) as the first language, with a second language hi (Hindi not specific to any region in the world) as a fallback with weight 0.9.

The second line includes en-US (American English) as the first language, with a second language en (English not specific to any region in the world) as a fallback with weight 0.9.

The third line includes zh-TW (Chinese as used in Taiwan) as the first language, with a second language zh-Hant (Chinese written in Traditional Chinese characters) as a fallback with weight 0.9.

Per the definition in HTTP 1.1, these lines all have two languages, not one! According to https://datatracker.ietf.org/doc/html/rfc2068 or https://datatracker.ietf.org/doc/html/rfc7231

Accept-Language = 1#( language-range [ weight ] )
language-range = <language-range, see [RFC4647], Section 2.1>
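To make the counting concrete, here is a small sketch that splits a header value into its language-range elements and, separately, into distinct primary language subtags. The two helper names are hypothetical, and the splitting is a simplification of the grammar above (it ignores the wildcard and does not validate tags). The distinction between the two counts is the one the "one language" disagreement in this thread turns on.

```python
def language_ranges(header_value):
    """Split an Accept-Language value into its language-range elements, dropping weights."""
    return [
        element.split(";")[0].strip()
        for element in header_value.split(",")
        if element.strip()
    ]


def primary_languages(header_value):
    """Return the distinct primary language subtags, in order of first appearance."""
    seen = []
    for language_range in language_ranges(header_value):
        primary = language_range.split("-")[0].lower()
        if primary not in seen:
            seen.append(primary)
    return seen


for value in ("hi-IN, hi;q=0.9", "gu-IN,gu;q=0.9,hi-IN;q=0.8,hi;q=0.7"):
    print(value, language_ranges(value), primary_languages(value))
```

Under this reading, "hi-IN, hi;q=0.9" carries two language-ranges but one primary language, while the Gujarati and Hindi header carries four language-ranges and two primary languages.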

Tanych commented 3 months ago

@sffc

First, web sites definitely do use Accept-Language information, including fallback languages

We don't mean that web sites are not using Accept-Language; what we saw is that the number of sites using Accept-Language is low compared with the overall number of sites on the web.

Second, fallback language can be useful for inferring other locale extensions

Our proposal takes fallback languages into consideration. For example, if the primary language is en-US, browser-side language negotiation will consider matching en-US and en to find the best matching language. For more examples, see the implementation docs.

@FrankYFTang

"safari since it only contains one language"

I would like to make it clear that, for the Safari case, one language means only one of the user's preferred languages takes effect. For example, if the user sets two or more preferred languages, such as en-US, zh-CN, zh-TW, the Accept-Language header contains only the user's first language. It can potentially expand to two entries if that language includes a region code: for en-US, the Accept-Language header is extended to something like en-US,en;q=0.9. However, the JS getter navigator.languages returns only one language with no fallback language; in this case, it returns en-US. I say the Accept-Language HTTP header can "potentially" expand to two languages because, if the user's primary language has no region code, the header won't have any fallback language. For example, with a preferred language list of ja, en-US, zh-CN, zh-TW, the Accept-Language header contains only ja without any fallback language, and the JS getter also returns only one language, which in this case is also ja.

FrankYFTang commented 3 months ago

We don't mean that web sites are not using Accept-Language; what we saw is that the number of sites using Accept-Language is low compared with the overall number of sites on the web.

Could you publish how many and which sites you collected data from in your research, and how you "saw" that? What experimental method did you use to reach that finding? If you collect data from the top 10,000 sites on the web, how many of them do not respect the fallback in the Accept-Language header? For example, how many of these sites will 1) return French content if the Accept-Language is set to

Accept-Language: fr

AND 2) not return Zhuang content if

Accept-Language: za

AND 3) will NOT return French content if the Accept-Language: is set to

Accept-Language: za,fr;q=0.9

If all three of the conditions above are true, that means the site does not listen to the fallback, right? Otherwise, the site MAY listen to the fallback.

(In that case, the website supports French but not Zhuang, and will not return French when French appears only in the fallback list and Zhuang is not supported.)

Since you already "saw is that the number of sites using Accept-Language is low", could you tell us what the percentage is? How low? 50%, 45%, or 40%?
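The three-condition check can be sketched as a small probe. Here `site` is a hypothetical callable that maps an Accept-Language value to the language of the content returned; in a real survey it would wrap an HTTP request plus some form of language detection, neither of which is shown. The two toy sites are made up purely to exercise the probe.

```python
def ignores_fallback(site):
    """True if all three conditions hold, i.e. the site does not honor the fallback entry."""
    serves_french = site("fr") == "fr"               # 1) French is available
    lacks_zhuang = site("za") != "za"                # 2) Zhuang is not available
    skips_fallback = site("za,fr;q=0.9") != "fr"     # 3) French fallback is not honored
    return serves_french and lacks_zhuang and skips_fallback


SUPPORTED = {"en", "fr"}  # made-up language set for both toy sites


def first_entry_only_site(header):
    """Toy site that looks only at the first language range."""
    first = header.split(",")[0].split(";")[0].strip()
    return first if first in SUPPORTED else "en"


def fallback_aware_site(header):
    """Toy site that walks the list in order (it ignores q values for brevity)."""
    for item in header.split(","):
        tag = item.split(";")[0].strip()
        if tag in SUPPORTED:
            return tag
    return "en"


print(ignores_fallback(first_entry_only_site))
print(ignores_fallback(fallback_aware_site))
```

A False result only means the site may listen to the fallback, as noted above; the probe cannot prove that it does so in general.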

Our proposal takes fallback languages into consideration. For example, if the primary language is en-US, browser-side language negotiation will consider matching en-US and en to find the best matching language.

If the user has

Accept-Language: fr-DZ;q=0.9, fr-CA;q=0.8, fr-FR;q=0.7

and the site has Canadian French (fr-CA) content but not Algerian French (fr-DZ), how would your proposal make it return Canadian French instead of France French (fr-FR)?

I would like to make it clear that, for the Safari case, one language means only one of the user's preferred languages takes effect,

First, if that is truly what you mean, then since in your proposal you also write "That means we only send only one language in the Accept-Language request header." and later

Get / HTTP/1.1
Host: example.com
Accept-Language: en

Could you please change that example to show

Get / HTTP/1.1
Host: example.com
Accept-Language: en-US,en;q=0.9

to make sure people understand that, in your proposal, Accept-Language will still output two language-ranges (since that also fits your interpretation of "contain only one", right?) as long as they come from one item. Or, even better, you could change it to

Get / HTTP/1.1
Host: example.com
Accept-Language: zh-TW,zh-Hant;q=0.9

as a better example, since then people won't assume that the second entry can only be a substring of the first one plus ";q=0.9".

Second, how did you conclude that? Do you have access to the Safari source code to verify it? What is the logic for generating the Accept-Language header in Safari now? Could you show me the algorithm by which it outputs

Accept-Language: zh-TW, zh-Hant;q=0.9
Accept-Language: zh-HK,zh-Hant;q=0.9

when I select Chinese (Traditional) and, in the second case, Cantonese (Traditional)?

I would like to make it clear that, for the Safari case, one language means only one of the user's preferred languages takes effect,

Also, when a Gujarati user sets their System settings to use Gujarati as the first language and Hindi as the second language, Safari will output the Accept-Language header as

Accept-Language: gu-IN,gu;q=0.9,hi-IN;q=0.8,hi;q=0.7

and this is a clear counterexample to the statement "only one of the user's preferred languages takes effect", since both Gujarati and Hindi are output in the Accept-Language header. They are listed as two different languages in the "Eighth Schedule" of The Constitution of India (language 5 and language 6 on p.325 of The Constitution of India).

miketaylr commented 3 months ago

@FrankYFTang for Accept-Language: zh-HK,zh-Hant;q=0.9, can you share the output of navigator.language? Thanks!

FrankYFTang commented 3 months ago

@FrankYFTang for Accept-Language: zh-HK,zh-Hant;q=0.9, can you share the output of navigator.language? Thanks!

"zh-HK" in 17.4.1(19618.1.15.11.14)on my MacOS 14.4.1(23E224)

sffc commented 3 months ago

We don't mean that web sites are not using Accept-Language; what we saw is that the number of sites using Accept-Language is low compared with the overall number of sites on the web.

I gave several examples of high traffic web sites doing the right thing with Accept-Language.

For the ones using a different inference mechanism such as GeoIP, there might be specific business needs like commerce, or they might just have not invested fully in proper internationalization, which is a problem I see too often.

Our proposal takes fallback languages into consideration. For example, if the primary language is en-US, browser-side language negotiation will consider matching en-US and en to find the best matching language.

You may be misunderstanding my second point. Note that I am using eu (Basque, less common) falling back to English.


Also, your proposed Available-Languages response header doesn't seem practical to me. The best web sites are translated into at least 70 languages, sometimes more than 150. That's a lot of data to include. Accept-Language is comparatively small!

FrankYFTang commented 3 months ago

Here is a statistic from one study: "There are approximately 3.3 billion bilingual people worldwide, accounting for 43% of the population" (Gration).

Works Cited

Gration, Elizabeth. “Bilingualism Statistics in 2024: US, UK & Global.” Language Learning with Preply Blog, 17 Apr. 2024, preply.com/en/blog/bilingualism-statistics.

FrankYFTang commented 3 months ago

Also, certain countries in the world require multilingual support, since people in those countries usually use more than one language in their daily lives:

Tier 1: High Importance

Tier 2: Moderate Importance

Tier 3: Lower Importance

FrankYFTang commented 3 months ago

Here are some technologies and frameworks that support Accept-Language fallback and that many websites are built on, according to my new BFF Gemini

Many popular web frameworks provide built-in support or mechanisms for handling the HTTP Accept-Language header and implementing fallback behavior. Here are a few examples:

  1. Spring Framework (Java):
  2. Django (Python):
  3. Ruby on Rails:
  4. Express.js (Node.js):
  5. ASP.NET Core:

FrankYFTang commented 3 months ago

what we saw is that the number of sites using Accept-Language is low compared with the overall number of sites on the web.

Is that a true statement? I really doubt your claim. Please list 10 sites for which you think this is the case. I have a hard time finding any site on the web that does not use Accept-Language or Accept-Language fallback for language content negotiation.

sffc commented 3 months ago

Just to add to the ecosystem support research:

Node.js: https://www.npmjs.com/package/i18next-http-middleware is probably the most popular Node.js i18n plugin. It considers Accept-Language for language detection, with fallbacks to other language detection modes. I looked at the code and verified that it handles the fallback list, including q values.

Python Django: https://docs.djangoproject.com/en/5.0/topics/i18n/ explains how it uses Accept-Language. The code appears to correctly handle fallback values.

WordPress: https://translatepress.com/docs/addons/automatic-user-language-detection/ is a plugin to use either Accept-Language or GeoIP for language detection. I could not find the source code so wasn't able to verify if it handles fallback.

FrankYFTang commented 3 months ago

Let me give you a real-life example of how multilingual websites in the Bay Area use Accept-Language.

Many Malaysian-born Chinese know both Malay and Chinese. If you configure their Chrome language settings to Malay, Traditional Chinese, English, their Chrome will send the following Accept-Language today

Accept-Language: ms,zh-TW;q=0.9,zh;q=0.8,en-US;q=0.7,en;q=0.6

The Fremont city government website supports English, Traditional Chinese, Simplified Chinese, Korean, Vietnamese, Spanish, Hindi, and Panjabi, but not Malay.

If their website did not support Accept-Language fallback, the user would get English as the default.

If Chrome implements your proposal, it will send only

Accept-Language: ms 

and get English as well. But today these sites return Traditional Chinese, because the user set Traditional Chinese as their fallback after Malay.

Sites you can try

Fremont city government website: https://www.fremont.gov/
Mountain View city government: https://www.mountainview.gov/
Mountain View Library: https://library.mountainview.gov/

FrankYFTang commented 3 months ago

I really have a hard time understanding how the goal of this proposal can be achieved

Problem: ... As part of the Chrome team’s anti-covert tracking efforts, we would like to improve privacy protections by minimizing passive fingerprinting surfaces. ... Proposal: ... We propose that, by default, the browser should only send the user's most preferred language in the Accept-Language header instead of sending all languages. That means we only send only one language in the Accept-Language request header.

For any users who do not touch their language preferences, your proposal will not reduce any entropy. Without your proposal, Safari sends two items in Accept-Language, as I mentioned, for example "Accept-Language: zh-HK,zh-Hant;q=0.9"; with your proposal, Safari would send one item, "Accept-Language: zh-HK". For that group of users the information is the same, just different text.

There will be no impact on their privacy from your proposal, right?

Now, X% of users bother to add additional items to their language preferences. Could you share with us what X is, based on your study? If X is very small, then your proposal will have a very tiny impact, right? If X is big, then your proposal will have a big impact, and that impact could be either positive or negative. But since you propose this change, it should be reasonable for me to ask you to share your estimate of X and the research method behind it, right?

sffc commented 3 months ago

From Frank's comment above, we know that 43% of Web users are multilingual, and of those, some percentage will use a multilingual fallback list that deviates from what we could assume is the most likely fallback list for a particular language. This proposal most directly impacts the Web Platform experience for those users.

sffc commented 3 months ago

One other comment about the proposal overall is that it moves Passive Fingerprinting to Active Fingerprinting, but it still allows fingerprinting; the app just needs an extra request to get the additional info. With my work on the Locale Extensions proposal, what I heard from other browser vendors is that they don't consider the shift to Active Fingerprinting to be a significant win for Web Platform privacy.

miketaylr commented 3 months ago

Honestly, it's a bit hard to follow and/or respond to this wall of comments (just from today...) - it's coming off as very passionate and a little unfocused. I will assume the passion is coming from a good place, but I would appreciate if we can keep the feedback constructive. Thanks.

hanguokai commented 3 months ago

My overall view on this proposal

Users care about their privacy, but they also care about the web browsing experience, so privacy and convenience need to be balanced. HTTP-Header and JS API are the cornerstones of Web i18n. The current problem is that the new proposal subverts it, and the cost of this change is so high (for users and developers) that it is doubtful whether it is really effective and worth pursuing. And the current i18n technology is already relatively complicated in practice.

Some details

  1. Does the "Accept Language Cache" in this proposal cache the site's (origin's) supported languages rather than a page's supported languages? For example, "https://example.com/page-A" or the home page may support 5 languages, while "https://example.com/page-B" supports only 2. This situation requires renegotiation.

  2. Some websites redirect to a different URL for each language; other websites serve different language content at the same URL. Can this proposal fit both situations?

  3. When users select a language from website's language-select menu (e.g. saved in cookie), the server needs to check this clue first. In this case, the server doesn't need to return variants header, right?

I can't imagine all the server-side processing logic yet for this proposal. In practice I think it's going to be complicated to support this language negotiation process.

miketaylr commented 3 months ago

Thanks @hanguokai - I appreciate your feedback, and it's noted.