Expand internationalisation logic to add pluralisation

What

Expand our internationalisation logic to handle pluralisation. This will allow users to be able to translate hardcoded strings in the character count component.

For example: “You have 1 character remaining” vs “You have 20 characters remaining”

Needs to handle languages other than English, that may have different plural forms, see https://cldr.unicode.org/index/cldr-spec/plural-rules

Why

Some components will need to be able to switch between different strings based on the value of something, e.g: the character count component will need to switch the count message based on the count (see example above)

Who needs to work on this

Developers

Who needs to review this

Developers

Done when

[x] Internationalisation logic expanded to introduce support for pluralisation + unit tests
[x] As noted in feedback, ensure there is only one key for 'zero’ translations
[x] Internationalisation logic expanded to introduce support for number formatting, where relevant + tests
[x] Nicely handles when a required plural form for a language is not provided, e.g: a clear error message (see related comment from pre-release testing)

As noted in feedback, ensure there is only one key for 'zero’ translations

A quick thought as I read the feedback: isn't this point more linked to the character count component itself (where the meaning of under_limit_zero and over_limit_zero are the same) that the pluralisation of translations itself? It feels more up to the component to check "oh, I've got zero, so I use that special key". If that's the case, it's maybe more part of #2805/#2701.

So, I tried to wrap my head around what would need to happen based on the spike and it seems like it's a two part update:

A. I18n constructor function to receive both locale and translations, so it can decide how to pick pluralised form:

The shape of the options passed to the constructor in the spike allowed for both to be passed, but no longer in the current implementation. Was that to provide the minimal API that'd make the tests pass or because a different direction is taken? (the locale option does not appear in the final comment for the Javascript and Nunjucks API
The default locale was pulled from the document's lang in the spike. Should that be kept? This may cause confusion if the locale in the doc doesn't match the language of the translation strings and it feels safer to be explicit by passing a locale alongside the strings. I may have missed some comments with a decision on this, though
Do we need other options? I was thinking specifically about something to let people set the function for picking plural rules, in case we missed a locale. Again, this may be something that was already discussed and I missed

Feels like this can be implemented on its own ahead ot the changes to the t method (see below).

B. Then there's the meaty bit of updating I18n.t(key, {count: 123, otherPart:...}) and adding relevant tests:

Compute the suffix to append to the requested key for lookin up the translated string. That suffix depends on count and the locale set in the I18n object. In the spike, it uses Intl.PluralRules and falls back to two maps for mapping:
- BCP 47/RFC 5646 locale codes to groups of shared ways to handle plural rules,
- then said groups to functions converting the count to a suffix that'll be appended to the key. Two questions sprung to mind:
  a. What's the reason for rolling our own, vs. polyfilling? Was it to avoid the weight of another polyfill (15+KB for this one, with a few dependencies for it as well, judging by polyfill-library, not sure if included or not)? b. Kind of tied to A.3. : The maps load all rules for all languages. Were there discussions for loading only what's needed that I may have missed?
Lookup the corresponding translation based on that key, showing an explicit error if the key is missing
Finally, interpolate the translation with whatever data is provided (count included). The issue mentions " Internationalisation logic expanded to introduce support for number formatting, where relevant + tests", but: a. I couldn't find any discussion on that topic specifically, nor example in the spike (is it a different spike maybe?) b. This feels orthogonal to pluralising so maybe worth its own issue: people may want to format numbers in regular translations (eg. to show a price with 2 digit decimals).

So, I tried to wrap my head around what would need to happen based on the spike and it seems like it's a two part update:

A. I18n constructor function to receive both locale and translations, so it can decide how to pick pluralised form:

The shape of the options passed to the constructor in the spike allowed for both to be passed, but no longer in the current implementation. Was that to provide the minimal API that'd make the tests pass or because a different direction is taken? (the locale option does not appear in the final comment for the Javascript and Nunjucks API

The latter, I believe – there's no need for a locale option for the functionality implemented so far, so we haven't added it.

I think the intent was to include locale in the options object, rather than add a new argument?

The default locale was pulled from the document's lang in the spike. Should that be kept? This may cause confusion if the locale in the doc doesn't match the language of the translation strings and it feels safer to be explicit by passing a locale alongside the strings. I may have missed some comments with a decision on this, though

I think there are a few routes we could go down:

calculate the lang for the $module by recursively looking for a lang attribute, starting with the $module itself and up to and including the root element
calculate the lang for the $module by recursively looking for a lang attribute, starting with the $module itself and up to and including the root element, but allow this to be overridden by config
explicitly require it to be set in config
explicitly require it to be set in config (but we have to 'fall back' to something, either the document or just hardcoded 'en'?)

With the first two options I think this might mean the component class working out the locale and passing it to the i18n object? Unless we pass the $module itself so that the i18n class can recurse through the parent elements?

I'm not 100% sure about the second option as it'd mean allowing for both a lang and a data-locale attribute on the $module? 🤔

Do we need other options? I was thinking specifically about something to let people set the function for picking plural rules, in case we missed a locale. Again, this may be something that was already discussed and I missed

I'd suggest we can add this in the future if we find there's a need for it.

Feels like this can be implemented on its own ahead ot the changes to the t method (see below).

B. Then there's the meaty bit of updating I18n.t(key, {count: 123, otherPart:...}) and adding relevant tests:

Compute the suffix to append to the requested key for lookin up the translated string. That suffix depends on count and the locale set in the I18n object. In the spike, it uses Intl.PluralRules and falls back to two maps for mapping:

BCP 47/RFC 5646 locale codes to groups of shared ways to handle plural rules,

then said groups to functions converting the count to a suffix that'll be appended to the key. Two questions sprung to mind: a. What's the reason for rolling our own, vs. polyfilling? Was it to avoid the weight of another polyfill (15+KB for this one, with a few dependencies for it as well, judging by polyfill-library, not sure if included or not)? b. Kind of tied to A.3. : The maps load all rules for all languages. Were there discussions for loading only what's needed that I may have missed?

Suspect @querkmachine is the best person to answer these ones!

Lookup the corresponding translation based on that key, showing an explicit error if the key is missing

Finally, interpolate the translation with whatever data is provided (count included). The issue mentions " Internationalisation logic expanded to introduce support for number formatting, where relevant + tests", but: a. I couldn't find any discussion on that topic specifically, nor example in the spike (is it a different spike maybe?) b. This feels orthogonal to pluralising so maybe worth its own issue: people may want to format numbers in regular translations (eg. to show a price with 2 digit decimals).

I think number formatting should be tackled separately in https://github.com/alphagov/govuk-frontend/issues/2568

I think the intent was to include locale in the options object, rather than add a new argument?

I should have exemplified there for clarity, sorry. The spike was indeed receiving options as two keys in an object:

{
  locale: "en",
  translations: {
    key1: 'Translated string'
  }
}

In its current state, the constructor expects the translations straight away:

{
  key1: 'Translated string'
}

I was wondering if this was because of putting in just the code to make tests pass or another decision (as having locale + translations next to each other in an object looks a pretty good option)

I think there are a few routes we could go down:

calculate the lang for the $module by recursively looking for a lang attribute, starting with the $module itself and up to and including the root element

calculate the lang for the $module by recursively looking for a lang attribute, starting with the $module itself and up to and including the root element, but allow this to be overridden by config

explicitly require it to be set in config

explicitly require it to be set in config (but we have to 'fall back' to something, either the document or just hardcoded 'en'?)

With the first two options I think this might mean the component class working out the locale and passing it to the i18n object? Unless we pass the $module itself so that the i18n class can recurse through the parent elements?

I'm not 100% sure about the second option as it'd mean allowing for both a lang and a data-locale attribute on the $module? 🤔

At the I18n class level, it feels a bit weird to have to wrangle with the concept of a $module that's in the DOM and that you can use to look further up for a lang information. It feels more like the responsibility of the components themselves to gather the right arguments to pass to their I18n instance.

I like the idea of looking up the DOM for the closest [lang] and picking that as a fallback. It's a double edged sword, though, as it'd allow to set the value for multiple components at once, but at the same time, risk having components with translation strings in one language and their pluralisation rules in different locale by mistake. Is it complexity we need from the start though, or something that may be easier provided through a base class in v5.0?

Unrelated: All that talk about locale made me think we should likely consider it for the data attributes expected by the character count (added a comment to that effect on that issue).

I'd suggest we can add this in the future if we find there's a need for it.

Suspect @querkmachine is the best person to answer these ones!

I think number formatting should be tackled separately in https://github.com/alphagov/govuk-frontend/issues/2568

👍🏻 x3 😄

I was wondering if this was because of putting in just the code to make tests pass or another decision (as having locale + translations next to each other in an object looks a pretty good option)

Unsure! Possibly another one for @querkmachine when it's back tomorrow…

I like the idea of looking up the DOM for the closest [lang] and picking that as a fallback. It's a double edged sword, though, as it'd allow to set the value for multiple components at once, but at the same time, risk having components with translation strings in one language and their pluralisation rules in different locale by mistake. Is it complexity we need from the start though, or something that may be easier provided through a base class in v5.0?

All good points. As you suggest, maybe we should keep it simple and expect an explicit data-i18n.locale for now, and we can look at doing something 'cleverer' in 5.0?

If we were going to go down that route I think I'd prefer to hardcode 'en' as the fallback rather than looking at the document lang, purely to avoid encouraging users adding a lang attribute on the document if the whole document isn't translated (e.g. the header and footer are in English, but the contents of the main are translated). I might be overthinking it though…

For some background context: Part of my reasoning for including locale as an explicit option in the initial spike was because of the need to hardcode pluralisation rules for legacy IE.

I was envisaging a situation where a user might need to translate a component into a language which the hardcoded rules don't support; for example, Mongolian. However, Mongolian has the same plural forms as English, so a user could choose to use Mongolian translation strings but set the locale option to 'en' to use English language pluralisation logic.

Since then I've come to think that this functionality is probably extraneous and won't be needed by 98+% of users. It will also not be necessary when we remove JS support from IE, as the appropriate plural form information will already be available in evergreen browsers as Intl.PluralRules.

If we opt to keep it as a configuration option, I worry that it's more likely to cause confusion unless we make the intended use more explicit: such as splitting the option in two and naming them something like overridePluralRulesLocale and overrideNumberFormatLocale.

So, my thinking here was that, rather than allowing the user to provide a locale property as we did in the spikes, we would instead find and use the lang attribute of a parent element or from the document level.

My reasoning for this is that the lang attribute needs to be set anyway for accessibility compliance, both on the page-level and for any sections of the page that are in different languages to the rest of the page (WCAG criterion 3.1.1 (Level A) and 3.1.2 (Level AA)).

By making lang be the only way of doing this, we would be enforcing (or, at least, very heavily encouraging) users to comply with accessibility guidelines, and prevent issues with one locale being given in data attributes/JS configuration that isn't aligned with the actual language in use. Keeping it simple now, and only adding that complexity if users feel they need it, seems the better route to take.

My hypothesis is that doing it this way shouldn't add unnecessary code weight as we already include an Element.prototype.closest polyfill in Frontend, though that could always be wrong!

For the other questions I just noticed (cc: @romaricpascal)

a. What's the reason for rolling our own, vs. polyfilling? Was it to avoid the weight of another polyfill (15+KB for this one, with a few dependencies for it as well, judging by polyfill-library, not sure if included or not)?

Avoiding polyfills and their dependencies and simplicity of implementation. We don't actually need a lot of the PluralRules functionality at the moment, we only need to support positive integers (not negatives, decimals, and ranges as the spec supports).

b. Kind of tied to A.3. : The maps load all rules for all languages. Were there discussions for loading only what's needed that I may have missed?

I included a whole bunch of languages in the spike mainly for completion's sake (i.e. the work is already done and there if we need it). In practice, services are only provided in a few different languages—English, Welsh, and very occassionally Arabic—although GOV.UK content pages use many more. I imagined we would be able to cut down on the size of these if and when we had a better idea of what languages to prioritise.

We don't have any existing architecture for Frontend to conditionally load certain files, or having users load specific configuration files, which is why they're all bundled together for now. We've discussed using dynamic imports in the past, but this would be something we would introduce post-IE support, at which point we'd be removing the hardcoded plural rule maps anyway.

alphagov / govuk-frontend