Prevent XSS from translation strings

A bit of context

Components supporting internationalisation currently inject translated content using innerHTML. This is necessary to allow visually hidden text to be injected, like in the accordion "Show/Hide this section" messages. This content can reach them two ways, through:

their JavaScript configuration
data-attributes on the element

The latter also acts as a bridge between options passed to the Nunjucks macro and the JavaScript, making it easy to pass the translations.

While innerHTML won't execute <script> tags, there's a wide range of malicious strings that could be used for cross-site scripting (XSS) should they end up as is in the HTML. With our current components, translation strings should not be user-generated, which limits the risks. However, this is not to say they never will be: someone may well use our components with user-generated content for a translation string at some point in the future.

This calls for some thoughts and there are a few routes we could take (pooled from the Dev catch-up)

1. Leave it to the services

Do nothing and trust the services to pass sanitised strings for us to inject. Maybe at some point in the future, introduce some sanitisation to limit the HTML that can be injected.

Advantages

No effort on our part

Impact

None

Concerns

There's no way to guarantee that we'll receive sanitised HTML in all scenarios, and ultimately, we're the ones injecting it into the DOM
Sanitising later means introducing breaking changes for people that were happily injecting HTML to solve their issues

2. Sanitise HTML before injection

Before injecting the content on the page, sanitise the strings to only allow a specific set of elements and attributes.

Advantages

Extra safety as we only allow what we consider safe
Sets the path for future attribute reads to be sanitised

Impact

Extra work to either develop it or pick a library (npm has plenty, of various sizes). May be done in a central way as part of the string normalisation

Cons

We'll need to allow the right tags/attributes to be safe, but not restrict so much that we don't meet users' needs.
Some more JS shipped to users (this can range from pretty small if we end up with a solution like Bootstrap's to much larger with more full-fledged libraries)

3. Have a separate API for text injection and HTML injection

Similarly to the Nunjucks templates, offer two paths for getting content in the DOM:

a text option, that injects the content safely, as text (using textContent, for ex)
an html option, that keeps injecting it through innerHTML, unsanitised

We'd encourage the use of the first one, potentially flagging the second one as "dangerous". We'd also explain to users that it's down to them to ensure that only sanitised content ends up in the HTML part.
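A minimal sketch of how a component could choose between the two paths. The helper name `injectTranslation` and the `text`/`html` option names are assumptions made for illustration (mirroring the Nunjucks macros), not existing govuk-frontend API. The element only needs `textContent` and `innerHTML` properties, so a plain object stands in for a DOM node here. In this sketch the HTML version wins when both are given, which is only one possible choice.

```javascript
/**
 * Hypothetical helper: inject a translation as text unless HTML was
 * explicitly opted into. Not part of govuk-frontend.
 *
 * @param {{ textContent: string, innerHTML: string }} element - DOM element (or stand-in)
 * @param {{ text?: string, html?: string }} translation - The two possible versions
 * @returns {'html' | 'text'} Which path was used
 */
function injectTranslation(element, translation) {
  if (typeof translation.html === 'string') {
    // "Dangerous" path: the caller is responsible for sanitising this string
    element.innerHTML = translation.html
    return 'html'
  }

  // Safe default: markup in the string is displayed literally, never parsed
  element.textContent = translation.text || ''
  return 'text'
}

// Usage with a plain object standing in for a DOM element
const button = { textContent: '', innerHTML: '' }
const pathUsed = injectTranslation(button, {
  text: 'Show <b>all</b> sections'
})
```

With only `text` provided, the markup in the string is assigned to `textContent`, so a browser would display the angle brackets rather than interpret them.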

Advantages

Safe default route
Clear opt-in from users that need HTML injected

Impacts

Components would need updating to look up the presence of either the HTML or non-HTML version of the translation and inject accordingly
Nunjucks API would need updating to pass both versions of the translation

Concerns

Complexity of changes to make, especially when needing to override default (we may want HTML to take precedence, but the accordion will likely provide it as default translation, so we'll need to ignore it if someone wants to set a text translation, for ex.)
Does it solve the issue, as ultimately, we'd still be vulnerable to XSS when using the "dangerous" route?

4. Only allow translations to be passed through JavaScript

Restrict the passing of translation strings to JavaScript only, to prevent them from appearing on elements, possibly making them harder to modify?

Advantages

With strings coming from the JavaScript, we're more confident that they come directly from the service and were not user input

Impacts

Removing i18n attributes when merging configuration
Finding a different (but safe) way to pass translations from Nunjucks to JavaScript

Concerns

Pretty big rollback on the features
Complexity of finding a way to pass attributes
Is there still a risk that strings inside the JavaScript could be tampered with?

5. Add a separate interpolation for visually hidden text

Ultimately, the reason for using innerHTML is to inject <span class="govuk-visually-hidden">...</span>s to help accessibility, so we may want to focus on doing just that rather than trying to allow any HTML.

This could take the form of a special marker in translation strings to wrap content to be visually hidden (as it won't necessarily only be appended).

Advantages

Restricts the possibility of abuse: we inject text
Focused on solving only the issue we have

Impacts

Devise a marker to wrap content that needs to be visually hidden
Implement the replacement of the marker with the <span>s. Note that this would need to be after the text is safely injected on the page with textContent. If we were to interpolate first, then we'd have to use innerHTML to inject the interpolated string, leaving us vulnerable if someone put some malicious code alongside the visually hidden content.
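One way to honour that constraint is to never build an HTML string at all: split the message into segments, then create text nodes and `<span>`s through DOM text APIs. The sketch below assumes a made-up marker syntax (`[[hidden]]...[[/hidden]]`) and hypothetical helper names; the actual marker would still need to be devised. `renderSegments` takes the document as a parameter so that the parsing step can be exercised without a browser.

```javascript
// Hypothetical marker syntax, chosen only for this sketch
const HIDDEN_PATTERN = /\[\[hidden\]\]([\s\S]*?)\[\[\/hidden\]\]/g

/**
 * Split a translation string into plain and visually hidden segments.
 *
 * @param {string} message - Translation string, possibly containing markers
 * @returns {{ text: string, hidden: boolean }[]} Segments in source order
 */
function splitHiddenSegments(message) {
  const segments = []
  let lastIndex = 0

  for (const match of message.matchAll(HIDDEN_PATTERN)) {
    if (match.index > lastIndex) {
      segments.push({ text: message.slice(lastIndex, match.index), hidden: false })
    }
    segments.push({ text: match[1], hidden: true })
    lastIndex = match.index + match[0].length
  }

  if (lastIndex < message.length) {
    segments.push({ text: message.slice(lastIndex), hidden: false })
  }

  return segments
}

/**
 * Append segments to an element as text nodes and <span>s.
 * All content goes through text APIs, so markup in the translation
 * string is never parsed as HTML.
 */
function renderSegments(doc, element, segments) {
  for (const segment of segments) {
    if (segment.hidden) {
      const span = doc.createElement('span')
      span.className = 'govuk-visually-hidden'
      span.textContent = segment.text
      element.appendChild(span)
    } else {
      element.appendChild(doc.createTextNode(segment.text))
    }
  }
}

const segments = splitHiddenSegments('Show[[hidden]] this section[[/hidden]]')
```

In a browser, `renderSegments(document, button, segments)` would then populate the button. A string such as `<img src="dodgy" onerror="...">` comes back as a single plain segment and ends up in a text node.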

Concerns

Lack of flexibility, it only allows injecting visually hidden text
Complexity of injecting the content to ensure we don't remain vulnerable

6. Add a separate translation string for accessible name

We're using visually hidden content to provide a different accessible name. Another route could be to allow a translation to be passed for the aria-label attribute and restrict the visible label to be injected only as text.

Advantages

Restricts the possibility of abuse: we inject text all the time
Focused on solving only the issue we have

Impacts

Allow the accordion to receive translation strings for the aria-label attributes of the two states (show/hide) of its two buttons (all, per section) and use them accordingly

Concerns

Lack of flexibility, only allows to update the accessible name
We ship default translations with our components, meaning we'll need extra logic to override those. For ex. if a new visible label is provided, is it because users wanted to keep the default accessible name, or forgot to update it?

Who needs to work on this

Developers

Who needs to review this

Developers

Done when

[x] A strategy for mitigating XSS has been chosen...
[x] ... and implemented (if necessary)

And an extra option to maybe add in the mix, which I didn't think about this morning:

7. Allow services to provide their own sanitisation function

Before injecting strings in the markup, give the chance to the services to sanitise the HTML that was interpolated by I18n.

Advantages

Enable sanitisation without forcing a way to sanitise

Impacts

Figure a way to globally configure a sanitiser to clean up strings interpolated by I18n.

Concerns

Burden is on the services to provide their sanitisation function, which means it may be forgotten (this could be combined with a lightweight implementation of 2 as a default though).

My quick takes:

In general, XSS potential is entirely dependent on whether a service is injecting user-provided strings directly into HTML without first going through a server-side process.

This isn't an expected use case for any of our own components, nor do I imagine it being a common use case for most others that might use the I18n function, so my overwhelming view is to leave it to the service teams who actually need it to implement, and for us to not burden Frontend with something very few teams will actually need.

I believe there's a separate discussion happening about whether we should be putting visually-hidden text in controls at all (e.g. #2117) due to it preventing voice control users from using them effectively.

1. Leave it to the services

Many server-side web languages come with HTML sanitisation functions out-of-the-box, and these would probably do a better job security-wise than anything we can implement in Frontend. The only potential issue I can imagine is HTML being encoded in a way that our JS can't convert back into usable HTML, but that still feels like an issue for the service team to fix rather than us.

2. Sanitise HTML before injection

If we do anything, this feels like the 'best fit' approach to me, doing just enough without being overly restrictive. Leaving it to an npm package to handle is probably a good idea given we don't have any particular security expertise on the team, to my knowledge.

3. Have a separate API for text injection and HTML injection

Seems like it's probably overcomplicating from both a programming and a documentation perspective (especially character count, which would have 20+ i18n parameters in this situation!)

4. Only allow translations to be passed through JavaScript

Probably a non-starter based on our current trajectory and feedback from our proposal/pre-release.

5. Add a separate interpolation for visually hidden text

@36degrees proposed doing this previously. I'm not extremely opposed to doing this, but it's a single solution that solves a single problem. It requires some design/documentation around custom syntax rules used in our i18n JS that isn't necessarily consistent elsewhere (i.e. Nunjucks), and I don't personally think it scales well if it turns out service developers need other, similar functionality (bold text, links, etc.)

6. Add a separate translation string for accessible name

I think this could be viable but may require some a11y testing to validate first, especially on non-interactive and non-focusable elements.

7. Allow services to provide their own sanitisation function

Same as option 1.

Another issue with 2 is that HTML that's output from JavaScript will be sanitised, but HTML output from Nunjucks will not. So we'd be introducing an inconsistency that users might not appreciate, and could potentially cause issues – it feels slightly dangerous to sometimes but not always sanitise HTML.

My initial post was just a collection of the options we thought about in the dev catch-up on Monday. On my side:

1. Leave it to the services

Not opposed to it given the scope of our library and the strings being translated which limit the risks of user facing content coming in, but see 7.

2. Sanitise HTML before injection

Feels like the "ideal thing to do" and I'd be rooting for it. However, as @36degrees points out, we wouldn't treat all strings the same between Nunjucks and client-side, yet we'd provide the same ...Html options for setting text in components, which will be confusing. Given we want our templates to just be imported without people having to set filters in their Nunjucks environment, I believe we can't really get Nunjucks to sanitise. So it sounds like this option will have to be left out (for now?)

3. Have a separate API for text injection and HTML injection and 4. Only allow translations to be passed through JavaScript

Agreed with @querkmachine for both of them, doesn't sound like a workable way to go.

5. Add a separate interpolation for visually hidden text

Not a fan of this one for two reasons: its potential for conflict with existing formatting syntax, and its narrow scope that won't scale should we have needs for other tags.

6. Add a separate translation string for accessible name

My understanding was that this would only apply to the buttons and not non-interactive elements like the character count, as they're the ones we need to pass HTML to at the moment. But that would make this a very ad-hoc fix and not very future proof should another component require some hidden text outside of an interactive element, so that may need a good deal of testing to make sure we're not breaking anyone's experience. On top of that, the naming conventions in our API would make it a bit weird, as we'd have a <translationKeyAccessibleName>Html option (though we could also use AccessibleName as a marker for passing translation keys).

7. Allow services to provide their own sanitisation function

Technically they can already do this by overriding I18n.prototype.t to post-process the resulting string. I felt that providing a clear API would cover all grounds (like I18n.setSanitiser((html) => {...}) or GovUK.setSanitiser((html) => {...}), depending on where we think the responsibility falls):

using only dev provided strings -> don't worry about sanitising
using user provided strings and your backend sanitises -> sanitise in the backend
using user provided strings and can't sanitise in backend -> set your custom sanitiser

That's something we could document once in the config & i18n API and be done with, regardless of if new components start using the i18n API in places that may

This would also not bind us to a certain way of sanitising HTML (which we could always introduce in a future breaking change).
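To make the idea concrete, here is a rough sketch of what a globally configured sanitiser could look like. The `I18n` class below is a simplified stand-in written for this example, not the real govuk-frontend class; `setSanitiser` is the hypothetical API floated above, and the `%{name}` placeholder syntax is an assumption of the sketch.

```javascript
/**
 * Simplified stand-in for an I18n class. `setSanitiser` is hypothetical.
 */
class I18n {
  // Default: return strings untouched (dev-provided strings need no sanitising)
  static sanitiser = (html) => html

  static setSanitiser(sanitiser) {
    if (typeof sanitiser !== 'function') {
      throw new TypeError('Sanitiser must be a function')
    }
    I18n.sanitiser = sanitiser
  }

  constructor(translations) {
    this.translations = translations
  }

  t(lookupKey, options = {}) {
    const message = this.translations[lookupKey]
    if (typeof message !== 'string') {
      return lookupKey // Fall back to the key when no translation exists
    }

    // Placeholder syntax `%{name}` is an assumption of this sketch
    const interpolated = message.replace(/%\{(\w+)\}/g, (placeholder, name) =>
      Object.prototype.hasOwnProperty.call(options, name)
        ? String(options[name])
        : placeholder
    )

    // Sanitise after interpolation so that injected values are covered too
    return I18n.sanitiser(interpolated)
  }
}

// A service that cannot sanitise in its backend registers its own function.
// Escaping everything is the bluntest possible sanitiser; a real one would
// more likely allow a small set of tags and attributes.
I18n.setSanitiser((html) =>
  html
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')
)

const i18n = new I18n({ hideAllSections: 'Hide all sections for %{personName}' })
const output = i18n.t('hideAllSections', {
  personName: '<img src="dodgy" onerror="alert(1)">'
})
```

Because the sanitiser runs on the interpolated string, the component code that later assigns `output` to innerHTML would not need to change.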

Another issue with 2 is that HTML that's output from JavaScript will be sanitised, but HTML output from Nunjucks will not. So we'd be introducing an inconsistency that users might not appreciate, and could potentially cause issues – it feels slightly dangerous to sometimes but not always sanitise HTML.

For safety reasons, I'd also feel more comfortable with an "untrusted by default" approach

Following Nunjucks, where you can opt-in to render (safe) content with escaping turned off:

{{ params.html | safe if params.html else params.text }}

Admittedly it's difficult to exploit, but we currently allow components to render with escaping turned on, then still execute untrusted JavaScript. Here's a demonstration using the Accordion "Show all" button's .innerHTML:

Executable `<img>` error handler

Example staff-facing "case management" service

1. Accordion renders example `${personName}` placeholder

This is rendered using params.hideAllSectionsHtml in {{ govukAccordion(params) }}

2. Exploit is carefully crafted

Whilst .innerHTML won't execute code, we can inject other HTML that will

<img src="dodgy" onerror="alert('Exploit')">

3. Person changes their name to include the exploit

personName: `Colin Rotherham <img src="dodgy" onerror="alert('Exploit')">`

4. Page is loaded, rendering untrusted content

Notice how escaping is turned on (best practice) but the exploit is still included

<div
  class="govuk-accordion"
  data-module="govuk-accordion"
  data-i18n.hide-all-sections="Colin Rotherham &lt;img src=&quot;dodgy&quot; onerror=&quot;alert(&apos;Exploit&apos;)&quot;&gt;"
>

5. Exploit runs

The <img> src is invalid which runs the onerror handler

(Screenshot: Accordion exploit example showing alert dialog)

Did a bit of digging, as it occurred to me while commenting on Slack that our Nunjucks templates may also lead to HTML landing on the page. We've got a good few ...Html or <something.html> options allowing users to pass custom HTML that our components output as is on the page.

We leave it to the services to sanitise whatever they pass:

You must sanitise any HTML you pass in to Nunjucks macros you’re using in your live application to protect your website against cross-site scripting (XSS) attacks. You can read more about XSS on the MDN website.

So rather than a heavy technical solution, which our options and discussions have been leaning towards so far, it may be a matter of documentation. Escaping was enough when the HTML passed to our component was rendered server-side (though it kind of defeats the point of passing HTML, as all tags will be escaped). For the I18n messages, we'll need to emphasise actual sanitisation (i.e. tidying up tags and attributes to only allow what's necessary). This could be done by pointing to the OWASP documentation about it and mentioning that languages have libraries to take care of that complexity server-side.

Besides that documentation, we may want to do a minor API update and pick a different suffix for the options related to I18n (Translation, or Message maybe). The reason would be to set these options apart from an option like titleHtml in the Panel and help people understand where they need strong sanitisation vs. simply escaping (though if they only escaped the content as a form of sanitisation, options like titleText would do just the same).

@36degrees @colinrotherham We mentioned the aria-label route at the end of the milestone review, and these would be the steps to make it happen, I believe:

1. rename translation keys from ...Html to ...Text as they'll now be injected as text, both on the Accordion (4 keys) and CharacterCount (6 keys)
2. replace injection of content in the CharacterCount to use textContent (2 places, visible and visually hidden announcement)
3. replace injection of content in the Accordion to use textContent (2 places, setExpanded and updateShowAllButton)
4. add 2 new show/hideSectionAriaLabelText (or just ...AriaLabel) options to the macro and in the default JavaScript configuration and update setExpanded and updateShowAllButton to use them for labelling the buttons
5. check that the accessible name gets conveyed properly to assistive tech now we've changed it to use aria-label (feels like a safeish change)

4 is a bit more complex than at first glance. The button's full accessible name is not just "Show/Hide + visually hidden text", but constructed of:

a. the heading itself
b. a comma for pausing the screen reader
c. "Show/Hide" + visually hidden text

That's where having visually hidden content made things easy. Swap just the last part and we were good.

We wouldn't be able to just aria-label the <span> (no presence in the accessibility tree, I believe, especially as it sits inside a <button>), so we'll need to set it on the <button> and:

construct the aria-label matching the current structure
update how the button is constructed to no longer inject the hidden comma inside it and existing texts
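As a rough illustration of the first point, the label could be assembled from the same three parts (heading, comma, show/hide text). The helper names below are hypothetical, and the exact separator would need checking against how screen readers announce the current markup.

```javascript
/**
 * Hypothetical helper: build the <button>'s accessible name from the
 * section heading and the translated show/hide label.
 *
 * @param {string} headingText - Text of the section heading
 * @param {string} actionLabel - Translated label, e.g. "Show this section"
 * @returns {string} Value for the button's aria-label
 */
function buildSectionButtonLabel(headingText, actionLabel) {
  // Heading first, then a comma so screen readers pause, then the action
  return `${headingText.trim()}, ${actionLabel.trim()}`
}

/**
 * Apply the label to the button. setAttribute treats the value as plain
 * text, so markup in a translation string is not parsed as HTML.
 */
function labelSectionButton(button, headingText, actionLabel) {
  button.setAttribute('aria-label', buildSectionButtonLabel(headingText, actionLabel))
}

const label = buildSectionButtonLabel('Writing well for the web', 'Show this section')
```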

That should help assessing the impact of going down that route.

It's worth bearing in mind that the issue of making visually hidden text localisable is not new – it's just the first time we've needed to combine it with text that's added or updated in the JS layer.

We already have a mix of different techniques across the codebase – we use aria-label in:

pagination (labels for each page link aria-label="{{ item.visuallyHiddenText | default("Page " + item.number) }}")
cookie banner (labels a region aria-label="{{ params.ariaLabel | default("Cookie banner") }}")
header (labels a region <nav aria-label="{{ params.navigationLabel | default(menuButtonText) }}")

Meanwhile, we use a span with the govuk-visually-hidden class in:

summary list (after each 'action' link – there's an issue here because it assumes an English grammar so we might need to change how this is done – maybe we'll use aria-label!)
footer (hidden H2 above 'meta' links)
character count (screen reader count message)
error message ("Error:" prefix before message)
warning text (fallback text for '!' icon)

I think we have this mix because we're (mostly) doing what makes sense in the context of each component.

So I don't think there's any reason why if we chose to use aria-label here we'd be committing to using aria-label everywhere going forward.

This also isn't the first time we're allowing users to pass us HTML that we then output on the page, with the expectation that it's up to the service team to have made that HTML 'safe'. We already have guidance about this, although it could arguably be stronger.

I think the main differences here are that:

using HTML is unavoidable in the Accordion if you want to be consistent with the English translation and include a visually hidden part
relatedly, we usually provide a way to pass either Text or Html but in this case there is no option to pass a Text version, so users can't 'opt out' and avoid this being a problem
we are escaping the HTML using the Nunjucks escape filter, but it's being unescaped at the point it's read into the dataset in JS. I think this means that any user-generated content would need to be 'double escaped', and there might be some complexity relating to how Nunjucks escapes strings – I think it generally avoids double escaping unless you explicitly use forceescape?
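The escape-then-unescape round trip in the last point can be modelled in a few lines. This is a simplified sketch: `escapeHtml` approximates what a template engine's escaping does, and `decodeAttributeValue` is a hypothetical stand-in for the browser's HTML parser, covering only the five entities produced here.

```javascript
const ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }
const ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', '#39': "'" }

// Roughly what a template engine's escaping does when writing an attribute
function escapeHtml(value) {
  return value.replace(/[&<>"']/g, (char) => ESCAPES[char])
}

// Simplified model of the browser decoding an attribute value: a single
// pass over the string, handling only the five entities produced above
function decodeAttributeValue(value) {
  return value.replace(/&(amp|lt|gt|quot|#39);/g, (entity, name) => ENTITIES[name])
}

const payload = '<img src="dodgy" onerror="alert(1)">'

// Escaped once in the template: the dataset value is the raw payload again,
// so assigning it to innerHTML would create the <img> element
const escapedOnce = decodeAttributeValue(escapeHtml(payload))

// Escaped twice: the dataset value is still escaped after decoding, so
// assigning it to innerHTML would show the markup as text
const escapedTwice = decodeAttributeValue(escapeHtml(escapeHtml(payload)))
```

Under this model, `escapedOnce` equals `payload` while `escapedTwice` equals `escapeHtml(payload)`, which is why user-generated content would need to be escaped twice before landing in a data attribute.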

Given what you describe around aria-label, it sounds like it's something we're already relying on in some components, so it would be a viable route for us (provided we make the update cautiously to ensure the accessible name remains the same).

I agree it doesn't commit us to using aria-label everywhere, just solves our issue for the Accordion, the same way that it solved things for the other components that use it.

This also isn't the first time we're allowing users to pass us HTML that we then output on the page, with the expectation that it's up to the service team to have made that HTML 'safe'. We already have guidance about this, although it could arguably be stronger.

With the forceescape filter that you describe further down, I think that would put the texts/html injected by JavaScript in the same situation as those injected by Nunjucks on the server. If we're happy with that route for the Nunjucks injected options, and provided we flesh out our documentation a bit more (e.g. present options for escaping, point to the OWASP page...), I'd be inclined to think it'd be a decent solution.

using HTML is unavoidable in the Accordion if you want to be consistent with the English translation and include a visually hidden part

I'm not sure I follow that statement as using aria-label would be a way out of that one

relatedly, we usually provide a way to pass either Text or Html but in this case there is no option to pass a Text version, so users can't 'opt out' and avoid this being a problem

This one would mean we have separate data attributes for the text and html versions. This could be done by suffixing the keys, before adding the plural (e.g. data-i18n.show-section.text, data-i18n.character-over-limit.html.one) and updating how the components render them (at that point, we'd likely want an I18n.prototype.setTranslatedText(element, key, options) to encapsulate that logic). That doubles the available options as well.

we are escaping the HTML using the Nunjucks escape filter, but it's being unescaped at the point it's read into the dataset in JS. I think this means that any user-generated content would need to be 'double escaped', and there might be some complexity relating to how Nunjucks escapes strings – I think it generally avoids double escaping unless you explicitly use forceescape?

Do you mean that our templates would use forceescape and that we'd expect users to run escape (or properly sanitise their elements, with something like DOMPurify for ex) if they pass user-provided content to the helper?

How about:

Giving a go at forceescape to check that it would escape already escaped text
If unsatisfactory, look into using aria-label for the accordion.

The govukPluralisedI18nAttributes macro may be a neat place to quickly try the effect of forceescape in isolation 😊

That forceescape filter looks very promising for putting messages injected in JavaScript at the same level as those injected by Nunjucks templates.

There are 3 situations ahead of us:

the string is not escaped
the string was escaped with | escape
the string was escaped by some other means (eg. in the backend prior to getting in the view).

escape would only run the character escaping on 1 and 3. This is because under the hood, it first checks if the content to escape is already marked as a safe string.
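The difference between the two filters can be modelled outside Nunjucks. The sketch below is a simplified model: `SafeString`, `escapeFilter` and `forceEscapeFilter` are written for this example to mimic the "already marked safe" check, and the character map covers only the five common characters. It is not Nunjucks' own code.

```javascript
// Simplified model of how a template engine tracks strings it considers safe
class SafeString {
  constructor(value) {
    this.value = value
  }

  toString() {
    return this.value
  }
}

const CHAR_MAP = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }

function escapeChars(value) {
  return String(value).replace(/[&<>"']/g, (char) => CHAR_MAP[char])
}

// Model of `| escape`: skips anything already marked safe
function escapeFilter(input) {
  if (input instanceof SafeString) {
    return input
  }
  return new SafeString(escapeChars(input))
}

// Model of `| forceescape`: always escapes, marked safe or not
function forceEscapeFilter(input) {
  return new SafeString(escapeChars(input))
}

// The three situations listed above
const notEscaped = '<b>'
const escapedWithFilter = escapeFilter('<b>') // marked safe
const escapedElsewhere = '&lt;b&gt;' // plain string, e.g. escaped in the backend

const results = {
  escape: [notEscaped, escapedWithFilter, escapedElsewhere].map((s) => String(escapeFilter(s))),
  forceescape: [notEscaped, escapedWithFilter, escapedElsewhere].map((s) => String(forceEscapeFilter(s)))
}
```

In this model the two filters agree on situations 1 and 3 and differ only on situation 2, where `escape` leaves the already-safe string alone and `forceescape` escapes it a second time.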

Switching to forceescape would mean that whichever way they're escaped, the strings provided as attributes would end up escaped. This means that services that would want to provide user-submitted strings as one of their messages have a cheap way to escape by using the | escape filter. Using set, they could even construct strings where only part of the content will be escaped twice, allowing them to still inject hidden elements if necessary:

{%- set contentWithHiddenText %}
    {{ valueFromUser | e }}<span class="govuk-visually-hidden">{{ someSafeDevValue }}</span>
{% endset %}

forceescape also provides a bit of safety in case we receive a string that went through | safe. With only escape, if the string had any quote, it'd close the attribute and wreck the markup, while with forceescape that quote will get escaped and we'll get a proper render. So, all in all, a move I think we should make.

However...

forceescape only helps people using the Nunjucks templates. If you only use our JavaScript on top of HTML generated some other way, the burden of ensuring proper escaping will be down to whatever generates the HTML.
