alphagov / govuk-frontend

GOV.UK Frontend contains the code you need to start building a user interface for government platforms and services.
https://frontend.design-system.service.gov.uk/
MIT License
1.18k stars 321 forks source link

Prevent XSS from translation strings #2892

Closed romaricpascal closed 1 year ago

romaricpascal commented 2 years ago

A bit of context

Components supporting internationalisation currently inject translated content using innerHTML. This is necessary to allow visually hidden text to be injected, like in the accordion "Show/Hide this section" messages. This content can reach them two ways, through:

While innerHTML won't execute <script> tags, there's a wide range of malicious string that could be used for Cross Site scripting should they end up as is in the HTML. With our current components, translations strings should not be user-generated, which limits the risks. However, this is not to say they'll always be, and that there won't be someone using our components with user-generated content for a translation string at some point in the future.

This calls for some thoughts and there are a few routes we could take (pooled from the Dev catch-up)

1. Leave it to the services

Do nothing and trust the services to pass sanitised strings for us to inject. Maybe at some point in the future, introduce some sanitisation to limit HTML that can be injected

Advantages

Impact

Concerns

2. Sanitise HTML before injection

Before injecting the content on the page, sanitise the strings to only allow a specific set of elements and attributes.

Advantages

Impact

Cons

3. Have a separate API for text injection and HTML injection

Similarly to the Nunjucks templates, offer two paths for getting content in the DOM:

We'd encourage the use of the first one, potentially flagging the second one as "dangerous". We'd also explain to users that it's down to them to ensure that only sanitised content ends up in the HTML part.

Advantages

Impacts

Concerns

4. Only allow translations to be passed through JavaScript

Restrict the passing of translation strings via JavaScript, to prevent them from appearing on element, possibly making them harder to modify?

Advantages

Impacts

Concerns

5. Add a separate interpolation for visually hidden text

Ultimately, the reason for using innerHTML is to inject <span class="govuk-visually-hidden">...</span>s to help accessibility, so we may want to focus on doing just that rather than trying to allow any HTML.

This could take the form of a special marker in translation strings to wrap content to be visually hidden (as it won't necessarily only be appended).

Advantages

Impacts

Concerns

6. Add a separate translation string for accessible name

We're using visually hidden content to provide a different accessible name. Another route for it could be to allow a translation to be passed for the aria-label attribute and restrict the visible label to be injected only as text. `` Advantages

Impacts

Concerns

Who needs to work on this

Developers

Who needs to review this

Developers

Done when

romaricpascal commented 2 years ago

And an extra option to maybe add in the mix, which I didn't think about this morning:

7. Allow services to provide their own sanitisation function

Before injecting strings in the markup, give the chance to the services to sanitise the HTML that was interpolated by I18n.

Advantages

Impacts

Concerns

querkmachine commented 2 years ago

My quick takes:

In general, XSS potential is entirely dependent on whether a service is injecting user-provided strings directly into HTML without first going through a server-side process.

This isn't an expected use case for any of our own components, nor do I imagine it being a common use case for most others that might use the I18n function, so my overwhelming view is to leave it to the service teams who actually need it to implement, and for us to not burden Frontend with something very few teams will actually need.

I believe there's a separate discussion happening about whether we should be putting visually-hidden text in controls at all (e.g. #2117) due to it preventing voice control users from using them effectively.

1. Leave it to the services Many server-side web languages come with HTML sanitisation functions out-of-the-box, and these would probably do a better job security-wise than anything we can implement in Frontend. Only potential issue I can imagine is HTML being encoded in a way that our JS can't convert back into useable HTML, but that still feels like an issue for the service team to fix rather than us.

2. Sanitise HTML before injection If we do anything, this feels like the 'best fit' approach to me, doing just enough without being overly restrictive. Leaving it to an npm package to handle is probably a good idea given we don't have any particular security expertise on the team, to my knowledge.

3. Have a separate API for text injection and HTML injection Seems like it's probably overcomplicating from both a programming and a documentation perspective (especially character count, which would have 20+ i18n parameters in this situation!)

4. Only allow translations to be passed through JavaScript Probably a non-starter based on our current trajectory and feedback from our proposal/pre-release.

5. Add a separate interpolation for visually hidden text @36degrees proposed doing this previously. I'm not extremely opposed to doing this, but it's a single solution that solves a single problem. It requires some design/documentation around custom syntax rules used in our i18n JS that isn't necessarily consistent elsewhere (i.e. Nunjucks), and I don't personally think it scales well if it turns out service developers need other, similar functionality (bold text, links, etc.)

6. Add a separate translation string for accessible name I think this could be viable but may require some a11y testing to validate first, especially on non-interactive and non-focusable elements.

7. Allow services to provide their own sanitisation function Same as option 1.

36degrees commented 2 years ago

Another issue with 2 is that HTML that's output from in JavaScript will be sanitised, but HTML output from Nunjucks will not. So we'd be introducing an inconsistency that users might not appreciate, and could potentially cause issues – it feels slightly dangerous to sometimes but not always sanitise HTML.

romaricpascal commented 2 years ago

My initial post was just a collection of the options we thought about in the dev catch-up on Monday. On my side:

1. Leave it to the services

Not opposed to it given the scope of our library and the strings being translated which limit the risks of user facing content coming in, but see 7.

2. Sanitise HTML before injection

Feels like the "ideal thing to do" and I'd be rooting for it. However, as @36degrees points out, we wouldn't treat all strings the same between Nunjucks and client-side, but provide the same ...Html options for setting text in components which will be confusing. I believe given we want our templates to just be imported without having people to set filters in their Nunjucks environment, we can't really get Nunjucks to sanitise. So sounds like this option will have to be left out (for now?)

3. Have a separate API for text injection and HTML injection and 4. Only allow translations to be passed through JavaScript

Agreed with @querkmachine for both of them, doesn't sound like a workable way to go.

5. Add a separate interpolation for visually hidden text

Not a fan of this one for two reasons: it's potential for conflict existing formatting syntax and it's narrow scope that won't scale should we have needs for other tags.

6. Add a separate translation string for accessible name

My understanding was that this would only apply to the buttons and not non-interactive elements like the characted count, as they're the ones we need to pass HTML to at the moment. But that would make this a very ad-hoc fix and not very future proof should another component require some hidden text outside of an interactive element, so that may need a good deal of testing to make sure we're not breaking anyone's experience. On top of it, the naming conventions in our API would make it a bit weird as we'd have an <translationKeyAccessibleName>Html option (though we could also use AccessibleName as a marker for passing translation keys.

7. Allow services to provide their own sanitisation function

Technically they can already do this by overriding I18n.prototype.t to post-process the resulting string. I felt that providing a clear API would cover all grounds (like I18n.setSanitiser((html) => {...}) or GovUK.setSanitiser((html) -> {...} depending on where we think the responsibility falls) :

That's something we could document once in the config & i18n API and be done with, regardless of if new components start using the i18n API in places that may

This would also not bind us to a certain way of sanitising HTML (which we could always introduce in a future breaking change).

colinrotherham commented 2 years ago

Another issue with 2 is that HTML that's output from in JavaScript will be sanitised, but HTML output from Nunjucks will not. So we'd be introducing an inconsistency that users might not appreciate, and could potentially cause issues – it feels slightly dangerous to sometimes but not always sanitise HTML.

For safety reasons, I'd also feel more comfortable with an "untrusted by default" approach

Following Nunjucks, where you can opt-in to render (safe) content with escaping turned off:

{{ params.html | safe if params.html else params.text }}

Admittedly it's difficult to exploit, but we currently allow components to render, escaping turned on, then still execute untrusted JavaScript. Here's one for Accordion "Show all" button .innerHTML demonstrated:

Executable <img> error handler

Example staff-facing "case management" service

1. Accordion renders example ${personName} placeholder

This is rendered using params.hideAllSectionsHtml in {{ govukAccordion(params) }}

Accordion exploit example

2. Exploit is carefully crafted

Whilst .innerHTML won't execute code, we can inject other HTML that will

<img src="dodgy" onerror="alert('Exploit')">

3. Person changes their name to include the exploit

personName: `Colin Rotherham <img src="dodgy" onerror="alert('Exploit')">`

4. Page is loaded, rendering untrusted content

Notice how escaping is turned on (best practice) but the exploit is still included

<div
  class="govuk-accordion"
  data-module="govuk-accordion"
  data-i18n.hide-all-sections="Colin Rotherham &lt;img src=&quot;dodgy&quot; onerror=&quot;alert(&apos;Exploit&apos;)&quot;&gt;"
>

4. Exploit runs

The <img> src is invalid which runs the onerror handler

Accordion exploit example showing alert dialog
romaricpascal commented 2 years ago

Did a bit of digging, as it occured to me while commenting on Slack that our Nunjucks template may also lead to HTML landing on the page. We got a good few ...Html or <something.html> options allowing users to pass custom HTML that our components output as is on the page.

We leave it to the services to sanitise whatever they pass:

You must sanitise any HTML you pass in to Nunjucks macros you’re using in your live application to protect your website against cross-site scripting (XSS) attacks. You can read more about XSS on the MDN website.

So rather than a heavy technical solution, which our options and discussions have been leaning towards so far, it may be a matter of documentation. Escaping was enough when the HTML passed to our component was rendered server-side (though kind-of breaks the point of passing HTML as all tags will be escaped). For the I18n messages, we'll need to emphasise on actual sanitisation (ie. tidy up of tags and attributes to only allow what's necessary). This could be by pointing to the OWASP documentation about it and mention that languages have libraries to take care of that complexity server side.

Besides that documentation, we may want to do a minor API update and pick a different suffix for the options related to I18n (Translation, or Message maybe). Reason for it would be to set these options apart from an option like titleHtml in the Panel and help people understand where they need strong sanitisation vs. simply escaping (though if they only escaped the content as a form of sanitisiation options, like titleText would do just the same).

romaricpascal commented 2 years ago

@36degrees @colinrotherham We mentionned the aria-label route at then end of the milestone review, and this would be the steps to make it happen I believe:

  1. rename translation keys from ...Html to ...Text as they'll now be injected by text, both on the Accordion (4 keys) and CharacterCount (6 keys)
  2. replace injection of content in the CharacterCount to use textContent (2 places, visible and visually hidden announcement)
  3. replace injection of content in the Accordion to use textContent (2 places, setExpanded and updateShowAllButton
  4. add 2 new show/hideSectionAriaLabelText (or just ...AriaLabel) options to the macro and in the default JavaScript configuration and update setExpanded and updateShowAllButton to use them for labelling the buttons
  5. check that the accessible name gets conveyed properly to assistive tech now we've changed it to use aria-label (feels like a safeish change

4 is a bit more complex than at first glance. The button's full accessible name is not just "Show/Hide + visually hidden text", but constructed of:

a. the heading itself b. a coma for pausing the screenreader c. "Show/Hide" + visually hidden text

That's where having visually hidden content made thing easy. Swap just the last part and we were good.

We wouldn't be able to just aria-label the <span> (no presence in the accessibility tree, I believe, especially as it sits inside a <button>), so we'll need to set it on the <button> and:

That should help assessing the impact of going down that route.

36degrees commented 2 years ago

It's worth bearing in mind that the issue of making visually hidden text localisable is not new – it's just the first time we've needed to combine it with text that's added or updated in the JS layer.

We already have a mix of different techniques across the codebase – we use aria-label in:

Meanwhile, we use a span with the govuk-visually-hidden class in:

I think we have this mix because we're (mostly) doing what makes sense in the context of each component.

So I don't think there's any reason why if we chose to use aria-label here we'd be committing to using aria-label everywhere going forward.


This also isn't the first time we're allowing users to pass us HTML that we then output on the page, with the expectation that it's up to the service team to have made that HTML 'safe'. We already have guidance about this, although it could arguably be stronger.

I think the main differences here are that:

romaricpascal commented 2 years ago

Given what you describe around aria-label, it sounds like it's something we're already relying on in some components, so would be a viable route for us (provided we make the update cautiously to ensure the accessible name remain the same)

I agree it doesn't commit us to using aria-label everywhere, just solves our issue for the Accordion, the same way that it solved things for the other components that use it.


This also isn't the first time we're allowing users to pass us HTML that we then output on the page, with the expectation that it's up to the service team to have made that HTML 'safe'. We already have guidance about this, although it could arguably be stronger.

With the forceescape filter that you describe further down, I think that would put the texts/html injected by JavaScript in the same situation as those injected by Nunjucks on the server. If we're happy with that route for the Nunjucks injected options, and provided we flesh out our documentation a bit more (eg. present options for escaping, point to the OWASP page...), I'd be enclined to think it'd be a decent solution.

  • using HTML in unavoidable in the Accordion if you want to be consistent with the English translation and include a visually hidden part

I'm not sure I follow that statement as using aria-label would be a way out of that one

  • relatedly, we usually provide a way to pass either Text or Html but in this case there is no option to pass a Text version, so users can't 'opt out' and avoid this being a problem

This one would mean we have separate data attributes for the text and html version. This could be done by suffixing the keys, before adding the plural (eg. data-i18n.show-section.text, data-i18n.character-over-limit.html.one) and updating how the components render them (at that point, we'd likely want a I18n.prototype.setTranslatedText(element, key, options) to encapsulate that logic. That doubles the available options as well

  • we are escaping the HTML using the Nunjucks escape filter, but it's being unescaped at the point it's read into the dataset in JS. I think this means that any user-generated content would need to be 'double escaped', and there might be some complexity relating to how Nunjucks escapes strings – I think it generally avoids double escaping unless you explicitly use forceescape?

Do you mean that our templates would use forceescape and that we'd expect users to run escape (or properly sanitise their elements, with something like DOMPurify for ex) if they pass user-provided content to the helper?


How about:

  1. Giving a go at forceescape to check that it would escape already escaped text
  2. If unsatisfactory, look into using aria-label for the accordion.
romaricpascal commented 2 years ago

The govukPluralisedI18nAttributes macro may be a neat place to quickly try the effect of forceescape in isolation 😊

romaricpascal commented 2 years ago

That forceescape filter looks very promising for putting messages injected in JavaScript at the same level as those injected by Nunjucks templates.

There are 3 situations ahead of us:

  1. the string is not escaped
  2. the string was escaped with | escape
  3. the string was escaped by some other means (eg. in the backend prior to getting in the view).

escape would only run the character escaping on 1 and 3. This is because under the hood, it first checks if the content to escape is already marked as a safe string.

Switching to forceescape would mean that whichever way they're escaped, the strings provided as attributes would end up escaped. This means that services that would want to provide user-submitted strings as one of their messages have a cheap way to escape by using the | escape filter. Using set, they could even construct strings where only part of the content will be escaped twice, allowing them to still inject hidden elements if necessary:

{%- set contentWithHiddenText %}
    {{ valueFromUser | e }}<span class="govuk-visually-hidden">{{ someSafeDevValue}}</span>
{% endset %}

forceescape also provides a bit of safety in case we receive a string that went throught | safe. With only escape if the string had any quote, it'd close the attribute and wreck the markup, while with forceescape that quote will get escaped and we'll get proper render. So, all in all, a move I think we should make.

However...

forceescape only helps people using the Nunjucks templates. If you only use our JavaScript on top of HTML generated some other way, the burden of ensuring proper escaping will be down to whatever generates the HTML.