Unescaped HTML in translation text

exterkamp commented 5 years ago

These translated strings contain unescaped HTML which makes the translation linter unhappy.

[user-scalable="no"] is not used in the <meta name="viewport"> element and the [maximum-scale] attribute is not less than 5.
The document does not use <meta http-equiv="refresh">
[user-scalable="no"] is used in the <meta name="viewport"> element or the [maximum-scale] attribute is less than 5.
<input type="image"> elements have [alt] text
The document uses <meta http-equiv="refresh">
<input type="image"> elements do not have [alt] text

brendankenny commented 5 years ago

This is already a problem in some of our strings. The translators usually kept the tags unchanged, but sometimes treated things like attribute values as you would the alt tag on an image and translated it. e.g. <link rel=preload> became <link rel=предварительную загрузку> in Russian.

brendankenny commented 5 years ago

Apparently the "proper" way to do this is to have the html inserted as an ICU text replacement. This makes it guaranteed that the translators won't translate them. It's also nice to give an example to the translators of what will be in the replacement...in this case, since it's just a literal string, the example is exactly what will be put in the replacement.

We didn't see another easy way to be able to mark within a string "don't translate this section", at least with the format we've chosen (the XML form has a little more flexibility, but come on).

The replacement approach has pretty rotten UX in our current i18n setup, however, since the substrings are just literals, not used as markup, and they aren't dynamic. The string and the html substring end up separated and it becomes difficult to read.

We could:

embrace this and just try to remember to do string replacement if we need a substring untranslated
always remember to put something in the description to leave part of the message untranslated (I believe we've done that in a few descriptions already)
add support for some kind of demarcator to i18n.js that automatically extracts substrings and adds them to the replacements that will be done when the formatted string is retrieved (kind of Function.p.bind for our strings). We could automatically add a mention of the extracted substring to the description in that case?

patrickhulce commented 5 years ago

add support for some kind of demarcator to i18n.js that automatically extracts substrings and adds them to the replacements that will be done when the formatted string is retrieved

This was my thought as well.

always remember to put something in the description to leave part of the message untranslated (I believe we've done that in a few descriptions already)

If there are only a few though, this seems far simpler. We could add something in our checker that makes sure anything with <html> has a DO NOT TRANSLATE X phrase in its description?

brendankenny commented 5 years ago

There are a decent number and they are multiplying. We also have a few other substrings we don't want translated, though I forget what we decided about localizing things like "First Meaningful Paint".

I'm also not sure how effective adding a "Do not translate" line to the description is...do the translators always catch it and understand exactly what we don't want translated?

On the other other hand, it'll be hard to always remember to put in double backticked hashtags or whatever to mark do-not-translate blocks, and it's difficult to remember to check for things we missed in the translations that come back for all the languages we don't speak :)

exterkamp commented 5 years ago

+1 the idea of demarcation. I think leaving it in the descriptions will lead to sometimes translated, sometimes not, and it will be especially hard to keep track if we are having more and more html-in-text strings. We could also statically analyze if <tags> are always enclosed in an escaped demarcation to avoid forgetting?

exterkamp commented 5 years ago

RFC @brendankenny @patrickhulce @Hoten @paulirish

I started to look into this more. And I think we need to really pursue this. We should stop sending text like URLs to be translated, it 1.) causes bugs #9000 2.) causes massive no-op round trips (axe 3.1) #7720 3.) is against the tc/ best practice :wink:

Under the hood in GoogleLand we use the Chrome message.json format to convert into xtb and that message format contains a method of placeholdering that can be used to replace things like URLs and control strings that shouldn't be translated. @brendankenny and I were talking about this, and we merged that format with the UIStrings object to make something like this:

UIStrings = {
{
 // Message with nothing special.
 message: 'Example Message.',
 description: 'Description of the message.',
},
{
 // Message with a URL and markdown that shouldn't be sent for translation.
 message: 'Example Message. $LINK_START$Learn more$LINK_END$.',
 description: 'Description of the message.',
 placeholders: {
  link_start: {
   content: '[',
  },
  link_end: {
   content: '](https://developers.google.com/web/tools/lighthouse/audit/example)'
  }
 },
},
{
 // Message with some HTML tags that we use as an example in our markdown.
 message: 'Example Message, $HTML$ is some HTML.',
 description: 'Description of the message.',
 placeholders: {
  html: {
   content: '`<link rel=preload>`',
  },
 },
},
{
 // Message with time in ms replacement.
 message: 'Example Message, rendered in: $TIME_IN_MILLISECONDS$ ms.',
 description: 'Description of the message.',
 placeholders: {
  time_in_milliseconds: {
   content: '{timeInMs, number, milliseconds}',
  },
 },
},
{
 // More canonical replacement message with time in ms replacement.
 // But probably not what would work best for us?  Or maybe?
 message: 'Example Message, rendered in: $TIME_IN_MILLISECONDS$ ms.',
 description: 'Description of the message.',
 placeholders: {
  time_in_milliseconds: {
   content: '$1',
   example: '53'
  },
 },
},
...

This would allow us much more flexibility to change our placeholders, markdown syntax, and strings independent of tc/ and would allow us to properly utilize the placeholder syntax and stop sending non-translated words to tc/.

Specifics:

LINK_START & LINK_END: Our markdown [TEXT](LINK) is not ideal for translation, and we really want to send just TEXT to be translated. I have found prior art for <a href=$1>TEXT</a> style, however the tc/ docs suggest using LINK_START and LINK_END or something of that nature to define the control block around the content for less confusion. We could do something like [TEXT]($URL$) instead of $LINK_START$TEXT$LINK_END$ but that would allow things like #9000 to happen again.
"ICU Syntax": We use ICU Syntax for time e.g. {timeInMs, number, milliseconds}. But this isn't considered ICU syntax in tc/, it's really just a replacement. And we have to do pre-processing to make it something more like {timeInMs} anyway. So this should be replaced with placeholders imo. Whether or not this should also be replaced with the $1 syntax and a real placeholder with examples is up for debate. We may want to reach out to the l10n/i18n teams for advisement on this.

patrickhulce commented 5 years ago

Thanks for delineating all the cases and nailing this down @exterkamp!

Overall I'm not sure why we need to add the overhead of explicit objects to our strings, but there's no doubt a lot you've thought about already that we haven't so hopefully questions will help :)

I'm not super clear on the last two. Why do we need to get rid of our ICU syntax? I get that we might need to post-process it for Goog but it feels like a step down to go from industry standard syntax to a bespoke $ thing.
I feel like I'll be alone on this one, buuuuuuuuut :) our two main problems are []() links and <html> snippets. Both of those seem very easily detectable and could be automatically converted into whatever $ format we need for translation console and spliced back together on the return journey at commit time back here. Where there specific cases that don't work with this approach that forced the usage of the structure you've got here?

exterkamp commented 5 years ago

I think to me, the advantage of the objects is that we are closer to a chrome standard format for i18n and that it allows simple placeholder syntax. It isn't strictly required, we can hack together our own replacement conversion and have the strings be processed into this format right before going into en-US, but I think that would be messier (see "2.").

I think that our strings have gotten sufficiently complex that we should start to think about encapsulating these as objects and converging on this messages.json syntax. We already have messages, descriptions, replacements, ICU, non-translated static blocks, markdown control syntax; honestly, I think that this could use some more structure, esp if we start to follow the placeholder rules better and stop sending non-translated text through.

Specific answers to your questions:

We don't need to get rid of ICU, we should keep it. It just doesn't need to go to the translators, it is control text that is never translated. Hence being placeholder'd and then re-added. ICU will still be used, but we shouldn't send this non-translation ICU to the translators (in fact we don't it is removed in pre-processing). This is more in line with the third example "Message with time in ms replacement." than the fourth. The fourth example was just a thought, but I don't think that's even how it should be done.
This is reasonable. We can continue down this path and do custom extraction/replacement. Off hand I can't think of anything complex that would need more than a detection -> conversion->translation->re-add type deal. But this is exactly what the message.json syntax was made for, and is already a solved problem, and I think that it will be easier to maintain this moving forward if we get closer to the standard, not farther away.

patrickhulce commented 5 years ago

All good points! I think my hesitation comes down to it being a Chrome standard I'm less familiar with compared to ICU, so not a very good reason. I only have a total of 1 prior experience with localization anyway so that's not much of a corpus to go on either 😆

SGTM! Just a bummer it'll be a bit extra weight to add strings.

brendankenny commented 5 years ago

I haven't looked at #9114 yet, but it does feel like we need to get to something like this in the end. I don't think it's worth it to keep special-casing stuff (and that's only going to get worse as we notice more things that need to be not translated, or handled in a special way, or taken out before going to tc and then restored, etc), but otoh we need the solution to be as robust (or more) as the current approach if we're going to make the move.

The concerns I had left after working out the above examples with @exterkamp were

how does the ICU syntax interoperate with this json format? It's not clear to me how to write the strings that need both kinds of replacement.
going from named property replacements to positional ones is kind of awful. How do we make that better?
(kind of a superset of the two above): how do we make a good pipeline that handles both the replacements above and the ICU replacements while still being usable and lightweight for the caller?

the good news is we already have a large existing corpus for testing (all our i18n strings), so any new solution won't be able to dodge any of these problems :)

brendankenny commented 5 years ago

oh, more things:

are all replacements going to move to this format except plural and select? If so, how are we going to provide information on number formatting, etc
how much bigger are our already quite large language files going to get

exterkamp commented 5 years ago

ICU Plural and Select can but put in as regular, no changes. All other (I'll call it non-google-ICU) ICU is to use the new placeholder syntax.
Why is that awful? I don't see it.
Maybe updates to #9114 address this, you can see more of it there. This has a few special cases that I might need to still work out, e.g. Replacement in ICU, ICU in replacement, but replacement next to ICU should be fine within the same string.
Yes all static replacements should move to this for consistency and for guarantees of non-translation
1. I'm not sure about dynamic replacements, i.e. error message text. That is mentioned and handled in two different examples in #9114 with explanationGender and explanationGender2
Yes

brendankenny commented 5 years ago

ICU Plural and Select can but put in as regular, no changes. All other (I'll call it non-google-ICU) ICU is to use the new placeholder syntax.

Why is that awful? I don't see it.

Maybe updates to #9114 address this, you can see more of it there. This has a few special cases that I might need to still work out, e.g. Replacement in ICU, ICU in replacement, but replacement next to ICU should be fine within the same string.

I think I see what you're proposing now with the change in https://github.com/GoogleChrome/lighthouse/pull/9114/commits/74e11c15bf7c7149a5863ae85b0e3a8b1dcebea8#diff-0b037964dfdf09e1e312fb06ffb7f8bcR28

description: {
  message: '{link_start}Learn More!!!{link_end}. This audit took {milliseconds} ms.',
  placeholders: {
    link_start: '[->',
    link_end: '](https://developers.google.com/web/tools/lighthouse/audits/minify-css)',
    milliseconds: '{timeInMs, number, milliseconds}',
  }
}

If real ICU replacements are always inside google placeholders, it seems like it takes care of many issues:

it won't even be possible to accidentally translate them in tc
it allows interoperation between the two systems
we don't even have to think about positional arguments since we'll take care of those automatically and our (real) ICU formatting can continue as before

We will have to figure out how the placeholder example will work in a way that's useful for translators, but I can see you're looking at that :)

exterkamp commented 5 years ago

Just a note: Google translation will never* accidentally translate an ICU value i.e. {variable} that is part of the accepted ICU syntax. Its just that {variable, number, second} is not part of the Google ICU accepted syntax. So we can send {milliseconds} as part of the string to tc/ (we do this now); however, I think it is better to send something consistent like $millisecond$ for this kind of replacement to remove all doubt from this half-ICU half-replacement syntax.

*they claim

exterkamp commented 5 years ago

Closing, since #9114 fixed this with code spans being automatically placeholder'd.

GoogleChrome / lighthouse

Unescaped HTML in translation text #6945