Closed exterkamp closed 5 years ago
This is already a problem in some of our strings. The translators usually kept the tags unchanged, but sometimes treated things like attribute values as you would the alt
tag on an image and translated it. e.g. <link rel=preload>
became <link rel=предварительную загрузку>
in Russian.
Apparently the "proper" way to do this is to have the html inserted as an ICU text replacement. This makes it guaranteed that the translators won't translate them. It's also nice to give an example to the translators of what will be in the replacement...in this case, since it's just a literal string, the example is exactly what will be put in the replacement.
We didn't see another easy way to be able to mark within a string "don't translate this section", at least with the format we've chosen (the XML form has a little more flexibility, but come on).
The replacement approach has pretty rotten UX in our current i18n setup, however, since the substrings are just literals, not used as markup, and they aren't dynamic. The string and the html substring end up separated and it becomes difficult to read.
We could:
i18n.js
that automatically extracts substrings and adds them to the replacements that will be done when the formatted string is retrieved (kind of Function.p.bind
for our strings). We could automatically add a mention of the extracted substring to the description in that case?add support for some kind of demarcator to i18n.js that automatically extracts substrings and adds them to the replacements that will be done when the formatted string is retrieved
This was my thought as well.
always remember to put something in the description to leave part of the message untranslated (I believe we've done that in a few descriptions already)
If there are only a few though, this seems far simpler. We could add something in our checker that makes sure anything with <html>
has a DO NOT TRANSLATE X
phrase in its description?
There are a decent number and they are multiplying. We also have a few other substrings we don't want translated, though I forget what we decided about localizing things like "First Meaningful Paint".
I'm also not sure how effective adding a "Do not translate" line to the description is...do the translators always catch it and understand exactly what we don't want translated?
On the other other hand, it'll be hard to always remember to put in double backticked hashtags or whatever to mark do-not-translate blocks, and it's difficult to remember to check for things we missed in the translations that come back for all the languages we don't speak :)
+1 the idea of demarcation. I think leaving it in the descriptions will lead to sometimes translated, sometimes not, and it will be especially hard to keep track if we are having more and more html-in-text strings. We could also statically analyze if <tags>
are always enclosed in an escaped demarcation to avoid forgetting?
RFC @brendankenny @patrickhulce @Hoten @paulirish
I started to look into this more. And I think we need to really pursue this. We should stop sending text like URLs to be translated, it 1.) causes bugs #9000 2.) causes massive no-op round trips (axe 3.1) #7720 3.) is against the tc/ best practice :wink:
Under the hood in GoogleLand we use the Chrome message.json format to convert into xtb
and that message format contains a method of placeholdering that can be used to replace things like URLs and control strings that shouldn't be translated. @brendankenny and I were talking about this, and we merged that format with the UIStrings object to make something like this:
UIStrings = {
{
// Message with nothing special.
message: 'Example Message.',
description: 'Description of the message.',
},
{
// Message with a URL and markdown that shouldn't be sent for translation.
message: 'Example Message. $LINK_START$Learn more$LINK_END$.',
description: 'Description of the message.',
placeholders: {
link_start: {
content: '[',
},
link_end: {
content: '](https://developers.google.com/web/tools/lighthouse/audit/example)'
}
},
},
{
// Message with some HTML tags that we use as an example in our markdown.
message: 'Example Message, $HTML$ is some HTML.',
description: 'Description of the message.',
placeholders: {
html: {
content: '`<link rel=preload>`',
},
},
},
{
// Message with time in ms replacement.
message: 'Example Message, rendered in: $TIME_IN_MILLISECONDS$ ms.',
description: 'Description of the message.',
placeholders: {
time_in_milliseconds: {
content: '{timeInMs, number, milliseconds}',
},
},
},
{
// More canonical replacement message with time in ms replacement.
// But probably not what would work best for us? Or maybe?
message: 'Example Message, rendered in: $TIME_IN_MILLISECONDS$ ms.',
description: 'Description of the message.',
placeholders: {
time_in_milliseconds: {
content: '$1',
example: '53'
},
},
},
...
This would allow us much more flexibility to change our placeholders, markdown syntax, and strings independent of tc/ and would allow us to properly utilize the placeholder syntax and stop sending non-translated words to tc/.
Specifics:
[TEXT](LINK)
is not ideal for translation, and we really want to send just TEXT
to be translated. I have found prior art for <a href=$1>TEXT</a>
style, however the tc/ docs suggest using LINK_START and LINK_END or something of that nature to define the control block around the content for less confusion. We could do something like [TEXT]($URL$)
instead of $LINK_START$TEXT$LINK_END$
but that would allow things like #9000 to happen again.{timeInMs, number, milliseconds}
. But this isn't considered ICU syntax in tc/, it's really just a replacement. And we have to do pre-processing to make it something more like {timeInMs}
anyway. So this should be replaced with placeholders imo. Whether or not this should also be replaced with the $1
syntax and a real placeholder with examples is up for debate. We may want to reach out to the l10n/i18n teams for advisement on this.Thanks for delineating all the cases and nailing this down @exterkamp!
Overall I'm not sure why we need to add the overhead of explicit objects to our strings, but there's no doubt a lot you've thought about already that we haven't so hopefully questions will help :)
$
thing.[]()
links and <html>
snippets. Both of those seem very easily detectable and could be automatically converted into whatever $
format we need for translation console and spliced back together on the return journey at commit time back here. Where there specific cases that don't work with this approach that forced the usage of the structure you've got here?I think to me, the advantage of the objects is that we are closer to a chrome standard format for i18n and that it allows simple placeholder syntax. It isn't strictly required, we can hack together our own replacement conversion and have the strings be processed into this format right before going into en-US, but I think that would be messier (see "2.").
I think that our strings have gotten sufficiently complex that we should start to think about encapsulating these as objects and converging on this messages.json
syntax. We already have messages, descriptions, replacements, ICU, non-translated static blocks, markdown control syntax; honestly, I think that this could use some more structure, esp if we start to follow the placeholder rules better and stop sending non-translated text through.
Specific answers to your questions:
detection -> conversion->translation->re-add
type deal. But this is exactly what the message.json
syntax was made for, and is already a solved problem, and I think that it will be easier to maintain this moving forward if we get closer to the standard, not farther away.All good points! I think my hesitation comes down to it being a Chrome standard I'm less familiar with compared to ICU, so not a very good reason. I only have a total of 1 prior experience with localization anyway so that's not much of a corpus to go on either 😆
SGTM! Just a bummer it'll be a bit extra weight to add strings.
I haven't looked at #9114 yet, but it does feel like we need to get to something like this in the end. I don't think it's worth it to keep special-casing stuff (and that's only going to get worse as we notice more things that need to be not translated, or handled in a special way, or taken out before going to tc and then restored, etc), but otoh we need the solution to be as robust (or more) as the current approach if we're going to make the move.
The concerns I had left after working out the above examples with @exterkamp were
the good news is we already have a large existing corpus for testing (all our i18n strings), so any new solution won't be able to dodge any of these problems :)
oh, more things:
plural
and select
? If so, how are we going to provide information on number formatting, etcnon-google-ICU
) ICU is to use the new placeholder syntax.explanationGender
and explanationGender2
- ICU Plural and Select can but put in as regular, no changes. All other (I'll call it
non-google-ICU
) ICU is to use the new placeholder syntax.- Why is that awful? I don't see it.
- Maybe updates to #9114 address this, you can see more of it there. This has a few special cases that I might need to still work out, e.g. Replacement in ICU, ICU in replacement, but replacement next to ICU should be fine within the same string.
I think I see what you're proposing now with the change in https://github.com/GoogleChrome/lighthouse/pull/9114/commits/74e11c15bf7c7149a5863ae85b0e3a8b1dcebea8#diff-0b037964dfdf09e1e312fb06ffb7f8bcR28
description: {
message: '{link_start}Learn More!!!{link_end}. This audit took {milliseconds} ms.',
placeholders: {
link_start: '[->',
link_end: '](https://developers.google.com/web/tools/lighthouse/audits/minify-css)',
milliseconds: '{timeInMs, number, milliseconds}',
}
}
If real ICU replacements are always inside google placeholders, it seems like it takes care of many issues:
We will have to figure out how the placeholder example
will work in a way that's useful for translators, but I can see you're looking at that :)
Just a note: Google translation will never* accidentally translate an ICU value i.e. {variable}
that is part of the accepted ICU syntax. Its just that {variable, number, second}
is not part of the Google ICU accepted syntax. So we can send {milliseconds}
as part of the string to tc/ (we do this now); however, I think it is better to send something consistent like $millisecond$
for this kind of replacement to remove all doubt from this half-ICU half-replacement syntax.
*they claim
Closing, since #9114 fixed this with code spans
being automatically placeholder'd.
These translated strings contain unescaped HTML which makes the translation linter unhappy.
[user-scalable="no"]
is not used in the<meta name="viewport">
element and the[maximum-scale]
attribute is not less than 5.<meta http-equiv="refresh">
[user-scalable="no"]
is used in the<meta name="viewport">
element or the[maximum-scale]
attribute is less than 5.<input type="image">
elements have[alt]
text<meta http-equiv="refresh">
<input type="image">
elements do not have[alt]
text