QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Specification and workflow needed: How to translate/localize links within the Qubes (doc) website? #3547

Open tokideveloper opened 6 years ago

tokideveloper commented 6 years ago

In order to specify a translation workflow/guidelines, we need to specify how to translate/localize links within the Qubes OS (doc) website. In this specific issue, I would like to discuss ways to do so.

Here are some key questions (checked if solved):

(*) A fragment is the part after a hash sign ("#"), here: leading to a specific header on the linked page.

(**) "en" seems to be the currently used one. See the redirect_from lists in the YAML front matters in the Markdown files.


Related issues:

#2824

#1452

#1333

tokideveloper commented 6 years ago

Concerning automated link translation, see this idea.

Any comments?

andrewdavidwong commented 6 years ago

> Use relative links instead of absolute ones?

I recently updated the documentation guidelines on this point: https://www.qubes-os.org/doc/doc-guidelines/#markdown-conventions

(In short: Yes, please use relative instead of absolute paths.)

tokideveloper commented 6 years ago

I'm not sure we are talking about the same thing. Maybe I used the term "relative link" ambiguously.

By "relative link" I mean a "relative path" (not a URL), in the sense that the path does not begin with a slash /, like local paths on a Linux machine. URLs, however, are always absolute in my understanding.

For example, while https://www.qubes-os.org/doc/doc-guidelines/ and /doc/doc-guidelines/ are absolute paths by my definition, the paths ../, ../../intro/ and intro/ are relative ones. (If these relative links appeared on the page /doc/doc-guidelines/, they would lead to /doc/, /intro/ and /doc/doc-guidelines/intro/ respectively. See my prototype.)
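The resolution of these paths can be checked with a small sketch. It uses Python's standard `urllib.parse.urljoin`; the helper name `resolve` is only an illustration and not part of any Qubes tooling.

```python
from urllib.parse import urljoin

# The page on which the links are assumed to appear.
BASE = "https://www.qubes-os.org/doc/doc-guidelines/"


def resolve(link, base=BASE):
    """Resolve a link the way a browser would, relative to the base page."""
    return urljoin(base, link)


if __name__ == "__main__":
    for link in ("../", "../../intro/", "intro/", "/doc/"):
        print(f"{link:15} -> {resolve(link)}")
```

Note that the last link, which starts with a slash, ignores the current page entirely. This is why it would need a language prefix in a translated copy, while the first three would not.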

andrewdavidwong commented 6 years ago

Oh, I see. Yes, I think we're talking about two different things. My main concern is to avoid https://www.qubes-os.org/doc/ in favor of /doc/, since the former prevents easy navigation on a locally-served copy of the website.

tokideveloper commented 6 years ago

I see. Thank you.

So, now I want to discuss the use of

Relative Paths

Advantages
Disadvantages

Absolute Paths

Advantages
Disadvantages

"Prefixed Paths"

Advantages
Disadvantages

(*) I tried to set a variable langprefix within the Liquid code of my langswitch prototype, hoping that the variable would exist when printing the {{ content }}, but it does not seem to work.

Hint: When I tried out "prefixed paths", some strange behaviour appeared (paths with a literally leading slash in the source MD file became relative ones in the produced HTML files). So, one should test "prefixed paths" with all possibilities of creating links in advance.

andrewdavidwong commented 6 years ago

Why would absolute paths have to be localized manually when the others don't?

tokideveloper commented 6 years ago

Let's say that, for example, the page /doc/doc-guidelines/ shall link to /doc/.

If this is done by the absolute path /doc/ then translators have to translate it to /de-DE/doc/.

In contrast, a relative path like ../, pointing to /doc/, stays ../ in the translated version. Thus, no "translation" is needed.

Also, the "prefixed path" {{ page.langprefix }}/doc/ does not need to be "translated" (it's still {{ page.langprefix }}/doc/ in the translated version). However, the prefix {{ page.langprefix }} must already exist in the canonical version (and therefore has to be inserted, but only once for all translations). In addition, the value for page.langprefix must be set in the YAML front matter (in this example to the value /de-DE), but this can easily be done by an awk script or something.

Thus, both relative and "prefixed" paths don't need an explicit translation. They are already translated implicitly.

andrewdavidwong commented 6 years ago

Any reason we can't just do a recursive find-and-replace? Something like:

$ find . -type f -print0 | xargs -0 sed -i 's#/doc/#/de-DE/doc/#g'

tokideveloper commented 6 years ago

I think it's hard to decide whether a string is a link or not without an MD/HTML/YAML parser.

But even if we used an appropriate parser, there could be corner cases where it's still hard to decide.

Let's say there are these lines:

<a href="/">To the root directory of the canonical/English/official version.</a>
...
<a href="/">To the root directory of the localized version in your language.</a>
...
<img src="/to/the/language-independent/logo.png">
...
Use `[here I am][/somewhere/in/the/repo]` to create a labeled link.
...
<a href="http://example.org/doc/">To the doc's root directory on another planet.</a>

The slashes must be interpreted differently, depending on the context, and thus, they could need different translations.
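To make the concern concrete, here is a minimal sketch that compares a plain string replacement (equivalent to the sed command) with a slightly more careful, attribute-aware one. The first sample line and both function names are invented for illustration. The second sample line is taken from the list above.

```python
import re

SAMPLE = (
    '<a href="/doc/">Documentation</a>\n'  # hypothetical internal link
    '<a href="http://example.org/doc/">To the doc\'s root directory on another planet.</a>\n'
)


def naive_localize(text, prefix="/de-DE"):
    """Replace every occurrence of /doc/, as the sed one-liner would."""
    return text.replace("/doc/", prefix + "/doc/")


def href_only_localize(text, prefix="/de-DE"):
    """Only touch href values that start with /doc/ (site-internal links)."""
    return re.sub(
        r'(href=")(/doc/)',
        lambda m: m.group(1) + prefix + m.group(2),
        text,
    )


if __name__ == "__main__":
    print(naive_localize(SAMPLE))      # also rewrites the external URL
    print(href_only_localize(SAMPLE))  # leaves the external URL alone
```

Even the second variant cannot tell the two `<a href="/">` cases above apart, since they differ only in intent. That is the part that still seems to need a human decision.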

andrewdavidwong commented 6 years ago

Why not simply run different commands on .md and .html files, or do the recursive find-and-replace only on the .md files (which are the vast majority), then manually edit the .html files?

tokideveloper commented 6 years ago

Okay, I see that the vast majority should be handled automatically while some corner cases should be inspected manually. So, what about this compromise:

  1. Get a list of all existing permalinks.
  2. Copy all files in the repo. Do the next two steps only on the copies.
  3. Automatically prefix all (permalink) paths in all files with a unique placeholder.
  4. Manually check the placeholders to be in the correct place and nowhere else.
  5. Upload the temporary copy to Transifex.
  6. Automatically replace the placeholders on a temporary copy with the language-dependent path prefix /de-DE etc.
  7. On future changes, do the above steps only on the differences.

In more detail:

(1) First, we get a list of all existing permalinks (like /, /doc/ etc.):

cd REPO
grep -re 'permalink: ' . | grep --invert-match -e './_config.yml' | cut -f 2 -d' ' | grep -e '^/' | sort

(2) Then we copy the files of the canonical version to a dedicated directory, let's call it new_lang_prefixed_DATETIME where DATETIME is the current date and time.

(3) There, into all files, we automatically insert a (hopefully) unique prefix like %LangPrefix% in front of all permalink strings that look like translatable paths, depending on the language HTML/MD/YAML etc., for example:

This way, at least all paths should be covered. Hopefully, we won't miss any path.

(4) In the next step, we manually check that every occurrence of %LangPrefix% really should become /de-DE etc. in the final files. Wherever a check fails, we replace the prefix %LangPrefix% with %NoLangPrefix%.

(5) Upload the files to Transifex and tell the translators not to translate these special prefixes.

(6) Then we automatically go through all translation languages and all translated files and modify them by replacing all occurrences of %LangPrefix% with /de-DE etc. and %NoLangPrefix% with the empty string.

(7) In the future, when some of the canonical files change, we copy only the modified files to a new new_lang_prefixed_DATETIME directory and repeat the steps described above only on the differences to the most recent previous new_lang_prefixed_DATETIME directory (via an appropriate use of the diff tool, for example). This way, we reduce the effort and focus only on the changes.

Of course, obsolete new_lang_prefixed_DATETIME directories may be removed. The directories might be useful if a new translation language appears since the newest version of a path-prefixed and manually inspected file should be uploaded. So, the directories would work as a cache.

EDIT: I swapped steps 5 and 6 to be able to upload only language-independent versions.
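As a rough illustration of steps 3 and 6, the sketch below inserts the placeholder in front of known permalinks and later swaps it for a language prefix. It only recognizes two link forms (Markdown inline links and HTML href attributes), which is an assumption made for brevity. The function names and the sample text are hypothetical.

```python
import re

PLACEHOLDER = "%LangPrefix%"
NO_PREFIX = "%NoLangPrefix%"


def insert_placeholders(text, permalinks):
    """Step 3: prefix known permalinks inside ](...) and href="..." targets."""
    alternatives = "|".join(
        re.escape(p) for p in sorted(permalinks, key=len, reverse=True)
    )
    # The lookahead requires the permalink to end at ')', '"' or a fragment '#'.
    pattern = re.compile(r'(\]\(|href=")(' + alternatives + r')(?=[)"#])')
    return pattern.sub(lambda m: m.group(1) + PLACEHOLDER + m.group(2), text)


def finalize(text, lang_prefix):
    """Step 6: turn placeholders into the language-dependent path prefix."""
    return text.replace(PLACEHOLDER, lang_prefix).replace(NO_PREFIX, "")


if __name__ == "__main__":
    source = (
        "See [guidelines](/doc/doc-guidelines/#markdown-conventions), "
        '<a href="/doc/">docs</a>, [ext](http://example.org/doc/).'
    )
    marked = insert_placeholders(source, ["/", "/doc/", "/doc/doc-guidelines/"])
    print(marked)
    print(finalize(marked, "/de-DE"))
```

Reference-style links, image sources and YAML values are not covered by this sketch, so the manual check in step 4 would still be needed.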

tokideveloper commented 6 years ago

> Of course, obsolete new_lang_prefixed_DATETIME directories may be removed. The directories might be useful if a new translation language appears since the newest version of a path-prefixed and manually inspected file should be uploaded. So, the directories would work as a cache.

Another method could be to overwrite the files in new_lang_prefixed_DATETIME with newer versions, rather than storing new versions in their own directories. Thus, only one such directory is needed, making the suffix _DATETIME superfluous.

tokideveloper commented 6 years ago

In the algorithm above, I forgot the redirect_from links. So wherever the algorithm mentions permalinks, all redirect_from links must be considered, too.
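A minimal sketch of collecting both kinds of paths from a file's YAML front matter could look like this. It deliberately avoids a YAML library and only understands the simple `key: value` and `- item` layout. The sample front matter, including its redirect paths, is invented for illustration.

```python
def front_matter_paths(markdown_text):
    """Return the permalink and redirect_from paths of one Markdown file."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter at all
    paths = []
    in_redirects = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # end of the front matter
            break
        if stripped.startswith("permalink:"):
            paths.append(stripped.split(":", 1)[1].strip())
            in_redirects = False
        elif stripped.startswith("redirect_from:"):
            in_redirects = True
        elif in_redirects and stripped.startswith("- "):
            paths.append(stripped[2:].strip())
        else:
            in_redirects = False
    # Keep only site-internal absolute paths.
    return [p for p in paths if p.startswith("/")]


SAMPLE = """---
layout: doc
title: Documentation Guidelines
permalink: /doc/doc-guidelines/
redirect_from:
- /en/doc/doc-guidelines/
- /doc/DocStyle/
---

Body text.
"""

if __name__ == "__main__":
    print(front_matter_paths(SAMPLE))
```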

andrewdavidwong commented 6 years ago

> Okay, I see that the vast majority should be handled automatically while some corner cases should be inspected manually. So, what about this compromise: [...]

It sounds like this procedure would be something the localization team (including you) performs. If it doesn't entail any changes to the canonical English documentation, the details of the procedure for accomplishing the agreed-upon end result are up to you.

tokideveloper commented 6 years ago

> If it doesn't entail any changes to the canonical English documentation, the details of the procedure for accomplishing the agreed-upon end result are up to you.

Okay, thank you! Of course, we'll try to minimize possible impacts on the canonical English documentation. But some things are not yet clear to me:

andrewdavidwong commented 6 years ago

> Where (directory and/or repo) can we put all our folders (languages, doc etc.) and files (content, layout etc.) concerning translations?

@marmarek is going to make (a) separate submodule(s) for the actual translated content (#2925).

The "unverified translation" warning layouts are trickier. For example, we can't allow unverified translations of the warning itself, since a malicious translator could alter the warning such that it's no longer about the translation being unverified. So, those will probably have to stay in the main repo.

> When it's about going live, the canonical English documentation should insert a language switch which contains links that are labeled with translated words and pointing to unofficial (i.e. translated) pages. Thus, (a) some minor adjustments on the canonical documentation and (b) some trust in translators etc. seem to be necessary. How to handle this?

I think this is what #2930 is about.

tokideveloper commented 6 years ago

> The "unverified translation" warning layouts are trickier. For example, we can't allow unverified translations of the warning itself, since a malicious translator could alter the warning such that it's no longer about the translation being unverified. So, those will probably have to stay in the main repo.

I see and agree. But how can we verify that a translation of the warning is correct? Off the top of my head: we feed the translated warning into several machine translation services, let each one translate the string into all languages we know well enough, and then check the results for plausibility.

andrewdavidwong commented 6 years ago

Sounds good to me. Similarly, given how short the warning is, we could try to have multiple (hopefully) independent human translators translate (or verify) it for each language.

tokideveloper commented 6 years ago

How to deal with fragments (*) in links?

Let me explain why it is problematic. The main concerns are the headings which get IDs created by the Markdown processor.

Let's say a translator wants to translate a link with fragment /file/#good-morning pointing to the heading Good Morning! in the document /file/.

To know how to translate it correctly, for example into German, the translator has to do several steps:

  1. Find the file the link/URL of the fragment is pointing to. (That is, in the list of files in Transifex, find the MD file with the given permalink /file/ in the YAML header.)
  2. In that file, look for the correct heading of the target of the fragment: Good Morning!.
  3. Look for the translation of Good Morning!, which is Guten Morgen!. If it's not yet translated then translate it first.
  4. Transform a copy of that translation (Guten Morgen! to guten-morgen) to match the ID the heading Guten Morgen! will have after processing the MD file to an HTML file.
  5. Enter that transformed result (guten-morgen) as the translated fragment. The resulting link is /de-DE/file/#guten-morgen (note that inserting /de-DE is another problem not discussed in this post).

(Note that step 2 and subsequent ones are different if there is no heading but any HTML element with that ID.)

These steps are cumbersome, error-prone and inconvenient. Also, if someone changes a header again then all related links/URLs have to be found and adapted again.
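Step 4 alone already hides some rules. The sketch below approximates how a heading might be turned into an ID. It is only an assumption about the general shape of such rules: the exact behaviour (for example with non-ASCII letters or duplicate headings) depends on the Markdown processor and is not reproduced here.

```python
import re


def heading_to_id(heading):
    """Approximate an auto-generated heading ID (ASCII-only simplification)."""
    text = re.sub(r"^[^a-zA-Z]+", "", heading)    # drop leading non-letters
    text = re.sub(r"[^a-zA-Z0-9 -]", "", text)    # drop punctuation etc.
    return text.replace(" ", "-").lower() or "section"


if __name__ == "__main__":
    for heading in ("Good Morning!", "Guten Morgen!"):
        print(f"{heading!r} -> #{heading_to_id(heading)}")
```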

To deal with it in a better way, I suggest the following solution. The translator does NOT translate any fragments. Instead, a machine inserts additional empty anchors into the headings in the resulting HTML files. The IDs of these new anchors match the IDs of the appropriate headings in the canonical version.

Following the example:

  1. Let the heading in the (MD-processed) canonical HTML file be <h3 id="good-morning">Good Morning!</h3>.
  2. Let the heading in the (MD-processed) translated HTML file be <h3 id="guten-morgen">Guten Morgen!</h3>.
  3. Add the ID good-morning from step 1 to a new anchor within the heading in step 2: <h3 id="guten-morgen"><a id="good-morning"></a>Guten Morgen!</h3>.

(Note: Skip step 3 if both IDs in the result would be equal.)

This way, the fragments given in the canonical files will also work with(in) the translated files. Thus, /de-DE/file/#good-morning (and /de-DE/file/#guten-morgen) will work.
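A minimal sketch of this fixup on rendered HTML follows. It assumes that the canonical and the translated page contain the same headings in the same order and that every heading is rendered as `<hN id="...">`. The function name is hypothetical, and a real implementation would probably use an HTML parser instead of a regular expression.

```python
import re

HEADING = re.compile(r'<h([1-6]) id="([^"]*)">')


def add_canonical_anchors(canonical_html, translated_html):
    """Insert an empty anchor carrying the canonical ID into each heading."""
    canonical_ids = iter(m.group(2) for m in HEADING.finditer(canonical_html))

    def with_anchor(match):
        canonical_id = next(canonical_ids, None)
        # Skip if there is no counterpart or both IDs would be equal.
        if canonical_id is None or canonical_id == match.group(2):
            return match.group(0)
        return f'{match.group(0)}<a id="{canonical_id}"></a>'

    return HEADING.sub(with_anchor, translated_html)


if __name__ == "__main__":
    print(add_canonical_anchors(
        '<h3 id="good-morning">Good Morning!</h3>',
        '<h3 id="guten-morgen">Guten Morgen!</h3>',
    ))
```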

marmarek commented 6 years ago

Hmm, it looks like applying such fixups in the md files wouldn't work, which means translated offline documentation will be slightly limited. IMO it would be desirable to come back to the idea of having all changes applied in the md files (maybe with some layout changes for that?). But we can go back to this later.

tokideveloper commented 6 years ago

Before I forget: another reason for fixups after the Jekyll run is the translation of redirecting pages.

But this could also be done by a separate Jekyll run that uses a dedicated (i.e. language-dependent) customized redirect template /_layouts/redirect.html.

tokideveloper commented 6 years ago

How to translate links (without a fragment) in general?

It's quite late for this important question. So, here we go:

The URL path of a translated page shall get a language-(region?-)dependent super-directory, and the rest of the URL shall remain as it is in the canonical version.

Example: The German version of https://www.qubes-os.org/doc/contributing/ shall be https://www.qubes-os.org/de/doc/contributing/ or https://www.qubes-os.org/de-DE/doc/contributing/, depending on the language code we want to use.
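Expressed as code, the mapping is a single path prefix. The sketch below uses Python's standard `urllib.parse`; the function name `localized_url` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def localized_url(canonical_url, lang_code):
    """Put the language code in front of the canonical path, keep the rest."""
    parts = urlsplit(canonical_url)
    return urlunsplit(parts._replace(path=f"/{lang_code}{parts.path}"))


if __name__ == "__main__":
    url = "https://www.qubes-os.org/doc/contributing/"
    for code in ("de", "de-DE"):
        print(localized_url(url, code))
```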

Also see this post.

tokideveloper commented 6 years ago

Which language code ("English", "en", "en-US", "eng" etc.) to use to distinguish the languages?

Currently, en is used in redirections to the canonical version. It's a language code without a specified region.

Instead, I would prefer the format LANGUAGE-REGION as listed in this ISO table (beside region-less codes). Pros are:

One downside is that we would have to add redirections from (or permalinks to?) the en-US versions in the YAML front matters. (The canonical version is written in American English, isn't it?) Also note that Wikipedia seems to be fine with region-less language codes for its sub-domains.

How to deal with the permalink URLs of the canonical version? I see two main ways:

While the first one will

the latter has the advantage that all paths would start with a language code, making them consistent. I'm open for both options.

What do you think?

andrewdavidwong commented 6 years ago

Definitely this one:

> We don't touch them (i.e. don't add an en-US top directory),

No language or region code for the canonical URLs. (And this is not elitism about English, BTW. I would say the same thing if the documentation were in any other language.)

There are good reasons that no major website has language or region codes in any of their canonical URLs. However, if anyone can provide a counterexample (of a major website that does this), I'd be interested to see it.

Other than that, sounds good to me.

tokideveloper commented 6 years ago

Entering the URL to the website of Mozilla https://www.mozilla.org/ redirects to https://www.mozilla.org/de/ for me.

Entering https://www.mozilla.org/en/ redirects to https://www.mozilla.org/en-US/ in my case.

There is also a language switch on the bottom offering other languages.

It seems that they use both LANGUAGE-REGION and LANGUAGE codes. The only rule I see there is: if there are at least two translations into the same language but for different regions, use LANGUAGE-REGION. (Otherwise, use LANGUAGE-REGION or LANGUAGE.)

There are also codes which aren't in the mentioned list, e.g. Frysk (fy-NL). Don't know where it's from.

EDIT: Interestingly, when I visit https://www.mozilla.org/de/ using the text web browser elinks then I can see a list of links on top of the page. These links point to the available languages. The two top-most links are:

A "canonical" link on Wikipedia also points to the German version in my case. So, maybe we don't really understand "canonical"? END OF EDIT.

andrewdavidwong commented 6 years ago

Interesting. I agree that this is a good counterexample, and I agree that what you describe in your edit is puzzling. I think both approaches are reasonable. In our case, it might still make sense to leave the canonical English version without a language code, since there's no way our localization will be as thorough as Mozilla's anytime soon.

marmarek commented 6 years ago

:+1: for keeping canonical version without language code - if nothing else, to clearly mark it as canonical one.

marmarek commented 6 years ago

As for language codes with or without region - indeed, adding a region code seems reasonable.

tokideveloper commented 6 years ago

Thank you both Andrew and Marek!

Let's summarize it:

However, for internal processing purposes only, I suggest using en for the official version. Reasons:

andrewdavidwong commented 6 years ago

> However, for internal processing purposes only, I suggest using en for the official version.

I guess it depends on what practical effects this will have on our workflow. If it only happens inside of scripts (i.e., documentation contributors and maintainers don't have to change anything), then I'm on board.

tokideveloper commented 6 years ago

@andrewdavidwong I see. I'm not sure yet but we'll see.

tokideveloper commented 6 years ago

@andrewdavidwong, @marmarek

I reviewed the algorithm I showed in a previous post. Here is the revised version:

  1. Get a list of all existing permalinks and redirect_from links as listed in the YAML front matters of all files.
  2. Automatically, in the file, prefix all paths that are in that list with the placeholder %UndecidedLangPrefix%. The resulting state of the file may be called "UndecidedVersion".
  3. If available, apply the patch Decision.patch generated during step 7 of the last run. Rejected hunks may be ignored or even deleted.
  4. If there is still an %UndecidedLangPrefix% placeholder within the file then notify a person responsible to do this:
    1. Replace all occurrences of %UndecidedLangPrefix% with %LangPrefix% if the concerned links have to be translated (most frequent case).
    2. Replace all occurrences of %UndecidedLangPrefix% with %NoLangPrefix% if the concerned links must not be translated (probably seldom).
  5. Check that there is no %UndecidedLangPrefix% in the file. If there is one then go back to step 4.
  6. The current state of the file may be called "DecidedVersion".
  7. Save the difference from "UndecidedVersion" to "DecidedVersion" as a patch called Decision.patch.
  8. Upload the file to Transifex and tell the translators not to touch the placeholders.
  9. Download a translated version of that file from Transifex. Let's say it's in German.
  10. Replace all occurrences of %LangPrefix% EDIT and %ExtraLangPrefix% END EDIT with /de-DE.
  11. Replace all occurrences of %NoLangPrefix% with the empty string.

By using the patch Decision.patch, we'll save time in subsequent runs, since only those %UndecidedLangPrefix% spots where the patch couldn't be applied have to be decided again.

EDIT As an additional step between 5 and 6 or between 7 and 8: Where necessary, add %ExtraLangPrefix% labels in front of all paths to translate that erroneously have not been detected. Save it as a patch and apply that patch in an earlier step in future runs. END EDIT

Of course, already existing sub-strings in the original files that are equal to the placeholders have to be escaped/treated specially.
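Step 7 can be illustrated with Python's standard `difflib`, which produces a unified diff that a tool such as patch could later try to apply (step 3). This is only a sketch: the function name, the file name and the two sample lines (including the image path) are hypothetical.

```python
import difflib


def decision_patch(undecided, decided, name="page.md"):
    """Return the manual decisions as a unified diff between both versions."""
    return "".join(difflib.unified_diff(
        undecided.splitlines(keepends=True),
        decided.splitlines(keepends=True),
        fromfile=f"UndecidedVersion/{name}",
        tofile=f"DecidedVersion/{name}",
    ))


UNDECIDED = (
    "See [docs](%UndecidedLangPrefix%/doc/).\n"
    "![logo](%UndecidedLangPrefix%/attachment/logo.png)\n"
)
# Result of the manual step 4: translate the first link, not the image path.
DECIDED = (
    "See [docs](%LangPrefix%/doc/).\n"
    "![logo](%NoLangPrefix%/attachment/logo.png)\n"
)

if __name__ == "__main__":
    print(decision_patch(UNDECIDED, DECIDED), end="")
```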

If a demo example is needed then I'll write and post one.

tokideveloper commented 4 years ago

How to deal with fragments (*) in links?

I've got a new idea concerning link fragments. Instead of viewing it as a problem of translation we view it as a problem of contribution to the English Markdown files (a new Markdown convention rule).

If a contributor wants to link to a specific section Example Section then he/she does not link to the anchor generated by GitHub Pages (#example-section) but to a new anchor like #link--example-section and he/she places the new anchor just below the section title. Thus the fragments are not to be translated anymore.

Example:

Instead of

See [below](#example-section).

Example Section
---------------

Example text.

a contributor has to write something like

See [below](#link--example-section).

Example Section
---------------

<a id="link--example-section" />
Example text.

Existing fragments should be adapted to point to new anchors, of course.

What do you think? @marmarek @andrewdavidwong

andrewdavidwong commented 4 years ago

I'm not a fan of this approach, because it clutters the source text with HTML that makes things more difficult for both the reader and the writer while also avoiding one of the great features of Markdown. This sort of simplicity and web compatibility is one of the reasons we write the docs in Markdown in the first place. I also think it'll be hard to enforce the convention of including an id anchor for every section link.

tokideveloper commented 4 years ago

I agree that such a convention is hard to enforce and that avoiding this feature of Markdown is not so good. So, I tried to get closer to the other solution above and wrote a script which inserts anchors with the original ids to the headings in the translated MD files.

Let the input be the original MD file

---
Yaml front matter
---

Heading 1
=========

Some text.

Heading 2
---------
Some text.

~~~
# pseudo heading 1

pseudo heading 2
================

pseudo heading 3
----------------
~~~

# Heading 3

Some text.

## Heading 4
Some text.

and its translated version (de)

---
Yaml front matter
---

Überschrift 1
=============

Etwas Text.

Überschrift 2
-------------
Etwas Text.

~~~
# pseudo heading 1

pseudo heading 2
================

pseudo heading 3
----------------
~~~

# Überschrift 3

Etwas Text.

## Überschrift 4
Etwas Text.

then the script produces

---
Yaml front matter
---

Überschrift 1
=============
<a id="heading-1"></a>
Etwas Text.

Überschrift 2
-------------
<a id="heading-2"></a>Etwas Text.

~~~
# pseudo heading 1

pseudo heading 2
================

pseudo heading 3
----------------
~~~

# Überschrift 3
<a id="heading-3"></a>
Etwas Text.

## Überschrift 4
<a id="heading-4"></a>Etwas Text.

to stdout.

It assumes that the original file and its translated version match line by line (ignoring the YAML front matter). The script does not change the number of lines in the output either (for other scripts that rely on that).

The script depends on having kramdown installed. It runs kramdown several times (twice for each real heading, once for each pseudo heading and once in general) and thus needs some time.

The script ignores quoted headings and is not the cleanest solution (it uses an artifice) but it should work for our purposes.
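For readers who want to see the idea in code, here is a simplified, pure-Python approximation. It is not the script described above: it does not call kramdown, it detects headings with plain regular expressions, and its ID rule is an ASCII-only assumption. All names are hypothetical.

```python
import re

ATX = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")   # "# Heading"
SETEXT_UNDERLINE = re.compile(r"^(=+|-+)\s*$")  # "=====" or "-----"
FENCE = re.compile(r"^(~~~|```)")               # start or end of a code fence


def heading_to_id(heading):
    """Approximate an auto-generated heading ID (ASCII-only simplification)."""
    text = re.sub(r"^[^a-zA-Z]+", "", heading)
    text = re.sub(r"[^a-zA-Z0-9 -]", "", text)
    return text.replace(" ", "-").lower() or "section"


def heading_targets(lines):
    """Return (heading text, index of the line after the heading) pairs."""
    targets = []
    in_front_matter = bool(lines) and lines[0].strip() == "---"
    in_fence = False
    i = 1 if in_front_matter else 0
    while i < len(lines):
        line = lines[i]
        step = 1
        if in_front_matter:
            in_front_matter = line.strip() != "---"
        elif FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence:
            atx = ATX.match(line)
            if atx:
                targets.append((atx.group(1), i + 1))
            elif (line.strip() and i + 1 < len(lines)
                  and SETEXT_UNDERLINE.match(lines[i + 1])):
                targets.append((line.strip(), i + 2))
                step = 2  # skip the underline as well
        i += step
    return targets


def add_original_anchors(original, translated):
    """Prepend anchors with the original IDs without changing the line count."""
    orig_lines = original.split("\n")
    out = translated.split("\n")
    if len(orig_lines) != len(out):
        raise ValueError("files must match line by line")
    for text, index in heading_targets(orig_lines):
        if index < len(out):  # a heading on the last line gets no anchor
            out[index] = f'<a id="{heading_to_id(text)}"></a>' + out[index]
    return "\n".join(out)
```

Because the anchor is prepended to the line that follows each heading, the output keeps the same number of lines as the input, which matches the constraint mentioned above.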

I think it is okay to modify the translated MD file rather than the HTML files rendered by GitHub Pages (the latter seems a little bit dirty to me since GitHub Pages may overwrite them at any time).

What do you think?

marmarek commented 4 years ago

While this approach is fragile, I like it, because it makes the whole process seamless for documentation writers and translators! While the translated files should still be readable (for offline documentation etc.), I don't expect them to be modified directly (we do that through Transifex), so that's ok too. I think our approach to translated files (download them from Transifex, post-process them with scripts, but do not modify them manually) is compatible with it. One thing that could make it less fragile would be to count not lines but headers, i.e. annotate the first translated header with the id of the first original header. But since (I think) Transifex does not change line numbers, it isn't strictly necessary.

tokideveloper commented 4 years ago

> While this approach is fragile,

I made it a little bit more robust, see here.

> One thing that could make it less fragile, would be count not lines, but headers - like annotate first translated header with id of first original header. But since (I think) Transifex does not change line numbers, it isn't strictly necessary.

If there is time we could make it great. ;-)

EDIT: I made it more robust and much quicker, see here. This solution needs only one run of kramdown for each (pseudo) heading.

tokideveloper commented 4 years ago

Now, I've written a Ruby port of the script described above. It's a lot faster than the Python version since there is no spawning of Ruby kramdown processes.