Open tokideveloper opened 6 years ago
Concerning automated link translation, see this idea.
Any comments?
Use relative links instead of absolute ones?
I recently updated the documentation guidelines on this point: https://www.qubes-os.org/doc/doc-guidelines/#markdown-conventions
(In short: Yes, please use relative instead of absolute paths.)
I'm not sure we're talking about the same thing. Maybe I used the term "relative link" ambiguously.
By "relative link" I mean a "relative path" (not a URL), in the sense that the path does not begin with a slash `/`, like local paths on a Linux machine. URLs, however, are always absolute in my understanding.
For example, while `https://www.qubes-os.org/doc/doc-guidelines/` and `/doc/doc-guidelines/` are absolute paths by my definition, the paths `../`, `../../intro/` and `intro/` are relative ones. (If these relative links appeared on the page `/doc/doc-guidelines/`, they would lead to `/doc`, `/intro` and `/doc/doc-guidelines/intro`, respectively. See my prototype.)
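For illustration, here is a minimal Python sketch of how such links resolve against the page they appear on. It is not part of the prototype mentioned above; the helper name `resolve` is hypothetical, and the page URL is simply the one from the example.

```python
from urllib.parse import urljoin

# Page URL taken from the example in this thread.
PAGE = "https://www.qubes-os.org/doc/doc-guidelines/"


def resolve(link: str, page: str = PAGE) -> str:
    """Resolve a link against the page URL, as a browser would (RFC 3986)."""
    return urljoin(page, link)


# Relative paths depend on the current page; the absolute path /doc/ does not.
for link in ("../", "../../intro/", "intro/", "/doc/"):
    print(f"{link!r:16} -> {resolve(link)}")
```

Note that the result keeps the trailing slash of the link as written.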
Oh, I see. Yes, I think we're talking about two different things. My main concern is to avoid `https://www.qubes-os.org/doc/` in favor of `/doc/`, since the former prevents easy navigation on a locally-served copy of the website.
I see. Thank you.
So, now I want to discuss the use of relative paths (e.g. `../doc-guidelines`), absolute paths (e.g. `/doc/doc-guidelines`) and "prefixed paths" (e.g. `{{ page.langprefix }}/doc/doc-guidelines`).
(*) I tried to set a variable `langprefix` within the Liquid code of my langswitch prototype, hoping that the variable would exist when printing the `{{ content }}`, but it does not seem to work.
Hint: When I tried out "prefixed paths", some strange behaviour appeared (paths with a literal leading slash in the source MD file became relative ones in the produced HTML files). So, "prefixed paths" should be tested in advance with every way of creating links.
Why would absolute paths have to be localized manually when the others don't?
Let's say that, for example, the page `/doc/doc-guidelines/` shall link to `/doc/`.
If this is done with the absolute path `/doc/`, then translators have to translate it to `/de-DE/doc/`.
In contrast, a relative path like `../`, pointing to `/doc/`, stays `../` in the translated version. Thus, no "translation" is needed.
Also, the "prefixed path" `{{ page.langprefix }}/doc/` does not need to be "translated" (it's still `{{ page.langprefix }}/doc/` in the translated version). However, the prefix `{{ page.langprefix }}` must already exist in the canonical version (and therefore has to be inserted, but only once for all translations). In addition, the value of `page.langprefix` must be set in the YAML front matter (in this example to the value `/de-DE`), but this can easily be done by an awk script or something similar.
Thus, neither relative nor "prefixed" paths need an explicit translation. They are already translated implicitly.
Any reason we can't just do a recursive find-and-replace? Something like:
$ find . -type f -print0 | xargs -0 sed -i 's#/doc/#/de-DE/doc/#g'
I think it's hard to decide whether a string is a link or not if you don't use an MD/HTML/YAML parser.
But even if we used an appropriate parser, there could be corner cases where it's still hard to decide.
Let's say there are these lines:
<a href="/">To the root directory of the canonical/English/official version.</a>
...
<a href="/">To the root directory of the localized version in your language.</a>
...
<img src="/to/the/language-independent/logo.png">
...
Use `[here I am][/somewhere/in/the/repo]` to create a labeled link.
...
<a href="http://example.org/doc/">To the doc's root directory on another planet.</a>
The slashes must be interpreted differently, depending on the context, and thus, they could need different translations.
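To make the concern concrete, here is a small Python sketch that applies a context-free replacement (the equivalent of the `sed` command above) to sample lines like these, next to a slightly more careful variant. Both function names are hypothetical. The second variant is only an assumption about what "more careful" could mean, and it still cannot tell the two `href="/"` cases apart.

```python
import re

# Sample lines modelled on the ones in this comment.
SAMPLES = [
    '<a href="/doc/">Internal link that should be localized.</a>',
    '<img src="/to/the/language-independent/logo.png">',
    '<a href="http://example.org/doc/">To the doc root on another planet.</a>',
]


def naive_localize(text: str, prefix: str = "/de-DE") -> str:
    """Context-free replacement, like sed 's#/doc/#/de-DE/doc/#g'."""
    return text.replace("/doc/", f"{prefix}/doc/")


def href_localize(text: str, prefix: str = "/de-DE") -> str:
    """Prefix only href values that start with a single slash (site-internal paths)."""
    return re.sub(
        r'href="(/(?!/)[^"]*)"',
        lambda m: f'href="{prefix}{m.group(1)}"',
        text,
    )


for line in SAMPLES:
    print("naive:", naive_localize(line))
    print("href :", href_localize(line))
```

The naive variant also rewrites the external `http://example.org/doc/` URL, while the attribute-aware variant leaves it and the image path alone.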
Why not simply run different commands on `.md` and `.html` files, or do the recursive find-and-replace only on the `.md` files (which are the vast majority), then manually edit the `.html` files?
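Okay, I see that the vast majority should be handled automatically while some corner cases should be inspected manually. So, what about this compromise: paths get a placeholder prefix automatically, the placeholders are checked manually, and the confirmed ones are finally replaced with `/de-DE` etc. In more detail:
(1) First, we get a list of all existing permalinks (like `/`, `/doc/` etc.):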
cd REPO
grep -re 'permalink: ' . | grep --invert-match -e './_config.yml' | cut -f 2 -d' ' | grep -e '^/' | sort
(2) Then we copy the files of the canonical version to a dedicated directory, let's call it `new_lang_prefixed_DATETIME`, where `DATETIME` is the current date and time.
(3) There, in all files, we automatically insert a (hopefully) unique prefix like `%LangPrefix%` in front of all permalink strings that look like translatable paths, depending on the language (HTML/MD/YAML etc.), for example:
- `[/doc/]` to `[%LangPrefix%/doc/]` in MD files,
- `(/)` to `(%LangPrefix%/)` in MD files,
- `permalink: /doc/anti-evil-maid/` to `permalink: %LangPrefix%/doc/anti-evil-maid/` in the YAML front matter,
- `href="/doc/"` to `href="%LangPrefix%/doc/"` in HTML files and
- `src="/"` to `src="%LangPrefix%/"` in HTML files.

This way, at least all paths should be covered. Hopefully, we won't miss any path.
(4) In the next step, we manually check that every occurrence of `%LangPrefix%` should indeed be transformed to `/de-DE` etc. in the final files. Where a check fails, we replace the prefix `%LangPrefix%` with `%NoLangPrefix%`.
(5) Upload the files to Transifex and tell the translators not to translate these special prefixes.
(6) Then we automatically go through all translation languages and all translated files and modify them by replacing all occurrences of `%LangPrefix%` with `/de-DE` etc. and `%NoLangPrefix%` with the empty string.
(7) In the future, when some of the canonical files change, we copy only the modified files to a new `new_lang_prefixed_DATETIME` directory and repeat the steps described above only on the differences from the most recent `new_lang_prefixed_DATETIME` directory (via an appropriate use of the `diff` tool, for example). This way, we reduce the effort and focus only on the changes.
Of course, obsolete `new_lang_prefixed_DATETIME` directories may be removed. The directories might be useful if a new translation language appears, since the newest version of a path-prefixed and manually inspected file should be uploaded. So, the directories would work as a cache.
EDIT: I swapped steps 5 and 6 to be able to upload only language-independent versions.
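Step (3) could be sketched roughly as follows in Python. The regular expressions are my own assumptions about what "looks like a translatable path" (a single leading slash right after `[`, `(`, `permalink:`, `href="` or `src="`), and the names `PATTERNS` and `insert_placeholder` are hypothetical. False positives are expected, which is what the manual check in step (4) is for.

```python
import re

PLACEHOLDER = "%LangPrefix%"

# Each pattern captures the text right before a path starting with a single slash.
# The lookahead (?=/(?!/)) skips protocol-relative URLs such as //example.org/.
PATTERNS = {
    "md": re.compile(r'([\[(])(?=/(?!/))'),
    "yaml": re.compile(r'^(permalink:[ \t]*)(?=/)', re.MULTILINE),
    "html": re.compile(r'((?:href|src)=")(?=/(?!/))'),
}


def insert_placeholder(text: str, kind: str) -> str:
    """Insert the placeholder in front of every path candidate for the given file kind."""
    return PATTERNS[kind].sub(lambda m: m.group(1) + PLACEHOLDER, text)


print(insert_placeholder("See [the docs](/doc/) or [/doc/].", "md"))
print(insert_placeholder("permalink: /doc/anti-evil-maid/", "yaml"))
print(insert_placeholder('<a href="/doc/"><img src="/"></a>', "html"))
```

External URLs such as `href="http://example.org/doc/"` are left untouched because they do not start with a slash.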
Another method could be to overwrite the files in `new_lang_prefixed_DATETIME` with newer versions, rather than storing new versions in their own directories. Thus, only one `new_lang_prefixed_DATETIME` directory is needed, making the suffix `_DATETIME` superfluous.
In the algorithm above, I forgot the `redirect_from` links. So, whenever `permalink`s are concerned, all `redirect_from` links must be considered, too.
Okay, I see that the vast majority should be handled automatically while some corner cases should be inspected manually. So, what about this compromise: [...]
It sounds like this procedure would be something the localization team (including you) performs. If it doesn't entail any changes to the canonical English documentation, the details of the procedure for accomplishing the agreed-upon end result are up to you.
Okay, thank you! Of course, we'll try to minimize possible impacts on the canonical English documentation. But some things are not yet clear to me:
Where (directory and/or repo) can we put all our folders (languages, doc etc.) and files (content, layout etc.) concerning translations?
@marmarek is going to make (a) separate submodule(s) for the actual translated content (#2925).
The "unverified translation" warning layouts are trickier. For example, we can't allow unverified translations of the warning itself, since a malicious translator could alter the warning such that it's no longer about the translation being unverified. So, those will probably have to stay in the main repo.
When it's about going live, the canonical English documentation should insert a language switch which contains links that are labeled with translated words and pointing to unofficial (i.e. translated) pages. Thus, (a) some minor adjustments on the canonical documentation and (b) some trust in translators etc. seem to be necessary. How to handle this?
I think this is what #2930 is about.
I see and agree. But how can we verify that a translation of the warning is correct? Spontaneously, I got this idea: We enter the translated warning into several translation machines, let each machine translate the string into all languages we know well enough and then we check the translations for plausibility.
Sounds good to me. Similarly, given how short the warning is, we could try to have multiple (hopefully) independent human translators translate (or verify) it for each language.
How to deal with fragments (*) in links?
Let me explain why it is problematic. The main concerns are the headings which get IDs created by the Markdown processor.
Let's say a translator wants to translate a link with the fragment `/file/#good-morning`, pointing to the heading `Good Morning!` in the document `/file/`.
To know how to translate it correctly, for example into German, the translator has to do several steps:
1. Find the file with the permalink `/file/`. (It is the one with `permalink: /file/` in the YAML header.)
2. In that file, find the heading that belongs to the fragment, here `Good Morning!`.
3. Look up the translation of `Good Morning!`, which is `Guten Morgen!`. If it's not yet translated then translate it first.
4. Convert the translated heading to an ID (here: `Guten Morgen!` to `guten-morgen`) to match the ID the heading `Guten Morgen!` will have after processing the MD file to an HTML file.
5. Use this ID (`guten-morgen`) as the translated fragment. The resulting link is `/de-DE/file/#guten-morgen` (note that inserting `/de-DE` is another problem not discussed in this post).

(Note that step 2 and subsequent ones are different if there is no heading but any HTML element with that ID.)
These steps are cumbersome, error-prone and inconvenient. Also, if someone changes a heading again, all related links/URLs have to be found and adapted again.
To deal with it in a better way, I suggest the following solution. The translator does NOT translate any fragments. Instead, a machine inserts additional empty anchors into the headings in the resulting HTML files. The IDs of these new anchors match the IDs of the appropriate headings in the canonical version.
Following the example:
1. The canonical HTML file contains the heading `<h3 id="good-morning">Good Morning!</h3>`.
2. The translated HTML file contains the heading `<h3 id="guten-morgen">Guten Morgen!</h3>`.
3. The machine adds the ID `good-morning` from step 1 as a new anchor within the heading from step 2: `<h3 id="guten-morgen"><a id="good-morning"></a>Guten Morgen!</h3>`. (Note: Skip step 3 if both IDs in the result would be equal.)

This way, the fragments given in the canonical files will also work with(in) the translated files. Thus, `/de-DE/file/#good-morning` (and `/de-DE/file/#guten-morgen`) will work.
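A minimal Python sketch of this post-processing idea might look as follows. It assumes that headings in the rendered HTML have the form `<hN id="...">` and that the n-th heading of the translated page corresponds to the n-th heading of the canonical page. Both are assumptions of mine, and the function name is hypothetical.

```python
import re

# Matches an opening heading tag with an ID, e.g. <h3 id="good-morning">.
HEADING = re.compile(r'<(h[1-6]) id="([^"]+)">')


def add_canonical_anchors(canonical_html: str, translated_html: str) -> str:
    """Add an empty anchor with the canonical ID to each translated heading."""
    canonical_ids = iter([m.group(2) for m in HEADING.finditer(canonical_html)])

    def add_anchor(match: re.Match) -> str:
        canonical_id = next(canonical_ids, None)
        # Skip if there is no counterpart or if both IDs would be equal.
        if canonical_id is None or canonical_id == match.group(2):
            return match.group(0)
        return f'{match.group(0)}<a id="{canonical_id}"></a>'

    return HEADING.sub(add_anchor, translated_html)


print(add_canonical_anchors(
    '<h3 id="good-morning">Good Morning!</h3>',
    '<h3 id="guten-morgen">Guten Morgen!</h3>',
))
```

Pairing headings by position is fragile if a translation adds or drops a heading, so a real implementation would need a check for that.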
Hmm, this looks like applying such fixups in md file wouldn't work. Which means translated offline documentation will be slightly limited. IMO it would be desirable to come back to the idea of having all changes applied in md files (maybe some layouts changes for that?). But we can go back to this later.
Just before I forget it: Another reason for fixups after the Jekyll run is the translation of redirecting pages.
But this could also be done by a specific run of Jekyll with a dedicated (i.e. language-dependent) customized redirect template `/_layouts/redirect.html`.
How to translate links (without a fragment) in general?
It's quite late for this important question. So, here we go:
The URL path of a translated page shall get a language-(region?-)dependent super-directory and the rest of the URL shall remain as it is for the canonical version.
Example: The German version of `https://www.qubes-os.org/doc/contributing/` shall be `https://www.qubes-os.org/de/doc/contributing/` or `https://www.qubes-os.org/de-DE/doc/contributing/`, depending on the language code we want to use.
Also see this post.
Which language code ("English", "en", "en-US", "eng" etc.) to use to distinguish the languages?
Currently, `en` is used for redirections to the canonical version. It's a language code without a specified region.
Instead, I would prefer the format `LANGUAGE-REGION` as listed in this ISO table (alongside region-less codes). Pros are:
- Regional variants of a language can be distinguished (e.g. `en-GB` or `en-US`). Note that currently, e.g. "color" and "colour" coexist in the documentation.
- Name collisions become less likely: `bg`, meaning "background" or such, could also be the name of a top directory in the canonical version, colliding with `bg` for "Bulgarian". In contrast, `bg-BG` is probably not "background-BackGround" or such.

One downside is that we would have to add redirections from (or permalinks to?) the `en-US` versions (the canonical version is written in American English, isn't it?) in the YAML front matter. Also note that Wikipedia seems to be fine with region-less language codes for its sub-domains.
How to deal with the permalink URLs of the canonical version? I see two main ways:
- We don't touch them (i.e. don't add an `en-US` top directory),
- or we move them into an `en-US` top directory.

While the first one will [...], the latter has the advantage that all paths would start with a language code, making them consistent. I'm open to both options.
What do you think?
Definitely this one:
We don't touch them (i.e. don't add an en-US top directory),
No language or region code for the canonical URLs. (And this is not elitism about English, BTW. I would say the same thing if the documentation were in any other language.)
There are good reasons that no major website has language or region codes in any of their canonical URLs. However, if anyone can provide a counterexample (of a major website that does this), I'd be interested to see it.
Other than that, sounds good to me.
Entering the URL to the website of Mozilla https://www.mozilla.org/ redirects to https://www.mozilla.org/de/ for me.
Entering https://www.mozilla.org/en/ redirects to https://www.mozilla.org/en-US/ in my case.
There is also a language switch on the bottom offering other languages.
It seems that they use both `LANGUAGE-REGION` and `LANGUAGE`, mixed. The only rule I see there is: if there are at least two translations into the same language but for different regions, then use `LANGUAGE-REGION`. (Otherwise, use `LANGUAGE-REGION` or `LANGUAGE`.)
There are also codes which aren't in the mentioned list, e.g. Frysk (`fy-NL`). I don't know where that one comes from.
EDIT: Interestingly, when I visit https://www.mozilla.org/de/ using the text web browser `elinks`, I can see a list of links at the top of the page. These links point to the available languages. The two top-most links are: [...]
A "canonical" link on Wikipedia also points to the German version in my case. So, maybe we don't really understand "canonical"? END OF EDIT.
Interesting. I agree that this is a good counterexample, and I agree that what you describe in your edit is puzzling. I think both approaches are reasonable. In our case, it might still make sense to leave the canonical English version without a language code, since there's no way our localization will be as thorough as Mozilla's anytime soon.
:+1: for keeping canonical version without language code - if nothing else, to clearly mark it as canonical one.
As for language codes with or without region - indeed, adding a region code seems reasonable.
Thank you both Andrew and Marek!
Let's summarize it:
- The canonical version keeps its URLs without a language code.
- Translated versions use language codes of the form `LANGUAGE-REGION`.

However, for internal processing purposes only, I suggest using `en` for the official version. Reasons:
- `en` is currently used in the redirection paths. So, I'll just use an existing name and won't create an additional one.
- `en` is neither `en-US` nor `en-GB` and thus fits our current "needs" of using an "almost-English" language due to the lack of native speakers.
- `en` doesn't steal either `en-US` or `en-GB` and thus could be adapted in the future in case we get ample manpower of native speakers.
I guess it depends on what practical effects this will have on our workflow. If it only happens inside of scripts (i.e., documentation contributors and maintainers don't have to change anything), then I'm on board.
@andrewdavidwong I see. I'm not sure yet but we'll see.
@andrewdavidwong, @marmarek
I reviewed my algorithm shown in a previous post. Here are my outcomes:
1. Get a list of all `permalink`s and `redirect_from` links as listed in the YAML front matters of all files.
2. In front of all strings that look like translatable paths, insert the placeholder `%UndecidedLangPrefix%`. The resulting state of the file may be called "UndecidedVersion".
3. Apply the patch `Decision.patch` generated during step 7 of the last run. Rejected hunks may be ignored or even deleted.
4. If there is a `%UndecidedLangPrefix%` placeholder within the file then notify a person responsible to do this:
   - Replace `%UndecidedLangPrefix%` with `%LangPrefix%` if the concerned links have to be translated (most frequent case).
   - Replace `%UndecidedLangPrefix%` with `%NoLangPrefix%` if the concerned links must not be translated (probably seldom).
5. Check for a remaining `%UndecidedLangPrefix%` in the file. If there is one then go back to step 4.
6. [...]
7. Save the decisions made, relative to the "UndecidedVersion", as the patch `Decision.patch`.
8. Replace `%LangPrefix%` (EDIT: and `%ExtraLangPrefix%`) with `/de-DE`.
9. Replace `%NoLangPrefix%` with the empty string.

By using the patch `Decision.patch`, we'll save time in the next runs since only those spots of `%UndecidedLangPrefix%` where the patch couldn't be applied must be adapted.
EDIT As an additional step between 5 and 6 or between 7 and 8: Where necessary, add `%ExtraLangPrefix%` labels in front of all paths to translate that erroneously have not been detected. Save it as a patch and apply that patch in an earlier step in future runs. END EDIT
Of course, sub-strings that already exist in the original files and are equal to the placeholders have to be escaped or treated specially.
If a demo example is needed then I'll write and post one.
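As a rough illustration of the final replacement steps only, here is a minimal Python sketch. The function name `finalize` is hypothetical, and refusing to continue while an undecided placeholder remains is my own reading of the "go back to step 4" rule.

```python
UNDECIDED = "%UndecidedLangPrefix%"


def finalize(text: str, lang_prefix: str) -> str:
    """Resolve all decided placeholders for one target language."""
    if UNDECIDED in text:
        # A human decision is still missing, so the file is not ready yet.
        raise ValueError("file still contains %UndecidedLangPrefix%")
    for placeholder in ("%LangPrefix%", "%ExtraLangPrefix%"):
        text = text.replace(placeholder, lang_prefix)
    return text.replace("%NoLangPrefix%", "")


decided = '[docs](%LangPrefix%/doc/) <img src="%NoLangPrefix%/logo.png">'
print(finalize(decided, "/de-DE"))
```

The placeholder names do not contain each other as sub-strings (each starts with `%` followed by a different word), so the order of the replacements does not matter here.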
How to deal with fragments (*) in links?
I've got a new idea concerning link fragments. Instead of viewing it as a problem of translation we view it as a problem of contribution to the English Markdown files (a new Markdown convention rule).
If a contributor wants to link to a specific section `Example Section`, then he/she does not link to the anchor generated by GitHub Pages (`#example-section`) but to a new anchor like `#link--example-section`, and he/she places the new anchor just below the section title. Thus, the fragments no longer need to be translated.
Example:
Instead of
See [below](#example-section).
Example Section
---------------
Example text.
a contributor has to write something like
See [below](#link--example-section).
Example Section
---------------
<a id="link--example-section" />
Example text.
Existing fragments should be adapted to point to new anchors, of course.
What do you think? @marmarek @andrewdavidwong
I'm not a fan of this approach, because it clutters the source text with HTML that makes things more difficult for both the reader and the writer while also avoiding one of the great features of Markdown. This sort of simplicity and web compatibility is one of the reasons we write the docs in Markdown in the first place. I also think it'll be hard to enforce the convention of including an id anchor for every section link.
I agree that such a convention is hard to enforce and that avoiding this feature of Markdown is not so good. So, I tried to get closer to the other solution above and wrote a script which inserts anchors with the original ids to the headings in the translated MD files.
Let the input be the original MD file
---
Yaml front matter
---
Heading 1
=========
Some text.
Heading 2
---------
Some text.
~~~
# pseudo heading 1
pseudo heading 2
================
pseudo heading 3
----------------
~~~
# Heading 3
Some text.
## Heading 4
Some text.
and its translated version (de)
---
Yaml front matter
---
Überschrift 1
=============
Etwas Text.
Überschrift 2
-------------
Etwas Text.
~~~
# pseudo heading 1
pseudo heading 2
================
pseudo heading 3
----------------
~~~
# Überschrift 3
Etwas Text.
## Überschrift 4
Etwas Text.
then the script produces
---
Yaml front matter
---
Überschrift 1
=============
<a id="heading-1"></a>
Etwas Text.
Überschrift 2
-------------
<a id="heading-2"></a>Etwas Text.
~~~
# pseudo heading 1
pseudo heading 2
================
pseudo heading 3
----------------
~~~
# Überschrift 3
<a id="heading-3"></a>
Etwas Text.
## Überschrift 4
<a id="heading-4"></a>Etwas Text.
to stdout.
It assumes that the original file and its translated version match line by line (ignoring the YAML front matter). The script also does not change the number of lines in the output (for other scripts that need that feature).
The script depends on having `kramdown` installed. It runs `kramdown` several times (twice for each real heading, once for each pseudo heading and once in general) and thus needs some time.
The script ignores quoted headings and is not the cleanest solution (it uses an artifice) but it should work for our purposes.
I think it is okay to modify the translated MD file rather than modifying the HTML files rendered by Github Pages (the latter seems a little bit dirty to me since Github Pages may re-override them at any time).
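For readers who want the gist without the script itself, here is a simplified Python sketch of the same idea. It is not the script described above: instead of calling `kramdown`, it uses a rough stand-in `slug` function for the heading IDs (an assumption that will differ from kramdown for many titles), and all function names are hypothetical.

```python
import re

FENCE = re.compile(r"^(~~~|```)")
ATX = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")
SETEXT_UNDERLINE = re.compile(r"^(=+|-+)\s*$")


def slug(title: str) -> str:
    """Rough stand-in for kramdown's generated heading ID (assumption)."""
    cleaned = re.sub(r"[^a-z0-9 -]", "", title.lower())
    return cleaned.strip().replace(" ", "-")


def body_start(lines):
    """Index of the first line after the YAML front matter, if any."""
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return i + 1
    return 0


def heading_ends(lines):
    """Yield (index of a heading's last line, title), skipping fenced code."""
    start = body_start(lines)
    in_fence = False
    for i in range(start, len(lines)):
        line = lines[i]
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        atx = ATX.match(line)
        if atx:
            yield i, atx.group(1)
        elif SETEXT_UNDERLINE.match(line) and i > start and lines[i - 1].strip():
            yield i, lines[i - 1].strip()


def add_anchors(original: str, translated: str) -> str:
    """Put anchors with the original IDs after the translated headings."""
    src = original.split("\n")
    out = translated.split("\n")
    for end, title in heading_ends(src):
        target = end + 1
        if target < len(out):
            # A blank line becomes the anchor line; otherwise the anchor is
            # prefixed to the next line, so the number of lines never changes.
            out[target] = f'<a id="{slug(title)}"></a>' + out[target]
    return "\n".join(out)
```

As in the script, the sketch relies on the original and the translation matching line by line.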
What do you think?
While this approach is fragile, I like it, because it makes the whole process seamless for documentation writers and translators! While the translated files should still be readable (for offline documentation etc.), I don't expect them to be modified directly (we do that through Transifex), so that's OK too. I think our approach to translated files (download them from Transifex, post-process them with scripts, but do not modify them manually) is compatible with it. One thing that could make it less fragile would be to count not lines but headers, e.g. annotate the first translated header with the ID of the first original header. But since (I think) Transifex does not change line numbers, it isn't strictly necessary.
While this approach is fragile,
I made it a little bit more robust, see here.
One thing that could make it less fragile would be to count not lines but headers, e.g. annotate the first translated header with the ID of the first original header. But since (I think) Transifex does not change line numbers, it isn't strictly necessary.
If there is time we could make it great. ;-)
EDIT: I made it more robust and much quicker, see here. This solution needs only one run of `kramdown` for each (pseudo) heading.
Now, I've written a Ruby port of the script described above. It's a lot faster than the Python version since there is no spawning of Ruby `kramdown` processes.
In order to specify a translation workflow/guidelines, we need to specify how to translate/localize links within the Qubes OS (doc) website. In this specific issue, I would like to discuss ways to do so.
Here are some key questions (checked if solved):
(*) A fragment is the part after a hash sign ("#"), here: leading to a specific header on the linked page.
(**) "en" seems to be the currently used one. See the `redirect_from` lists in the YAML front matters in the Markdown files.
Related issues:
2824
1452
1333