QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Create separate untrusted submodule for translated files #2925

Open andrewdavidwong opened 7 years ago

andrewdavidwong commented 7 years ago

[Branched from #2652.]

Create separate untrusted submodule for translated files (i.e., no signed commits or tags).

tokideveloper commented 6 years ago

From #3547:

@marmarek is going to make (a) separate submodule(s) for the actual translated content (#2925).

Please let us know the name of that submodule when it's created.

A more general question: Currently, I see multiple repos like qubesos.github.io, qubes-attachment, qubes-posts, qubes-hcl, qubes-doc, qubes-manager etc. My questions here are:

tokideveloper commented 6 years ago

@marmarek @andrewdavidwong I haven't designed everything in detail yet, but I think having two repos would work well:

  1. one offline "working repo" for this:
    • the canonical files on which the current live translations are based (maybe including their history)
    • storing YAML front matters and file remainders separately
    • Jekyll processing steps: preprocessing, site generation, postprocessing
    • pending translations due to problems with the structure or other no-gos
    • scripts for automation
    • context support for translators (e.g. web page screenshots)
    • manuals on how to do i18n and l10n
    • logs
  2. one online "live repo" for this:
    • all the files necessary for the web site
      • all the successfully translated HTML files generated in the "working repo", together with their YAML front matter (translated from the original)
      • stored in the same file hierarchy as the canonical version
      • thus, roughly speaking, Jekyll will just copy the files into the _site directory without any problems and serve them to the public
    • no other files

The main reason is that I don't want to break the live repo in case of script errors etc.

Concerning the naming of the new repos: following the Wikipedia article on internationalization and localization, it seems reasonable to name

  1. the first repo qubesos-website-translation or qubes-website-i18n-and-l10n with an optional suffix -working-repo
  2. and the second repo qubesos-website-translated or qubesos-website-localized.

Any suggestions?

marmarek commented 6 years ago

IMO the translated repo should contain translated md files, not generated HTML. Not sure how this fits into the repo layout above, but I guess it's the first one. There is no need to call Jekyll manually - GitHub Pages does it for us.

As for scripts - I think we can keep them in the qubesos.github.io repo; there is already a _utils directory. Such a script should download new content from Transifex, prepend and/or validate the front matter (layout, URLs, redirects, etc. - to make sure it won't hijack content from another language, especially English), then commit to the repository with translated content.
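A validation step of that kind might look roughly like this. Everything here is an assumption for the sake of the sketch: the file name, the de language code, the sample front matter, and the awk extraction are all invented, not the actual tooling:

```shell
# Sketch only: reject a downloaded translation whose permalink would escape
# the language subtree. File name, language code, and content are invented.
mkdir -p de-DE
cat > de-DE/example_from-transifex.md <<'EOF'
---
layout: doc
permalink: /de/doc/example/
---
Beispielinhalt.
EOF

# Pull the permalink out of the YAML front matter (first --- ... --- block).
permalink=$(awk 'f && /^---$/{exit} /^---$/{f=1; next} f && $1=="permalink:"{print $2}' \
    de-DE/example_from-transifex.md)

# A page may only go live under /de/; anything else could hijack another
# language's URL, e.g. the English page at /doc/example/.
case "$permalink" in
  /de/*) echo "OK: $permalink" ;;
  *)     echo "REJECT: $permalink" >&2; exit 1 ;;
esac
```

A real script would validate the other front matter keys (layout, redirects) the same way before committing anything.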

The majority of content to translate is in the qubes-doc repo. There is some in qubesos.github.io and qubes-posts, but I'm not sure if/how we want to handle it. Definitely qubesos.github.io needs some translation-related files - layouts, the language switcher, etc. But IMO that can be handled manually (regardless of the script mentioned above). So, I think we need qubes-doc-translated, or maybe even a separate one for each language. Any opinions?

tokideveloper commented 6 years ago

IMO the translated repo should contain translated md files, not generated HTML. Not sure how this fits into the repo layout above, but I guess it's the first one.

To give an overview of how I want to use the "working repo":

I plan to automate translation as much as possible to both aid the translator (e.g. translating links, translating the YAML front matter) and ensure integrity (e.g. see the problem of translating internal links). My current process of translating an MD file, let's call it example.md, looks roughly like this (with fictional file names and locations):

  1. Process example.md, resulting in example_ready-for-transifex.md and example_yaml-front-matter.yml.
  2. Upload example_ready-for-transifex.md to Transifex.
  3. Download a (partially) translated file from Transifex into an appropriate path like de-DE/example_from-transifex.md.
  4. Do some checks on de-DE/example_from-transifex.md to ensure integrity.
  5. Process example_yaml-front-matter.yml using de-DE/example_from-transifex.md, resulting in de-DE/example_yaml-front-matter.yml.
  6. Process de-DE/example_from-transifex.md together with de-DE/example_yaml-front-matter.yml, resulting in de-DE/example_ready-for-jekyll.md.
  7. Let de-DE/example_ready-for-jekyll.md be processed by Jekyll, resulting in de-DE/example_from-jekyll.html.
  8. Process de-DE/example_from-jekyll.html, resulting in de-DE/example_ready-for-going-live.html.
  9. The file de-DE/example_ready-for-going-live.html may go live now.
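For illustration, step 1 could be sketched with a couple of awk one-liners. The file names follow the fictional ones above, and the awk splitting is just a sketch, not the real tooling:

```shell
# Illustrative only: split example.md into front matter and body (step 1).
# The sample page is made up; real input would come from qubes-doc.
cat > example.md <<'EOF'
---
layout: doc
title: Example
---
Some content.
EOF

# Everything between the first pair of --- lines is the YAML front matter.
awk '/^---$/{n++; next} n==1' example.md > example_yaml-front-matter.yml

# Everything after the second --- is the body to upload to Transifex.
awk '/^---$/{n++; next} n>=2' example.md > example_ready-for-transifex.md
```

Keeping the front matter out of the Transifex upload is what later makes step 5 possible: the translated front matter is reconstructed from the canonical one instead of being edited by translators.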

So, after being processed by Jekyll, the file de-DE/example_from-jekyll.html is not yet ready to go live. E.g. see the problem of translating fragments.

I just tested whether the files generated by Jekyll can be modified manually without Jekyll modifying them again: it's possible, so, yes, in theory, only one repo seems to be sufficient. But there might be a race condition if Jekyll regenerates a file while step 8 is in progress.

Also, I recently read this warning on the Jekyll homepage, which encouraged me to use a second repo for the ready-for-going-live files:

Destination folders are cleaned on site builds [...] Do not use an important location for <destination>; instead, use it as a staging area and copy files from there to your web server.

In addition, having only live-going files in a second repo will keep it free of noise.

There is no need to call Jekyll manually - GitHub Pages does it for us.

That's okay, provided the race condition mentioned above can be solved.

As for scripts - I think we can keep them in the qubesos.github.io repo; there is already a _utils directory. Such a script should download new content from Transifex, prepend and/or validate the front matter (layout, URLs, redirects, etc. - to make sure it won't hijack content from another language, especially English), then commit to the repository with translated content.

IMHO, since these new scripts are related to translation, they might be better stored in a translation repo. (Okay, there is the problem with the language switcher, which could be solved in the canonical repo and copied to a translation repo, or vice versa.)

The hijacking problem shall be solved by steps 1 and 5 of my process described above.

The majority of content to translate is in the qubes-doc repo. There is some in qubesos.github.io

For these, my solution is almost complete.

and qubes-posts, but I'm not sure if/how we want to handle it.

For qubes-posts, I don't have a solution yet.

Definitely qubesos.github.io needs some translation-related files - layouts, the language switcher, etc. But IMO that can be handled manually (regardless of the script mentioned above).

Almost everything can be automated - the language switcher, too.

So, I think we need qubes-doc-translated, or maybe even a separate one for each language. Any opinions?

I thought about one repo for each language and I think it's cumbersome, inconvenient and hard to maintain. But it might make sense if there are trusted people who sign the commits of the translations.

I also think that one translation repo per canonical repo is inconvenient in a similar way. But, to be honest, I don't know the reason for having the already existing canonical sub-repos.

I think that one repo for all languages and all files will fit our needs. Note that all content produced during translation is not trusted - at least not yet - so this characteristic "keeps the translated files together". But maybe there are good reasons concerning the use of Git or the separation of concerns.

marmarek commented 6 years ago

I also think that one translation repo per canonical repo is inconvenient in a similar way. But, to be honest, I don't know the reason for having the already existing canonical sub-repos.

The idea is to cleanly separate "website stuff" from "documentation stuff", mostly for #1019.

I think that one repo for all languages and all files will fit our needs. Note that all content produced during translation is not trusted - at least not yet - so this characteristic "keeps the translated files together". But maybe there are good reasons concerning the use of Git or the separation of concerns.

This is why I propose keeping the scripts in a separate repo (not necessarily a new one - that's why I propose qubesos.github.io). Because those scripts need to be trusted: they implement "sandboxing" of translated content. One repo for all languages is fine for me.

So, after being processed by Jekyll, the file de-DE/example_from-jekyll.html is not yet ready to go live. E.g. see the problem of translating fragments.

Can those post-processing changes be applied back to de-DE/example_ready-for-jekyll.md? If it's only about various links, it should be easy (maybe even easier to parse and process Markdown than HTML?).

tokideveloper commented 6 years ago

I also think that one translation repo per canonical repo is inconvenient in a similar way. But, to be honest, I don't know the reason for having the already existing canonical sub-repos.

The idea is to cleanly separate "website stuff" from "documentation stuff", mostly for #1019.

Thank you.

I think that one repo for all languages and all files will fit our needs. Note that all content produced during translation is not trusted - at least not yet - so this characteristic "keeps the translated files together". But maybe there are good reasons concerning the use of Git or the separation of concerns.

This is why I propose keeping the scripts in a separate repo (not necessarily a new one - that's why I propose qubesos.github.io). Because those scripts need to be trusted: they implement "sandboxing" of translated content.

I see that these scripts (and also new ones) need to be trusted, and creating a new repo might be too much, so placing new ones into qubesos.github.io is fine for me. But I'm not sure what the existing scripts do and what you mean by 'they implement "sandboxing" of translated content'.

So, after being processed by Jekyll, the file de-DE/example_from-jekyll.html is not yet ready to go live. E.g. see the problem of translating fragments.

Can those post-processing changes be applied back to de-DE/example_ready-for-jekyll.md? If that's only about various links, it should be easy (maybe even easier to parse and process markdown than html?)

Applying the changes back sounds like closing a loop, resulting in infinite processing - not so good. Even if we could get it to work this way, it would be a hack rather than a clean solution, IMHO, since the mentioned problem with fragments cannot be solved on an abstract "MD level", only on the "HTML level". Think of this: before assigning an ID to an HTML element, we should check whether that ID is already in use - which cannot be done before the HTML file is generated by a Markdown processor.

Another reason against it: how can I (or a machine) decide whether de-DE/example_ready-for-jekyll.md comes from step X or step Y? If an additional flag is needed, then we could just as well use another directory instead.

Last but not least, HTML headings - at least those generated by the current version of kramdown - follow a pattern that seems easily detectable by a regex:

grep -re '^ *<h[0-9]\+ id="[^"]\+">.*</h[0-9]\+>$' _site/

where .* corresponds to the heading text from the MD file. Okay, it's a little more complicated than that (think of hard-coded HTML headings in MD files), but I already have an idea to solve even that.
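A duplicate-ID check along these lines might look as follows. The sample pages are invented; the only thing taken from above is kramdown's `<hN id="...">` heading pattern:

```shell
# Set up two sample generated pages; in reality these would come from Jekyll.
mkdir -p _site
printf '<h1 id="intro">Intro</h1>\n<h2 id="setup">Setup</h2>\n' > _site/a.html
printf '<h1 id="intro">Einfuehrung</h1>\n' > _site/b.html

# Extract every heading ID across the site and report those used more than
# once; any ID printed here would need to be renamed before going live.
dups=$(grep -rhoe '<h[0-9]\+ id="[^"]\+"' _site/ \
  | sed 's/.*id="\([^"]*\)"/\1/' \
  | sort | uniq -d)
echo "$dups"   # -> intro (the ID reused across both sample files)
```

This only covers site-wide uniqueness; checking fragment links against the surviving IDs would be a separate pass.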

Is it possible to have GitHub run scripts, say, daily? Or do I have to run those scripts daily and manually on my machine?

marmarek commented 6 years ago

I see that these scripts (and also new ones) need to be trusted, and creating a new repo might be too much, so placing new ones into qubesos.github.io is fine for me. But I'm not sure what the existing scripts do and what you mean by 'they implement "sandboxing" of translated content'.

What I meant by "sandboxing" is that the script will (among other things) ensure that translated pages live in the /lang/ subtree and do not interfere with other languages (especially the base English one). Mostly the process you describe in #3547.

Do you plan to keep de-DE/example_ready-for-jekyll.md committed in some repository? IMO it's worth keeping something downloaded from Transifex, to be able to adjust and re-apply postprocessing without downloading the files from Transifex again.

tokideveloper commented 6 years ago

What I meant by "sandboxing" is that the script will (among other things) ensure that translated pages live in the /lang/ subtree and do not interfere with other languages (especially the base English one). Mostly the process you describe in #3547.

Okay, I see. Translated pages have to reside in a separate namespace (i.e. URL path space).

Do you plan to keep de-DE/example_ready-for-jekyll.md committed in some repository? IMO it worth keeping something downloaded from transifex to be able to adjust and re-apply postprocessing, without downloading files from transifex again.

Yes. To be more precise, even though not fully elaborated yet, I plan to construct a "working pipeline" by creating a directory for each stage described above. The MD file of a new/updated page has to pass through all stages. We say that a file is in a certain stage if it's in the corresponding stage directory and there is no related file in the next stage directory.

There will be a script (plus helper scripts if needed) for each stage transition. The files shall remain in each directory for tracking purposes. The last stage holds the final files, ready to go live.

Besides the files to process, there will also be a log for tracking smaller working steps and a separate one for notifications in case of errors or ambiguities.

Aside from the "working pipeline", there shall be a "tracking pipeline" with the same internal structure. The difference is that the "working pipeline" is for new and updated pages that aren't ready to go live, while the "tracking pipeline" saves the transition history of the current live pages.
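The stage rule above could be expressed like this. The stage directory names and the current_stage helper are made up purely for the sake of the sketch:

```shell
# Invented stage names; the real pipeline would use the stages listed earlier.
STAGES="1-from-transifex 2-ready-for-jekyll 3-from-jekyll 4-ready-for-going-live"

# A page is "in" the last stage directory that contains it, i.e. it sits
# there and has no counterpart in any later stage directory.
current_stage() {
    found="none"
    for s in $STAGES; do
        if [ -e "$s/$1" ]; then found="$s"; fi
    done
    echo "$found"
}

# Demo: a page that has reached stage 2 but not stage 3.
mkdir -p 1-from-transifex/de-DE 2-ready-for-jekyll/de-DE
touch 1-from-transifex/de-DE/example.md 2-ready-for-jekyll/de-DE/example.md
current_stage de-DE/example.md
```

Since a file is never deleted from earlier stage directories, the directories themselves double as the tracking history mentioned above.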

tokideveloper commented 6 years ago

I'm not sure whether this question has been overlooked, so I'll ask again: is it possible to have GitHub (or anything else) run the translation scripts, e.g. daily? Or do I have to run these scripts daily on my private machine?

marmarek commented 6 years ago

GitHub can't do it, but we can probably use Travis (there is an option for scheduled runs) or another similar service. We also have a little of our own infrastructure, but that is the least preferable option.

Anyway, once we have the scripts, we'll find a way to run them periodically.