Website localization: Minimize/trim strings to translate

tokideveloper commented 5 years ago

The purpose is to minimize the length of the strings to translate. The fewer special (i.e. confusing) symbols there are, the less error-prone translating is for non-technical translators.

Motivation: The best file format for us to use when uploading to Transifex is *.txt, since it gives us the most flexibility across all the file formats we have to deal with (*.md, *.html, *.yml, etc.), as researched in #3548. However, with *.txt all special symbols will be shown to the translators.

tokideveloper commented 5 years ago

My idea here is to split each Markdown/YAML/etc. file into a skeleton file to keep in the repo and a *.txt file to translate. The skeleton will contain both the YAML front matter and the overall structure of the file, while the file to translate will contain only the relevant strings.

Example:

Given a file test.md:

---
layout: doc
title: Hello World Example
permalink: /doc/example
redirect_from:
- /xample/
- /en/xample/
---

Hello World
===========

This is a test for splitting a Markdown file before uploading to Transifex.
Inline formatting like **this** is kept for translation.
Only the surrounding structures remain in the skeleton.

A shiny header
--------------

Quoting something:

> This is a test
> for a quote.

Enumerate something:

1. First item
   * mentioning something important
   * adding a conclusion
2. Second item
   1. first subitem
   2. second subitem

### H3 header ###

```sh
#!/bin/sh
# Comments in a
# shell script file.

echo 'Hello World!'
```

Thanks for reading!

It will be split into the files test.md.skel and test.md.txt.

File test.md.skel:

---
layout: doc
title: LINE
permalink: /LANGCODE/doc/example
redirect_from:
- 
- /LANGCODE/xample/
---

LINE
===========

LINE
LINE
LINE

LINE
--------------

LINE

> LINE
> LINE

LINE

1. LINE
   * LINE
   * LINE
2. LINE
   1. LINE
   2. LINE

### LINE ###

```sh
LINE
LINE
LINE

LINE
```

LINE

File test.md.txt:


Hello World Example

Hello World

This is a test for splitting a Markdown file before uploading to Transifex.
Inline formatting like **this** is kept for translation.
Only the surrounding structures remain in the skeleton.

A shiny header

Quoting something:

This is a test
for a quote.

Enumerate something:

First item
mentioning something important
adding a conclusion
Second item
first subitem
second subitem

H3 header

#!/bin/sh
# Comments in a
# shell script file.

echo 'Hello World!'

Thanks for reading!

The file test.md.txt, which contains the same number of lines as test.md and test.md.skel, will then be uploaded to Transifex. Note that only the non-empty lines will be offered for translation at Transifex. After translation into a specific language lang, we will download a translated file, say test.md.lang, which again has the same number of lines as the original file and the skeleton.
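
Since everything relies on the files staying aligned line by line, a sanity check on the downloaded file could look roughly like the sketch below. The function name `alignment_problems` is hypothetical, and the rule that a line must be empty in both files or in neither is an assumption derived from the description above, not documented Transifex behaviour.

```python
def alignment_problems(source_lines, translated_lines):
    """Return a list of human-readable problems; an empty list means aligned.

    Assumption: both arguments are lists of lines without trailing newlines,
    e.g. the contents of test.md.txt and test.md.lang.
    """
    problems = []
    if len(source_lines) != len(translated_lines):
        problems.append(
            f"line count differs: {len(source_lines)} vs {len(translated_lines)}"
        )
    # Compare line by line; zip stops at the shorter file.
    for number, (src, dst) in enumerate(zip(source_lines, translated_lines), start=1):
        # A line should be empty in both files or in neither.
        if bool(src.strip()) != bool(dst.strip()):
            problems.append(f"line {number}: empty in one file but not in the other")
    return problems
```

Running this before merging would catch a shifted or truncated download early, instead of producing a page where strings end up under the wrong structure.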

The two final steps are easy: (1) Read line by line from both test.md.skel and test.md.lang and replace the LINE placeholders from the skeleton with the matching line from the translated file and write the result to an appropriate file. (2) Also, replace the LANGCODE placeholders with the appropriate language code, e.g. de-DE for German.
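
The two steps above could be sketched as follows. This is only a minimal sketch: the function name `merge` is hypothetical, and it assumes that both files have already been read into lists of lines without trailing newlines and that LINE and LANGCODE appear in a skeleton only as placeholders.

```python
import re

# Match LINE only as a whole word, so that e.g. "ONLINE" is left alone.
PLACEHOLDER = re.compile(r"\bLINE\b")


def merge(skeleton_lines, translated_lines, langcode):
    """Combine a skeleton with a translated strings file, line by line."""
    if len(skeleton_lines) != len(translated_lines):
        raise ValueError("skeleton and translation must have the same number of lines")
    merged = []
    for skel, text in zip(skeleton_lines, translated_lines):
        # Step (2) is applied first, so that the word "LANGCODE" inside a
        # translated string would not be touched.
        line = skel.replace("LANGCODE", langcode)
        # Step (1): passing a function as the replacement keeps backslashes
        # in the translated text literal.
        line = PLACEHOLDER.sub(lambda _match: text.strip(), line, count=1)
        merged.append(line)
    return merged


skeleton = ["title: LINE", "permalink: /LANGCODE/doc/example", "", "> LINE"]
translated = ["Hallo-Welt-Beispiel", "", "", "Dies ist ein Test"]
print("\n".join(merge(skeleton, translated, "de-DE")))
```

Skeleton lines without a placeholder are copied unchanged, so whatever the translated file contains at those positions is ignored.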

Any comments?

EDIT: The step of splitting the original file can be automated using an appropriate parser.

EDIT 2: Added the LINE for the title in the YAML front matter.
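
To illustrate the splitting step, here is a rough line-based sketch. It is not a real Markdown parser: the name `split` and both regular expressions are assumptions for illustration, it only covers the constructs from the test.md example (front matter title, Setext and ATX headers, quotes, list items, fenced code), and it leaves permalink and redirect_from untouched instead of inserting LANGCODE.

```python
import re

# Optional structural prefix (indent, quote marker, list bullet, ATX hashes),
# then the text, then optional closing hashes of an ATX header.
PREFIX = re.compile(r"^(\s*(?:>\s?|[*+-]\s+|\d+\.\s+|#{1,6}\s+)?)(.*?)(\s+#+)?$")
# Setext header underlines and horizontal rules made of "=" or "-".
UNDERLINE = re.compile(r"^(=+|-+)$")


def split(lines):
    """Return (skeleton, strings), two lists as long as the input."""
    skeleton, strings = [], []
    in_front_matter = in_fence = False
    for index, line in enumerate(lines):
        keep, text = line, ""  # default: structural line, nothing to translate
        stripped = line.strip()
        if stripped == "---" and (index == 0 or in_front_matter):
            in_front_matter = not in_front_matter
        elif in_front_matter:
            # Only the title is offered for translation.
            if line.startswith("title:"):
                keep, text = "title: LINE", line[len("title:"):].strip()
        elif stripped.startswith(("```", "~~~")):
            in_fence = not in_fence
        elif in_fence:
            if stripped:
                indent = line[: len(line) - len(line.lstrip())]
                keep, text = indent + "LINE", stripped
        elif stripped and not UNDERLINE.match(stripped):
            prefix, body, suffix = PREFIX.match(line).groups()
            if body:
                keep, text = prefix + "LINE" + (suffix or ""), body
        skeleton.append(keep)
        strings.append(text)
    return skeleton, strings
```

Because every input line produces exactly one skeleton line and one strings line, the three files keep the same number of lines by construction.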

marmarek commented 5 years ago

Wouldn't this make translation harder? Specifically, it will take out some of the context, like whether something is a title, item on a list, some command to execute or part of longer text.

tokideveloper commented 5 years ago

@marmarek commented on 26 Jan 2019, 17:40 CET:

> Wouldn't this make translation harder? Specifically, it will take out some of the context, like whether something is a title, item on a list, some command to execute or part of longer text.

IMHO, the translator should have a look at the related web page anyway. For example, if a translator doesn't know that ~~~ marks the beginning or end of a code block, then showing it in the file to translate won't help either. I'm optimistic that a look at the page will make the context clearer: the text runs from top to bottom just like the lines in Transifex, so finding the matching lines on the page shouldn't be a pain. In addition, note that images on the web page aren't shown in Transifex (only the URL of the image is). So the translator should always keep one eye on the related page.

Note that we've already discussed how to show the path of the URL of a resource file here.

If a specific line really lacks context, we can provide it using the "Edit context" button for that line in Transifex.

EDIT: IIRC, Transifex removes leading (and trailing?) whitespace from the lines to translate and reinserts it when the translated files are downloaded. Therefore, list sub-items and indentation-based code blocks, for example, will lose their context anyway when they appear in the Transifex interface.

Last but not least, for non-technical translators, all the special characters could look terrible in the Transifex interface, IMHO. (No, it's not because of Transifex; it's just the nature of Markdown files when they are not displayed as a plain text file in a monospaced font.) Also, paying attention to them and keeping them cleanly copied is hard to do, error-prone, time-consuming, and distracting from the main job of translating, IMHO.

tokideveloper commented 5 years ago

@marmarek: May we use external Markdown/HTML/Liquid/YAML parsers, or do we have to implement them on our own?

marmarek commented 5 years ago

It's OK to use existing tools, but they should be reasonably integrity-protected (to prevent targeted attacks). Ideally, they are installed using the normal package manager. If that is not an option, then at least verify the hash of the tool/archive before use.
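
As an illustration of the hash check, a minimal sketch follows. The choice of SHA-256 and the function names `sha256_of` and `verify` are assumptions; the expected digest would have to come from a trusted source and be pinned in the repo.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True only if the file matches the pinned digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A build script would call `verify` on the downloaded archive and refuse to unpack or run it when the result is False.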

QubesOS / qubes-issues

Website localization: Minimize/trim strings to translate #4768