Automated Build System to Check and Integrate Language Translations

GDQuest / learn-gdscript

Learn Godot's GDScript programming language from zero, right in your browser, for free.

https://gdquest.github.io/learn-gdscript/

Other

2.12k stars 162 forks source link

Automated Build System to Check and Integrate Language Translations #813

Closed NathanLovato closed 1 year ago

NathanLovato commented 1 year ago

Issue: We need a build system that can automatically check if translations for a given language are (nearly) complete and integrate those translations into the app. This build script should be able to add or remove languages automatically depending on the completion level of a given language.

Workflow:

Extract translation strings from the source application, which produces POT (PO template) files that contain the strings to translate with no translations. To do so, run i18n/extract.py from the project's root folder.
Move files to the repository where translations are stored: https://github.com/GDQuest/learn-gdscript-translations
In the translations repository, run the Python script sync_translations.py to update all the PO files which contain the actual translations per language.
Commit changes to the repository.
Translators update translations using PO files.
We integrate the relevant translation files into the app.

We need to ensure that lessons are fully translated (maybe over 90% or 95% translated is OK) before integrating translation files into the app. This feature should be included as part of the build script. The script should also warn about languages with outdated or incomplete translations.

To register languages in the app (show in menu), you can:

Copy the folder containing PO files to the i18n/ directory.
Add the language code to autoload/TranslationManager.gd, in the SUPPORTED_LOCALES array.

Unfortunately, tooling for PO files is not great, so I suggest implementing our mini-parser to check the completion percentage. A minimal parser is available on this link: https://github.com/GDQuest/learn-gdscript/blob/b53ba8e1cc5da074520b0e061c50c02d2a5fe3fc/i18n/match_and_merge_po_translations.py

It can be used as a base to read entries and calculate the translation percentage. In PO files, entries that need translation have the fuzzy keyword. I think that's all we need to check but be sure to double-check if other cases exist.

ml4nC3 commented 1 year ago

Hi NathanLovato,

Thanks for this. As I'm not familiar with build systems, also considering this is my first open-source contribution, may I ask further guidance on following points ?

Do you confirm that this script mission ends on step 4 of your workflow ?
Also, I understand there is a step 4.5 which outupts translations completion indicators to... the user that runs the script ? Or should it commit some report for everyone else ?
What should be the path of this script in the project ?
As you're mentionning Build Systems, is the script going to be executed automatically by something like jenkins or Github Actions or will this be run manually by one of the projects owners during build process ?

If the script runs locally :

should it pull ?
If not, I guess some parameter has to be set to target the appropriate folders, at least for the learn-gdscript-translations location
Do you have suggestion on how I should handle github user/password for the commit ? I guess I could simply go with some inputs statements. Or some basic tkinter window ? Which could handle target directories too.
How could I test ? Maybe by forking the project to my user and testing with this copy.

If the script does not runs locally :

I'll have other questions... ;)

Finally (...for now...) Can you assign this to me ?

Thanks in advance for your help ! Nicolas

NathanLovato commented 1 year ago

Thanks for following-up, Nicolas. Here are the answers to your questions.

The bulk of the work involved in this task is for step 6. The 6 steps were presented just to explain the translation workflow from a technical standpoint.

Most of the code for steps one through four is done already so there is little to do there: i18n/extract.py and sync_translations.py.

The primary things we need a Python program for are, before making a release:

Read all the translation files and check their completion percentage.
Warn of and report incomplete translations
Automatically add or remove languages with a high enough completion percentage to the app.

We should be able to run this program locally and manually to report on incomplete translations and get proper warnings.

Then, optionally, the program should be able to incorporate the translation files into the released app as described in the issue description (the second bullet list) through command line options. The command line option should allow us to specify the path to the Godot project's directory, which will be necessary for the script to work on different computers and via GitHub actions.

The github actions part should be a separate task, and probably something we should handle once the python program is done, because it would involve updating our existing release build action, which my teammate is comfortable changing already. Given the Python program it's just a bit of glue bash code.

Now answering more specific questions:

You don't need to deal with any git or github aspects. For this kind of script you can assume that the git repository is up to date.
The script could be stored either in this repository's i18n/ folder or at the root of the learn-gdscript-translations/ directory. Any clear name will do, e.g. check_and_integrate_translations.py sounds OK.

Please let me know if you have any other questions.

ml4nC3 commented 1 year ago

Great, I think I have a much better understanding now, thanks ! To sum up :

The script should run from command line
The script should react to parameters in order to get expected result
The script should take its inputs from arguments
The script main feature is to output the list of languages with a completion percentage

From your description I'm more thinking of two scripts : check_translations.py and integrate_translations.py. The first one would have an option to output a compatible data for the second so that both could be easily chained with a pipe. This way the solution is flexible and could be used either to get a status or automate integration. Let me know if you see a problem with this design.

I'll keep you posted on my progress.

NathanLovato commented 1 year ago

There's no need for flexibility here, so a single script would be more efficient. Importantly, you never want to integrate translations without checking their completion first. There's a hard dependency there. So having two scripts could lead to a user mistake that the program shouldn't allow. Or if you add code to prevent mistakes, then usage + code are a little bit more complicated for a benefit we'll never use.

I'd make it work like this:

If the user doesn't provide an output path, the program prints translation completion info in a nice human-readable form.
If the user provides an output path, the program integrates relevant translations and prints info about that too.

ml4nC3 commented 1 year ago

Ok, thanks for you're feedback. I'll go with this solution. Also, sorry for my newbe's questions but should I create a branch for my development ? Or will you do the commit yourself ?

NathanLovato commented 1 year ago

Thanks.

Please don't worry about asking questions. They're welcome!

If you've already worked with git and you know how to work with branches, this guide will explain how to create a pull request, which is a way to submit your work for review and integration into an open-source project on github:

Creating a pull request: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

You don't necessarily need to create new branches to contribute to open-source projects. Creating branches is mostly a good practice when a project has lots of activity. What you do need to do, though, is fork the project because you need to work on your own copy of the files and submit them via a pull request.

If you do want to create a branch, you can do so directly on github. Here's a guide for this:

Creating and deleting branches within your repository: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-and-deleting-branches-within-your-repository

I hope this helps

ml4nC3 commented 1 year ago

Hi NathanLovato,

I'm currently digging into your code, more specifically the match_and_merge_po_translations.py, in order to get a precise understanding of what is to do. Could you please check below algorithm description and let me know if I'm mistaking ?

check_and_integrate_translations.py

Integrate script runs extract.py.
Integrate script moves POT files to translation folder.
Integrate script runs sync_translations.py in translation folder.
- From this point each PO files in languages directories are updated by msgmerge command.
- msgmerge command updates the msgid values as per the POT file.
- If the msgid is not present in target PO file then it is added with an empty msgstr.
- msgmerge command detects target msgid that are close to the POT ones, in this case it replaces msgid value by POT one and adds the fuzzy keyword.
Now the Integrate script has to analyze PO files and determine completion indicator for each language. Here is an algorithm proposition :
- Integrate script iterate through languages directories.
- Integrate script parses each PO file of the directory.
- Integrate script iterate through each PO file's entries.
  - If the entry.msgstr is empty, the msgid is considered as not translated, it counts one missing translation more
  - If entry.is_fuzzy is True, the translation should be reworked, it counts... something. Let's say one 'rework' more for now and we'll see at the end how we include this in the indicator computation.
- Once the file is analyzed it stores results in some data structure and continues.
- Once the language is analyzed it computes a global indicator and store this in some data structure.
Once all languages are analyzed, if some threshold has been given in arguments it computes the list of languages that has a higher value.
It integrates the languages from that list back to the GD_Learn project by copying the whole language folder in i18n/ folder.

If I'm correct, the parsing operation done in match_and_merge_po_translations.py with function parse_po_file generates a handy list of PoFile objects that can be used for my parsing and analyzing algorithm. If yes, should I import it from the script match_and_merge_po_translations.py ? Or should I export part of this script to a python module that would be used by both scripts ?

Also I didn't get the meaning of entry.is_python_format property. Should I consider this value in the script ? EDIT : Is it a tag used by the application to determine it has to apply python code formatting ? Hence it should have the same msgstr value in all languages. Should I make sure it has ?

Regards, Nicolas

ml4nC3 commented 1 year ago

Thinking this further, does it really makes sense to integrate a language that has empty msgstr values ?

Shouldn't we exclude a language as soon as it has some untranslated string and compute the completion indicator only on the fuzzy basis (btw I didn't find any other cases) ?

Or should we be able to choose a behavior through command's options ?

ml4nC3 commented 1 year ago

I have applied the algorithm detailed above, and obtained following output (after extract and sync outputs) :

./cs {'total': 1878, 'missings': 762, 'fuzzies': 251}
./de {'total': 1892, 'missings': 824, 'fuzzies': 437}
./es {'total': 1898, 'missings': 762, 'fuzzies': 251}
./fr {'total': 1882, 'missings': 770, 'fuzzies': 288}
./hu {'total': 1835, 'missings': 965, 'fuzzies': 180}
./id {'total': 1742, 'missings': 1528, 'fuzzies': 8}
./images {'total': 0, 'missings': 0, 'fuzzies': 0}
./it {'total': 1874, 'missings': 761, 'fuzzies': 253}
./ja {'total': 1770, 'missings': 1402, 'fuzzies': 38}
./nl {'total': 1744, 'missings': 1566, 'fuzzies': 7}
./pl {'total': 1835, 'missings': 1121, 'fuzzies': 160}
./pt_BR {'total': 1890, 'missings': 762, 'fuzzies': 251}
./ru {'total': 1883, 'missings': 823, 'fuzzies': 268}
./sv {'total': 1795, 'missings': 1233, 'fuzzies': 100}
./th {'total': 1750, 'missings': 1478, 'fuzzies': 22}
./tr {'total': 1744, 'missings': 1548, 'fuzzies': 29}
./zh_Hans {'total': 1881, 'missings': 762, 'fuzzies': 251}
./zh_Hant {'total': 1758, 'missings': 1580, 'fuzzies': 19}
./ko {'total': 1756, 'missings': 1463, 'fuzzies': 24}

Note that this is a raw print of my result dict for now. I will improve the readability. Should I output a result per PO file ?

For the moment I simply imported the parse function with from match_and_merge_po_translations import parse_po_file. It is working fine.

ml4nC3 commented 1 year ago

How come that the number of total strings is not constant among the different languages ?

ml4nC3 commented 1 year ago

Hi have done further progress. Here is an example output :

(venv) [nicolas@FixNux learn-gdscript]$ python i18n/check_and_integrate_translations.py -t ../learn-gdscript-translations/ -ES -c 60
WARN: Skipping strings extraction and POT files generation...
WARN: Skipping PO files extraction...
INFO: Parsing PO files in 
INFO: Computing completion indicator for each language.
Language : CS - 60% including 13% fuzzy
Language : ES - 60% including 13% fuzzy
Language : IT - 60% including 13% fuzzy
Language : PT_BR - 60% including 13% fuzzy
Language : ZH_HANS - 60% including 13% fuzzy
Language : FR - 60% including 15% fuzzy
Language : RU - 57% including 14% fuzzy
Language : DE - 57% including 23% fuzzy
Language : HU - 48% including 9% fuzzy
Language : PL - 39% including 8% fuzzy
Language : SV - 32% including 5% fuzzy
Language : JA - 21% including 2% fuzzy
Language : KO - 17% including 1% fuzzy
Language : TH - 16% including 1% fuzzy
Language : ID - 13% including 0% fuzzy
Language : TR - 12% including 1% fuzzy
Language : ZH_HANT - 11% including 1% fuzzy
Language : NL - 11% including 0% fuzzy
INFO: Integrating languages above 60%.
Copying  ./cs
Copying  ./es
Copying  ./it
Copying  ./pt_BR
Copying  ./zh_Hans
Copying  ./fr

For now I opted for a completion indicator only on the missing strings basis, with this precision on fuzzies. But of course it is easy to change and adapt for any formula that would suits the team.

The copy is not effective for now, it is solely displaying folders that would be integrated without actually doing it.

If you want to check the code, it is present in my forked repo for now : https://github.com/ml4nC3/learn-gdscript/blob/main/i18n/check_and_integrate_translations.py

Let me know if this fits what you expected or not.

EDIT : Spanish at 60% feels pretty low for a language that is currently considered as complete... Is there something to investigate here ?

NathanLovato commented 1 year ago

Hi NathanLovato, Could you please check below algorithm description and let me know if I'm mistaking ?

Sounds great, and your task is to code points 4 to 6, which is what you've been working on so far it seems. So perfect.

Also I didn't get the meaning of entry.is_python_format property. Should I consider this value in the script ? EDIT : Is it a tag used by the application to determine it has to apply python code formatting ?

If I remember correctly it is a tag added by the Python library that extracts text strings from GDScript and scene files in Godot. This library is used in extract.py. I just added a line in the parser to preserve this tag because I wasn't sure if the library used them to update entries in POT files.

EDIT : Spanish at 60% feels pretty low for a language that is currently considered as complete... Is there something to investigate here ?

There's something to investigate, yes. Several languages which supposedly had all entries translated are showing 60% completion. FR, ES, IT, etc. had complete translations. The fuzzy count seems about right as we tweaked many sentences in the source app that now need review. There are certainly a couple of new entries since these translations were done, but not 40%.

Thinking this further, does it really makes sense to integrate a language that has empty msgstr values ?

It can be, yes, but it's something to consider. For one, if there are just a handful of translations missing, and a language doesn´t have translators, users could still benefit from a 90%+ complete translation.

Then, translators/the community generally works linearly through lessons. So they might have lessons 1 through 12 translated, and 13 and up unfinished. In this case too for users who don´t speak english, 12 lessons is better than 0 I think.

Another possible task after closing this issue would be outputting metadata that the app could use to tell users if a translation is complete or not at runtime and how to contribute translations. But it´s perhaps a stretch considering that we'll be working on the Godot 4 remake of this app.

NathanLovato commented 1 year ago

How come that the number of total strings is not constant among the different languages ?

This can be due to POT and PO files not having been synced since a good while. Especially as some languages get added via a web platform, weblate.

Unfortunately, we constantly have more on our plates than we can handle, so in the absence of fully automated systems or contributors giving us a hand, we can't keep up with some manual things we have to do (we have... 70 open-source repositories I think these days? Can't keep up with that with a team of only 4 full-time)

ml4nC3 commented 1 year ago

Thanks for your feedback !

There's something to investigate, yes. Several languages which supposedly had all entries translated are showing 60% completion. FR, ES, IT, etc. had complete translations.

I've spent a bit of time to understand what's going on. From what I've seen there is mismatches in msgid values because some have been splitted. A few sentences exists in the PO file both as a whole paragraph and splitted in several entries. Translations generally exists only for the whole paragraph. I've done screenshots I'll share with you later.

I also have noticed a strange behavior on language loading in the app. Switching to Spanish gave me first a complete translations, When I switched to French lot of sentences where not found and displayed although I was able to find them in the PO file. Srtange thing, when I switched back to Spanish the sentences untranslated where the same than French... I'll play further with this scenario to try to understand what's going on.

ml4nC3 commented 1 year ago

So my guess is that the issue is mainly coming from extract.py, or any library it is using. In short the POT file extracted does not contains the msgid values the application is actually expecting.

Then I suppose the merge step is trying to do its best but end up in messing around with inconsistent values. It would probably work better with the actual POT ids the application will search.

On the example I'm taking, right from lesson1, first few sentences, here is what working PO files looks like : godot_trans_fr_es

:+1: When I load those PO files and launch the app I'm getting a correct translations for both Spanish and French.

You can see that the first paragraph is here showing as a whole msgid.

If I run extract.py, I'm getting following POT file : godot_trans_diff_POT_es

:-1: From this point already you can see the msgid has been splitted in 2 entries.

If I run the whole process, here is what I get in the new PO files : godot_trans_fr_es_KO

Observations :

The first sentence is showing twice in translated version : as msgstr associated with the first msgid, AND in second msgstr.
The translation of the first sentence is shown with a slight variation on vocabulary. However the correct one is the second, which correspond to the whole value from the original PO file. The first one does not have exactly the same meaning.
- and strange fact : the french and spanish versions have the same meaning. As if it is taking an old value of a slightly different msgid value...

I'll try to dive in extract.py to see if I can spot something, but if you have a few insight to share that would be helpful !

ml4nC3 commented 1 year ago

Well it was not that hard to find... It seems this behavior is intentional as explained in the comment of your commit from last september : godot_trans_lovato_commit

Have you integrated a PO file in the application since you implemented this split ? If yes how do you merge back the blocks so the application is finding its expected strings ?

ml4nC3 commented 1 year ago

Okay so I've revert your commit on my forked repo, and indeed the values are no longer splitted. Immediate effect on the check_and_integrate_translations.py script is that we jump from 60% completion to 97% :

python i18n/check_and_integrate_translations.py -t ../learn-gdscript-translations/ -ES -c 96
WARN: Skipping strings extraction and POT files generation...
WARN: Skipping PO files merging with POT...
INFO: Parsing PO files and counting missing translations
INFO: Computing completion indicator for each language.
Language : CS - 97% including 21% fuzzy
Language : ES - 97% including 21% fuzzy
Language : IT - 97% including 22% fuzzy
Language : PT_BR - 97% including 21% fuzzy
Language : ZH_HANS - 97% including 21% fuzzy
Language : FR - 96% including 25% fuzzy
Language : RU - 92% including 24% fuzzy
Language : DE - 92% including 41% fuzzy
Language : HU - 77% including 20% fuzzy
Language : PL - 63% including 23% fuzzy
Language : SV - 52% including 17% fuzzy
Language : JA - 35% including 10% fuzzy
Language : KO - 28% including 7% fuzzy
Language : TH - 26% including 7% fuzzy
Language : ID - 21% including 3% fuzzy
Language : TR - 19% including 14% fuzzy
Language : ZH_HANT - 17% including 10% fuzzy
Language : NL - 17% including 3% fuzzy
INFO: Integrating languages above 96%.
Copying  ./cs
Copying  ./es
Copying  ./it
Copying  ./pt_BR
Copying  ./zh_Hans
Copying  ./fr

I guess we cannot just simply revert your commit, you probably did it for a good reason. But it seems it is the main cause of ids discrepancies.

However another issue is now showing : line breaks are not a the same position between POT and PO files, as we can see here on the first paragraph : godot_trans_diff_POT_fr_linebreak

If I correct manually the value then the translated value is correctly showing in the application.

ml4nC3 commented 1 year ago

On this string break issue, it seems it is linked to a default behavior from Babel. The pofile.write_po() method has a width parameter that is set by default to 76.

The documentation says :

:param width: the maximum line width for the generated output; use None, 0, or a negative number to completely disable line wrapping

In extract.py we can disable line wrapping :

with open(output_file, "wb") as file:
    pofile.write_po(
        fileobj=file,
        catalog=cat,
        width=-1
    )

And it is indeed stopping the POT file from having lines wrapped.

However it is only half way to the solution because current PO files are already created with lines wrapped, so the sync_translations.py is actually keeping the wrapped versions of strings id.

Hence I'm seeing two approaches for a full resolution :

We keep this behavior as it is now, but tweek the way the strings are read by the application. I'm guessing it might not concatenate strings before comparing ids ?
I write a script to re-generate current PO files and removing line wrapping by concatenating all string lines that does not end with "\n". This way the msgid should match the POT files generated with a disabled line wrapping on babel.write_po() call.

Let me know how you feel about this. Am I getting too far ? From this point I'll wait for your feedback before progressing further.

NathanLovato commented 1 year ago

Well it was not that hard to find... It seems this behavior is intentional as explained in the comment of your commit from last september :

I committed that? I remember working on this, the plan was to split translation entries to have one line per entry to make updating translations much more efficient. But we ended up canceling the plan as we were planning to shift focus to Godot 4, as then translation strings wouldn´t need to change much anymore (as we wouldn't make changes to the content as often as last year).

It should probably be reverted, though I need to double-check on my end to ensure I didn´t misunderstand anything.

Hence I'm seeing two approaches for a full resolution :

Which approach would you like to go with? Based on your tests, did one approach seem more reliable than the other? Feel free to go for this.

And please let me know if there's any way that I can help. Feel free to open PRs anytime for me to test or provide a code review.

A quick note for when you open a PR: For the sake of efficiency in the team we directly add comments to teammates pull requests. Would you be fine if we did that with yo as well or would you prefer us to leave comments and let you make every code change yourself? We're primarily talking bug fixes if any, and possibly tweaking variable names or something, moving comments to a docstring... quick refactoring work primarily to be sure we can reread the code easily months down the line.

ml4nC3 commented 1 year ago

It should probably be reverted, though I need to double-check on my end to ensure I didn´t misunderstand anything.

Okay, let me know !

Which approach would you like to go with? Based on your tests, did one approach seem more reliable than the other?

My gut feeling is that the application should not be sensible to such a difference. Furthermore it is a default behavior in Babel so I guess we should expect the application to adapt. Although I find a bit strange to wrap text in a technical storage file... I guess PO files are more or less designed to be read by humans, so let's keep it that way.

I'll go first confirming my hypothesis on how the application reads the strings and see if there is a simple way of tweeking it. This is also an occasion for me to dive in the UI usage of Godot which is my short term goal right now !

Would you be fine if we did that with yo as well or would you prefer us to leave comments and let you make every code change yourself?

I have absolutely no issue letting someone changing my code. Open source is meant to be collaborative, and it is crucial that the core team ensure it's appropriation of anything part of their code base. Just let me know what have been done and why, I'm here to learn !

And please let me know if there's any way that I can help. Feel free to open PRs anytime for me to test or provide a code review.

Okay, I'll check that first. I guess it would be for the better that you review what I did up to now, but I'm not yet comfortable with the process : is it fine if I send you some code although the solution is not yet fully functional ? Next step for me is to read that Pull request guide you shared earlier and I'll get back to you on that.

NathanLovato commented 1 year ago

I'll go first confirming my hypothesis on how the application reads the strings and see if there is a simple way of tweeking it. This is also an occasion for me to dive in the UI usage of Godot which is my short term goal right now !

Sounds great, go for it then!

I guess PO files are more or less designed to be read by humans, so let's keep it that way.

I would think this default to be more of an artifact of old guidelines, with code editors not supporting long lines and line wrapping too well in the past. Even not so far back, I've been using Emacs for some years and long line support and solid line wrapping wasn't quite there yet when I started using it... around 2016 or so.

We would wrap lines like this on the Godot documentation, and there isn't really a need for it anymore other than staying consistent with documents from 9 years ago.

I guess it would be for the better that you review what I did up to now, but I'm not yet comfortable with the process

Actually, it's completely fine if you prefer to wait to be further along before creating a pull request. Whatever works.

I have absolutely no issue letting someone changing my code. Open source is meant to be collaborative

I truly appreciate not only the time and efforts you're putting into this, but also the great mindset and communication. Thank you very much for that!

ml4nC3 commented 1 year ago

I've sent a Pull Request more as a test to see how it is going than an actual request.

I've seen there is a requirement on commit messages that I might not have respected. I will probably have to correct that at least, but do let me know if there is something else.

NathanLovato commented 1 year ago

Thank you very much! I will review and test your code as soon as possible.

I've seen there is a requirement on commit messages that I might not have respected.

Please don't worry about that. I can change the message upon merging the pull request.