Hi NathanLovato,
Thanks for this. As I'm not familiar with build systems, and considering this is my first open-source contribution, may I ask for further guidance on the following points?
If the script runs locally :
If the script does not run locally :
Finally (...for now...): can you assign this to me?
Thanks in advance for your help ! Nicolas
Thanks for following-up, Nicolas. Here are the answers to your questions.
The bulk of the work involved in this task is for step 6. The 6 steps were presented just to explain the translation workflow from a technical standpoint.
Most of the code for steps one through four is done already, so there is little to do there: `i18n/extract.py` and `sync_translations.py`.
The primary things we need a Python program for are, before making a release:
We should be able to run this program locally and manually to report on incomplete translations and get proper warnings.
Then, optionally, the program should be able to incorporate the translation files into the released app as described in the issue description (the second bullet list) through command line options. The command line option should allow us to specify the path to the Godot project's directory, which will be necessary for the script to work on different computers and via GitHub actions.
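To give a rough idea of what I mean by command line options, here is a minimal argparse sketch; the flag names (like `--project-dir`) and the defaults are only placeholders, not a spec:

```python
# Hypothetical sketch of the command-line interface described above.
# Flag names and defaults are placeholders, not a final design.
import argparse
from pathlib import Path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Check translation completeness and optionally integrate PO files."
    )
    parser.add_argument(
        "--project-dir",
        type=Path,
        default=Path("."),
        help="Path to the Godot project's directory.",
    )
    parser.add_argument(
        "--integrate",
        action="store_true",
        help="Copy complete translations into the app instead of only reporting.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Godot project directory: {args.project_dir.resolve()}")
```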
The GitHub Actions part should be a separate task, and probably something we should handle once the Python program is done, because it would involve updating our existing release build action, which my teammate is already comfortable changing. Given the Python program, it's just a bit of glue bash code.
Now answering more specific questions:
You can put it in the `i18n/` folder or at the root of the `learn-gdscript-translations/` directory. Any clear name will do; e.g. `check_and_integrate_translations.py` sounds OK.

Please let me know if you have any other questions.
Great, I think I have a much better understanding now, thanks ! To sum up :
From your description I'm thinking more of two scripts: `check_translations.py` and `integrate_translations.py`. The first one would have an option to output data compatible with the second, so that both could easily be chained with a pipe. This way the solution is flexible and could be used either to get a status or to automate the integration. Let me know if you see a problem with this design.
I'll keep you posted on my progress.
There's no need for flexibility here, so a single script would be more efficient. Importantly, you never want to integrate translations without checking their completion first. There's a hard dependency there. So having two scripts could lead to a user mistake that the program shouldn't allow. Or if you add code to prevent mistakes, then usage + code are a little bit more complicated for a benefit we'll never use.
I'd make it work like this:
Ok, thanks for your feedback. I'll go with this solution. Also, sorry for my newbie questions, but should I create a branch for my development? Or will you do the commit yourself?
Thanks.
Please don't worry about asking questions. They're welcome!
If you've already worked with git and you know how to work with branches, this guide explains how to create a pull request, which is a way to submit your work for review and integration into an open-source project on GitHub:
You don't necessarily need to create new branches to contribute to open-source projects. Creating branches is mostly a good practice when a project has lots of activity. What you do need to do, though, is fork the project because you need to work on your own copy of the files and submit them via a pull request.
If you do want to create a branch, you can do so directly on github. Here's a guide for this:
I hope this helps!
Hi NathanLovato,
I'm currently digging into your code, more specifically `match_and_merge_po_translations.py`, in order to get a precise understanding of what needs to be done. Could you please check the algorithm description below and let me know if I'm mistaken?
`check_and_integrate_translations.py`:

1. Run `extract.py`.
2. Run `sync_translations.py` in the translation folder. This uses the `msgmerge` command.
3. The `msgmerge` command updates the `msgid` values as per the POT file:
   - if a `msgid` is not present in the target PO file, then it is added with an empty `msgstr`;
   - if the `msgmerge` command detects target `msgid` values that are close to the POT ones, it replaces the `msgid` value with the POT one and adds the `fuzzy` keyword.
4. Parse the resulting PO files:
   - if `entry.msgstr` is empty, the `msgid` is considered as not translated; it counts as one more missing translation;
   - if `entry.is_fuzzy` is True, the translation should be reworked; it counts as... something. Let's say one more 'rework' for now, and we'll see at the end how we include this in the indicator computation.
5. If a `threshold` has been given in the arguments, compute the list of languages that have a higher value.
6. Integrate those languages into the `GD_Learn` project by copying the whole language folder into the `i18n/` folder.

If I'm correct, the parsing operation done in `match_and_merge_po_translations.py` with the function `parse_po_file` generates a handy list of `PoFile` objects that can be used for my parsing and analyzing algorithm. If yes, should I import it from the script `match_and_merge_po_translations.py`? Or should I extract part of this script into a Python module that would be used by both scripts?
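To make points 4 and 5 more concrete, here is a rough sketch of the counting I have in mind. I'm assuming here that I can get, per language, a list of parsed entries exposing `msgstr` and `is_fuzzy` (as in `match_and_merge_po_translations.py`); the exact names are only illustrative:

```python
# Rough sketch of points 4 and 5. The entry objects are assumed to expose
# .msgstr and .is_fuzzy, like the entries parsed in match_and_merge_po_translations.py.

def count_entries(entries):
    counts = {"total": 0, "missings": 0, "fuzzies": 0}
    for entry in entries:
        counts["total"] += 1
        if not entry.msgstr:
            counts["missings"] += 1  # empty msgstr -> one more missing translation
        elif entry.is_fuzzy:
            counts["fuzzies"] += 1  # fuzzy -> translation to rework
    return counts


def languages_above_threshold(counts_by_language, threshold_percent):
    """Return the languages whose completion indicator is above the threshold."""
    selected = []
    for language, counts in counts_by_language.items():
        total = counts["total"] or 1
        completion = 100.0 * (total - counts["missings"]) / total
        if completion >= threshold_percent:
            selected.append(language)
    return selected
```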
Also, I didn't get the meaning of the `entry.is_python_format` property. Should I take this value into account in the script?

EDIT: Is it a tag used by the application to determine that it has to apply Python code formatting? Hence it should have the same `msgstr` value in all languages; should I make sure it does?
Regards, Nicolas
Thinking about this further, does it really make sense to integrate a language that has empty `msgstr` values?
Shouldn't we exclude a language as soon as it has some untranslated strings, and compute the completion indicator only on the `fuzzy` basis (by the way, I didn't find any other cases)?
Or should we be able to choose the behavior through command-line options?
I have applied the algorithm detailed above and obtained the following output (after the extract and sync outputs):
./cs {'total': 1878, 'missings': 762, 'fuzzies': 251}
./de {'total': 1892, 'missings': 824, 'fuzzies': 437}
./es {'total': 1898, 'missings': 762, 'fuzzies': 251}
./fr {'total': 1882, 'missings': 770, 'fuzzies': 288}
./hu {'total': 1835, 'missings': 965, 'fuzzies': 180}
./id {'total': 1742, 'missings': 1528, 'fuzzies': 8}
./images {'total': 0, 'missings': 0, 'fuzzies': 0}
./it {'total': 1874, 'missings': 761, 'fuzzies': 253}
./ja {'total': 1770, 'missings': 1402, 'fuzzies': 38}
./nl {'total': 1744, 'missings': 1566, 'fuzzies': 7}
./pl {'total': 1835, 'missings': 1121, 'fuzzies': 160}
./pt_BR {'total': 1890, 'missings': 762, 'fuzzies': 251}
./ru {'total': 1883, 'missings': 823, 'fuzzies': 268}
./sv {'total': 1795, 'missings': 1233, 'fuzzies': 100}
./th {'total': 1750, 'missings': 1478, 'fuzzies': 22}
./tr {'total': 1744, 'missings': 1548, 'fuzzies': 29}
./zh_Hans {'total': 1881, 'missings': 762, 'fuzzies': 251}
./zh_Hant {'total': 1758, 'missings': 1580, 'fuzzies': 19}
./ko {'total': 1756, 'missings': 1463, 'fuzzies': 24}
Note that this is a raw print of my result dict for now. I will improve the readability. Should I output a result per PO file ?
For the moment I simply imported the parse function with `from match_and_merge_po_translations import parse_po_file`. It is working fine.
How come the total number of strings is not constant across the different languages?
I have made further progress. Here is an example output:
(venv) [nicolas@FixNux learn-gdscript]$ python i18n/check_and_integrate_translations.py -t ../learn-gdscript-translations/ -ES -c 60
WARN: Skipping strings extraction and POT files generation...
WARN: Skipping PO files extraction...
INFO: Parsing PO files in
INFO: Computing completion indicator for each language.
Language : CS - 60% including 13% fuzzy
Language : ES - 60% including 13% fuzzy
Language : IT - 60% including 13% fuzzy
Language : PT_BR - 60% including 13% fuzzy
Language : ZH_HANS - 60% including 13% fuzzy
Language : FR - 60% including 15% fuzzy
Language : RU - 57% including 14% fuzzy
Language : DE - 57% including 23% fuzzy
Language : HU - 48% including 9% fuzzy
Language : PL - 39% including 8% fuzzy
Language : SV - 32% including 5% fuzzy
Language : JA - 21% including 2% fuzzy
Language : KO - 17% including 1% fuzzy
Language : TH - 16% including 1% fuzzy
Language : ID - 13% including 0% fuzzy
Language : TR - 12% including 1% fuzzy
Language : ZH_HANT - 11% including 1% fuzzy
Language : NL - 11% including 0% fuzzy
INFO: Integrating languages above 60%.
Copying ./cs
Copying ./es
Copying ./it
Copying ./pt_BR
Copying ./zh_Hans
Copying ./fr
For now I opted for a completion indicator based only on missing strings, with the fuzzy share given as extra detail. But of course it is easy to change and adapt to any formula that suits the team.
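For reference, the indicator boils down to this; applied to the cs counts from my earlier comment, it gives roughly the 60% / 13% reported above:

```python
# Completion indicator, illustrated with the cs counts from my earlier comment.
counts = {"total": 1878, "missings": 762, "fuzzies": 251}

completion = 100 * (counts["total"] - counts["missings"]) / counts["total"]
fuzzy_share = 100 * counts["fuzzies"] / counts["total"]

print(f"cs: {completion:.1f}% complete, {fuzzy_share:.1f}% fuzzy")
# -> cs: 59.4% complete, 13.4% fuzzy
```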
The copy is not actually performed for now; the script only displays the folders that would be integrated, without doing it.
If you want to check the code, it is present in my forked repo for now : https://github.com/ml4nC3/learn-gdscript/blob/main/i18n/check_and_integrate_translations.py
Let me know if this fits what you expected or not.
EDIT: Spanish at 60% feels pretty low for a language that is currently considered complete... Is there something to investigate here?
Hi NathanLovato, Could you please check the algorithm description below and let me know if I'm mistaken?
Sounds great, and your task is to code points 4 to 6, which is what you've been working on so far it seems. So perfect.
Also, I didn't get the meaning of the `entry.is_python_format` property. Should I take this value into account in the script? EDIT: Is it a tag used by the application to determine that it has to apply Python code formatting?
If I remember correctly, it is a tag added by the Python library that extracts text strings from GDScript and scene files in Godot. This library is used in `extract.py`. I just added a line in the parser to preserve this tag because I wasn't sure if the library used it to update entries in POT files.
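Concretely, it corresponds to the `#, python-format` flag on a PO entry. If you want to inspect it, here is a quick sketch using Babel (the file path is only an example):

```python
# Quick sketch: list the entries carrying the python-format flag in a PO file.
# The path below is only an example.
from babel.messages.pofile import read_po

with open("i18n/fr.po", "rb") as file:
    catalog = read_po(file)

for message in catalog:
    if message.id and "python-format" in message.flags:
        print("python-format entry:", message.id)
```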
EDIT: Spanish at 60% feels pretty low for a language that is currently considered complete... Is there something to investigate here?
There's something to investigate, yes. Several languages which supposedly had all entries translated are showing 60% completion. FR, ES, IT, etc. had complete translations. The fuzzy count seems about right as we tweaked many sentences in the source app that now need review. There are certainly a couple of new entries since these translations were done, but not 40%.
Thinking about this further, does it really make sense to integrate a language that has empty `msgstr` values?
It can make sense, yes, but it's something to consider. For one, if there are just a handful of translations missing and a language doesn't have translators, users could still benefit from a 90%+ complete translation.
Then, translators and the community generally work linearly through the lessons. So they might have lessons 1 through 12 translated, and 13 and up unfinished. In this case too, for users who don't speak English, 12 lessons is better than 0, I think.
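That also suggests reporting completion per PO file (one per lesson), which would show where a language stops being translated. A rough sketch, assuming one `.po` file per lesson in each language folder, and using Babel only for illustration:

```python
# Rough per-lesson report, assuming each language folder holds one .po file per lesson.
# Babel is used only for illustration; a mini-parser could do the same job.
from pathlib import Path

from babel.messages.pofile import read_po


def report_language(folder):
    for po_path in sorted(Path(folder).glob("*.po")):
        with open(po_path, "rb") as file:
            catalog = read_po(file)
        entries = [message for message in catalog if message.id]
        translated = [m for m in entries if m.string and not m.fuzzy]
        percent = 100 * len(translated) / max(len(entries), 1)
        print(f"{po_path.name}: {percent:.0f}% translated")


# Example call (path is illustrative only):
# report_language("../learn-gdscript-translations/fr")
```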
Another possible task after closing this issue would be outputting metadata that the app could use at runtime to tell users whether a translation is complete and how to contribute translations. But it's perhaps a stretch considering that we'll be working on the Godot 4 remake of this app.
How come the total number of strings is not constant across the different languages?
This can be due to the POT and PO files not having been synced for a good while, especially as some languages get added via a web platform, Weblate.
Unfortunately, we constantly have more on our plates than we can handle, so in the absence of fully automated systems or contributors giving us a hand, we can't keep up with some of the manual things we have to do (we have... 70 open-source repositories these days, I think? We can't keep up with that with a team of only 4 full-time people).
Thanks for your feedback !
There's something to investigate, yes. Several languages which supposedly had all entries translated are showing 60% completion. FR, ES, IT, etc. had complete translations.
I've spent a bit of time trying to understand what's going on. From what I've seen, there are mismatches in `msgid` values because some have been split. A few sentences exist in the PO file both as a whole paragraph and split into several entries. Translations generally exist only for the whole paragraph. I've taken screenshots that I'll share with you later.
I have also noticed a strange behavior when the app loads languages. Switching to Spanish first gave me a complete translation. When I switched to French, a lot of sentences were not found and were displayed untranslated, although I was able to find them in the PO file. Strange thing: when I switched back to Spanish, the untranslated sentences were the same as in French... I'll play further with this scenario to try to understand what's going on.
So my guess is that the issue is mainly coming from `extract.py`, or a library it is using. In short, the extracted POT file does not contain the `msgid` values the application is actually expecting.
Then I suppose the merge step tries to do its best but ends up messing around with inconsistent values. It would probably work better with the actual POT ids the application will look for.
In the example I'm using, right from `lesson1`, the first few sentences, here is what the working PO files look like:
:+1: When I load those PO files and launch the app, I get correct translations for both Spanish and French.
You can see that the first paragraph here shows as a whole `msgid`.
If I run `extract.py`, I get the following POT file:
:-1: From this point you can already see that the `msgid` has been split into 2 entries.
If I run the whole process, here is what I get in the new PO files :
Observations :
- the `msgstr` associated with the first `msgid`, AND in the second `msgstr`.
- the `msgid` value...

I'll try to dive into `extract.py` to see if I can spot something, but if you have any insights to share, that would be helpful!
Well, it was not that hard to find... It seems this behavior is intentional, as explained in the comment of your commit from last September:
Have you integrated a PO file into the application since you implemented this split? If yes, how do you merge the blocks back so that the application finds the strings it expects?
Okay, so I've reverted your commit on my forked repo, and indeed the values are no longer split. The immediate effect on the `check_and_integrate_translations.py` script is that we jump from 60% completion to 97%:
python i18n/check_and_integrate_translations.py -t ../learn-gdscript-translations/ -ES -c 96
WARN: Skipping strings extraction and POT files generation...
WARN: Skipping PO files merging with POT...
INFO: Parsing PO files and counting missing translations
INFO: Computing completion indicator for each language.
Language : CS - 97% including 21% fuzzy
Language : ES - 97% including 21% fuzzy
Language : IT - 97% including 22% fuzzy
Language : PT_BR - 97% including 21% fuzzy
Language : ZH_HANS - 97% including 21% fuzzy
Language : FR - 96% including 25% fuzzy
Language : RU - 92% including 24% fuzzy
Language : DE - 92% including 41% fuzzy
Language : HU - 77% including 20% fuzzy
Language : PL - 63% including 23% fuzzy
Language : SV - 52% including 17% fuzzy
Language : JA - 35% including 10% fuzzy
Language : KO - 28% including 7% fuzzy
Language : TH - 26% including 7% fuzzy
Language : ID - 21% including 3% fuzzy
Language : TR - 19% including 14% fuzzy
Language : ZH_HANT - 17% including 10% fuzzy
Language : NL - 17% including 3% fuzzy
INFO: Integrating languages above 96%.
Copying ./cs
Copying ./es
Copying ./it
Copying ./pt_BR
Copying ./zh_Hans
Copying ./fr
I guess we cannot simply revert your commit, as you probably did it for a good reason. But it seems to be the main cause of the id discrepancies.
However, another issue is now showing: line breaks are not at the same position between the POT and PO files, as we can see here on the first paragraph:
If I correct the value manually, then the translation shows correctly in the application.
On this line break issue, it seems to be linked to a default behavior of Babel. The `pofile.write_po()` method has a `width` parameter that is set to 76 by default.
The documentation says :
:param width: the maximum line width for the generated output; use `None`, 0, or a negative number to completely disable line wrapping
In `extract.py` we can disable line wrapping:
```python
with open(output_file, "wb") as file:
    pofile.write_po(
        fileobj=file,
        catalog=cat,
        width=-1,  # a negative width disables line wrapping
    )
```
And it does indeed stop the POT file from having its lines wrapped.
However, it is only halfway to a solution, because the current PO files were already created with wrapped lines, so `sync_translations.py` actually keeps the wrapped versions of the string ids.
Hence I'm seeing two approaches for a full resolution: either the PO files' `msgid` values should be made to match the POT files generated with line wrapping disabled on the `write_po()` call, or the application should be adapted so that it finds the strings regardless of the wrapping (a rough sketch of the first approach is below).

Let me know how you feel about this. Am I going too far? From this point I'll wait for your feedback before progressing further.
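As a rough illustration of the first approach: re-serialize an existing PO file with Babel and wrapping disabled, so that its layout matches the unwrapped POT output. The path is only an example and I haven't tested this against the real translation files yet:

```python
# Rough illustration of the first approach: rewrite an existing PO file with
# line wrapping disabled. The path below is only an example.
from babel.messages.pofile import read_po, write_po


def rewrite_without_wrapping(po_path):
    with open(po_path, "rb") as file:
        catalog = read_po(file)
    with open(po_path, "wb") as file:
        write_po(file, catalog, width=0)  # 0 (or None/negative) disables wrapping


rewrite_without_wrapping("../learn-gdscript-translations/fr/lesson1.po")
```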
Well, it was not that hard to find... It seems this behavior is intentional, as explained in the comment of your commit from last September:
I committed that? I remember working on this: the plan was to split translation entries to have one line per entry, to make updating translations much more efficient. But we ended up canceling the plan as we were planning to shift focus to Godot 4, as then translation strings wouldn't need to change much anymore (we wouldn't make changes to the content as often as last year).
It should probably be reverted, though I need to double-check on my end to ensure I didn't misunderstand anything.
Hence I'm seeing two approaches for a full resolution :
Which approach would you like to go with? Based on your tests, did one approach seem more reliable than the other? Feel free to go for this.
And please let me know if there's any way that I can help. Feel free to open PRs anytime for me to test or provide a code review.
A quick note for when you open a PR: for the sake of efficiency, within the team we directly add commits to teammates' pull requests. Would you be fine if we did that with you as well, or would you prefer us to leave comments and let you make every code change yourself? We're primarily talking about bug fixes if any, and possibly tweaking variable names, moving comments to a docstring... quick refactoring work, primarily to be sure we can reread the code easily months down the line.
It should probably be reverted, though I need to double-check on my end to ensure I didn't misunderstand anything.
Okay, let me know !
Which approach would you like to go with? Based on your tests, did one approach seem more reliable than the other?
My gut feeling is that the application should not be sensitive to such a difference. Furthermore, it is a default behavior in Babel, so I guess we should expect the application to adapt. Although I find it a bit strange to wrap text in a technical storage file... I guess PO files are more or less designed to be read by humans, so let's keep it that way.
I'll first go and confirm my hypothesis on how the application reads the strings, and see if there is a simple way of tweaking it. This is also an occasion for me to dive into the UI side of Godot, which is my short-term goal right now!
Would you be fine if we did that with you as well, or would you prefer us to leave comments and let you make every code change yourself?
I have absolutely no issue with someone changing my code. Open source is meant to be collaborative, and it is crucial that the core team keeps ownership of everything that is part of their code base. Just let me know what has been done and why; I'm here to learn!
And please let me know if there's any way that I can help. Feel free to open PRs anytime for me to test or provide a code review.
Okay, I'll check that first. I guess it would be better if you reviewed what I've done up to now, but I'm not yet comfortable with the process: is it fine if I send you some code even though the solution is not yet fully functional? The next step for me is to read that pull request guide you shared earlier, and I'll get back to you on that.
I'll first go and confirm my hypothesis on how the application reads the strings, and see if there is a simple way of tweaking it. This is also an occasion for me to dive into the UI side of Godot, which is my short-term goal right now!
Sounds great, go for it then!
I guess PO files are more or less designed to be read by humans, so let's keep it that way.
I would think this default is more of an artifact of old guidelines, from when code editors didn't support long lines and line wrapping too well. Even not so far back: I've been using Emacs for some years, and long-line support and solid line wrapping weren't quite there yet when I started using it... around 2016 or so.
We would wrap lines like this on the Godot documentation, and there isn't really a need for it anymore other than staying consistent with documents from 9 years ago.
I guess it would be better if you reviewed what I've done up to now, but I'm not yet comfortable with the process
Actually, it's completely fine if you prefer to wait to be further along before creating a pull request. Whatever works.
I have absolutely no issue with someone changing my code. Open source is meant to be collaborative
I truly appreciate not only the time and efforts you're putting into this, but also the great mindset and communication. Thank you very much for that!
I've sent a pull request, more as a test to see how it goes than as an actual request.
I've seen there is a requirement on commit messages that I might not have respected. I will probably have to correct that at least, but do let me know if there is something else.
Thank you very much! I will review and test your code as soon as possible.
I've seen there is a requirement on commit messages that I might not have respected.
Please don't worry about that. I can change the message upon merging the pull request.
Issue: We need a build system that can automatically check if translations for a given language are (nearly) complete and integrate those translations into the app. This build script should be able to add or remove languages automatically depending on the completion level of a given language.
Workflow:

1. Run `i18n/extract.py` from the project's root folder.
2. Run `sync_translations.py` to update all the PO files, which contain the actual translations per language.

We need to ensure that lessons are fully translated (maybe over 90% or 95% translated is OK) before integrating translation files into the app. This feature should be included as part of the build script. The script should also warn about languages with outdated or incomplete translations.
To register languages in the app (show in menu), you can:
Unfortunately, tooling for PO files is not great, so I suggest implementing our own mini-parser to check the completion percentage. A minimal parser is available at this link: https://github.com/GDQuest/learn-gdscript/blob/b53ba8e1cc5da074520b0e061c50c02d2a5fe3fc/i18n/match_and_merge_po_translations.py
It can be used as a base to read entries and calculate the translation percentage. In PO files, entries that need translation have the `fuzzy` keyword. I think that's all we need to check, but be sure to double-check whether other cases exist.