Convert all `.tr` methods into `.translate` to support i18n

biolab / orange3

Convert all `.tr` methods into `.translate` to support i18n #5409

Closed inthedark122 closed 3 years ago

inthedark122 commented 3 years ago

I investigated the code bases of orange3 and PyQt5 and found that it is currently not possible to add custom localization files. We can use pylupdate5 to generate .ts files as the basis for localization. I made them, but PyQt5 doesn't support those files with the .tr function. This is a limitation related to the scope of classes. You can read more in the documentation: https://www.riverbankcomputing.com/static/Docs/PyQt5/i18n.html

To add custom translation files we can:

1. Generate a .ts file by running: pylupdate5 /path/to//lib/python3.9/site-packages/orangecanvas/application/canvasmain.py -ts orange.<lang>.ts
2. Translate orange.<lang>.ts in the Qt Linguist GUI application, which can be started with: qt5-tools linguist
3. After translating, generate the release file via File -> Release and use it as an option in orange-canvas.

What's your use case?

[ ] Make translation for other language

What's your proposed solution?

[ ] Convert all .tr methods into .translate
[ ] Make option in CLI to retrieve path to language file

janezd commented 3 years ago

TL;DR: Truly supporting i18n is much much more than renaming tr to translate. It has a high price tag for developers. What I fear is that somebody would be all hyped-up for translating Orange, we'd spend weeks and weeks preparing the code for it (see below), (s)he would translate some part and then disappear, leaving us with all that rubbish in the code. Translation requires a serious commitment from both sides.

Longer version

Changing tr to translate is the smallest of the problems here.

1. There are only a few strings that are marked as tr. Calling a different function wouldn't solve anything. (Related to that, @markotoplak said he'd rather have (easily) editable translations, and I, too, have a preference for gettext over Qt's .ts files.) We'd need to go through all the code and mark all strings.
2. The function for translation was traditionally _ in order not to decrease the code readability (and line lengths). Lately, _ is used for redundant arguments or variables, hence we'd need to use something longer. tr is OK-ish, while translate(...) or gettext(...) adds around 10 characters of rubbish in many many places. This is of course the required price to pay for translatable code -- but are we willing to pay it, and keep paying it?
3. Orange uses f-strings, except in very old code. f-strings are expressions and thus untranslatable. To translate them, we'd need to revert to using format, which adds another burden to code readability. Moreover, f-strings are elegant and clear, format is clumsy and mistake-prone (in comparison). At least for me, abandoning f-strings is almost beyond the red line.
4. There are hard-coded English-specific strings, like adding 's' for plural. Supporting translation would require those to be rewritten. (This one hurts least.)
5. This one is the worst: it's not something we implement and we're done, but it requires a continuous commitment -- all new strings have to be properly marked for translation etc. Otherwise translators would invest time in one version and find it impossible to translate the next. Furthermore, we occasionally change a string just for the sake of slightly better wording. If we make Orange translatable, we'd have to stop doing this.
6. Orange has many add-ons. Will people translate them, too? Or suffer an awful-looking mixture of original and translated text in one widget (because some parts may be inherited)?

These are just the points from the top of my head.

I'd estimate points 2-5 would take a few weeks (so, probably: months) of somebody's time. For a single shot, not including maintenance.

I need to mention that we had a translatable version, with translations to Slovenian and Japanese, 11 years ago. A single released version. It took me almost two months, and it was abandoned soon after.

Our core team is Slovenian. I guess the only way we can make this work is for us to have a strong interest in having a Slovenian translation and investing our time in making and maintaining it.

janezd commented 3 years ago

Closed due to inactivity.

bigeyex commented 3 years ago

Hi, just came here from Discord to leave a note about my attempt to touch on this topic.

Currently, I'm trying to make a (Chinese) translation on my own branch. I'm planning to merge new changes from upstream at intervals and see whether my solution lasts (then we could discuss making changes in the main project or other moves).

If interested, my changes are listed in https://github.com/bigeyex/orange3 https://github.com/bigeyex/orange-canvas-core https://github.com/bigeyex/orange-widget-base

My approach

After some trial and error, I decided to use Python gettext instead of PyQt's translate/tr, because some steps happen before the Qt environment is set up. Since changing the language on the fly is not my priority, this saves me a lot of trouble.
I use the global underscore function _() and renamed the existing ".tr(" to "_(". If the underscore is replaced, I reassign it to gettext. This works fine so far (as for the f-strings, "_(f'some string value')" actually works fine).
So far I have translated the menu and the Data widgets. Although translation happens across multiple projects, I place my translation in a single file under (orange3)/locale. This makes translations easier to manage (for example, adding a Slovenian translation may only involve language work on a single .po file).

bigeyex commented 3 years ago

For questions raised by @janezd :
(Just some thoughts in case it will be helpful)

I believe gettext and "_" (as in the last comment) work fine for 1-3.
For 4, there is indeed some pain, mostly in cases where strings are used both as text and as identifiers (I had to use some workarounds in my attempt).
For 5, as long as people accept that open-source projects may have partially translated strings, it works fine. Changing the wording may mean certain text needs to be re-translated, but that is common practice in open-source projects. See the Scratch project, which uses Transifex (there is a list of such online platforms, many of them free for open-source projects) to coordinate translators around the world (this issue is harder in Scratch since its users are kids, and they indeed have a team for this).
For 6, Scratch uses a framework for add-on translation and leaves it in the add-on developers' hands. This does need some consideration, but maybe it's not the top priority for now?

janezd commented 3 years ago

I wouldn't like to discourage you: I understand the need and appreciate your enthusiasm. I'm just being cautious.

1. OK
2. Yes, but it makes the code inconsistent. And it prevents using _ as a throwaway name in code where _ is already used for gettext.
3. I don't see how translating f-strings could work. I believe it doesn't. Have you tried? See below.
4. It would be our obligation to do this. But OK, I wrote that this one is not so hard.
5. If we commit, we commit. We'd be obliged to not burden the translators, and we'd feel bad if we did. That's why I wouldn't want to commit without giving it serious thought.
6. This wouldn't look nice.

I was involved in translating Scratch to Slovenian. It's incomparable. The number of messages in Scratch is very small, while in Orange it's huge. Scratch's messages are also not actively changed. Scratch doesn't get new blocks all the time. It doesn't have problems with f-strings and _...

Regarding translation of f-strings:

>>> def _(x):
...    if x == 'f"{n} instances"':
...       return 'f"{n} primerov"'
...    return x
...
>>> n = 12
>>> f"{n} instances"
'12 instances'
>>> _(f"{n} instances")
'12 instances'

The last line should be 12 primerov, because it's supposed to be translated. But it's of course not because _ receives a string which is already interpolated.
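For contrast, here is a minimal sketch of the variant that does translate: the lookup receives the template and interpolation happens afterwards with str.format. The dictionary `CATALOG` is a toy stand-in for a real .po/.mo file and is purely hypothetical.

```python
# Toy catalog standing in for a real .po/.mo file (hypothetical).
CATALOG = {"{n} instances": "{n} primerov"}


def _(message):
    return CATALOG.get(message, message)


n = 12
# The f-string is interpolated first, so the lookup key is "12 instances".
eager = _(f"{n} instances")
# Here the lookup sees the template; interpolation happens afterwards.
deferred = _("{n} instances").format(n=n)

print(eager)     # 12 instances
print(deferred)  # 12 primerov
```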

bigeyex commented 3 years ago

@janezd You're right on 3 - I double-checked my code, and gettext does not translate f-strings. There are some tricks (the best I saw is using {_('static text')} inside f-strings), but it may need some time to try, or we wait for some PEP. I surveyed Mu - another Python project - and unfortunately it doesn't use f-strings...

I understand the need to be cautious about adding new structures to the codebase. That's why experiments and discussions might be helpful. (what I did before is merely a hobby project and I'm not pushing an agenda of making Orange translatable)

For other questions:

When "_" is occasionally used as a throwaway variable, it can be "found back" with `_ = translate.gettext`. I admit it's not perfectly clean, but since throwaway variables are not supposed to be actively used, and since "_" used as a function always refers to gettext, it shouldn't be too much trouble.
While Scratch doesn't get new blocks all the time, its flashcards, documentation, and examples are actively refreshed. They (Transifex) notify the translators (and verifiers, as I was at one time) when new versions are about to be released, so they can fix new issues if the software is used in serious educational settings. Maybe measures like these will be needed after some Orange books (e.g. in Chinese) are published and used in classrooms. But from my personal experience, I wouldn't see it as a burden unless the team makes changes like changing the capitalization pattern in every sentence.
It depends on whether an add-on is considered a "core" add-on of Orange. If it is a core one, maybe it's better maintained in the main translation file (or at least in the same Transifex workspace); otherwise, in my own experience, an add-on in Chinese is certainly best, but having an English version available is already more than welcome. However, I agree that before "Orange is translatable" is announced, some mechanism for add-on translations should be set up.
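The clash between "_" as a throwaway name and "_" as the translation function can be shown in a few lines. This is a sketch only; `broken` and `count_rows` are hypothetical functions, and `gettext.gettext` stands in for whatever translation function a project would actually install.

```python
import gettext

_ = gettext.gettext  # module-level translation function


def broken(rows):
    for _ in rows:  # "_" as a throwaway makes it a local name in this function
        pass
    try:
        return _("done")  # "_" is now the last row, not gettext
    except TypeError as err:
        return f"failed: {type(err).__name__}"


def count_rows(rows):
    total = 0
    for _ in rows:
        total += 1
    _ = gettext.gettext  # "found back" by re-binding, as described above
    return _("{n} rows").format(n=total)


print(broken([1, 2, 3]))      # failed: TypeError
print(count_rows([1, 2, 3]))  # 3 rows
```

Re-binding works, but every function that uses "_" both ways needs the extra line, which is the inconsistency raised earlier in the thread.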

janezd commented 3 years ago

f-strings are expressions, so no gettext-style utility will work with them. See what "f'John is {x} years old.'" compiles into:

>>> from dis import dis
>>> dis(compile("f'John is {x} years old.'", "<string>", "eval"))
  1           0 LOAD_CONST               0 ('John is ')
              2 LOAD_NAME                0 (x)
              4 FORMAT_VALUE             0
              6 LOAD_CONST               1 (' years old.')
              8 BUILD_STRING             3
             10 RETURN_VALUE

The whole string is never "materialized", so it can't be passed to any function like _.
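The same structure is visible one level up, in the syntax tree, without depending on version-specific bytecode. A minimal sketch, assuming Python 3.9 or later for `ast.unparse`:

```python
import ast

expr = ast.parse("f'John is {x} years old.'", mode="eval").body

# An f-string parses to a JoinedStr node whose children are the literal
# fragments and the embedded expressions, never one whole template string.
parts = [
    part.value
    if isinstance(part, ast.Constant)
    else "<expr: " + ast.unparse(part.value) + ">"
    for part in expr.values
]

print(type(expr).__name__)  # JoinedStr
print(parts)                # ['John is ', '<expr: x>', ' years old.']
```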

I also can't imagine any PEP that would solve it. F-strings are really fast because there are no dunder methods (or functions like format) involved. I doubt anyone would want to slow down the interpolation by adding such overhead.

Django, for instance, doesn't use f-strings for this reason.

bigeyex commented 3 years ago

A possible approach could be introducing something like "ft-string", which allows gettext to extract the text, and allow the template string to be replaced before compiling.

I saw the gettext people trying to allow things like f"{_('foo bar')}" to be captured when baking templates. No idea whether such a PEP will be proposed (maybe there are not so many users asking for it?)

Django indeed doesn't use f-strings in translatable scenarios, although they allow using them in other settings.

janezd commented 3 years ago

> "ft-string", which allows gettext to extract the text, and allow the template string to be replaced before compiling.

How would that work? In f-strings there is no string to pass. When you say f"John is {x} years old.", a string like "John is {x} years old." is never constructed; it never appears in memory. There is nothing to be sent to gettext. Carefully read the above disassembly again: there are actually three strings, "John is ", str(x) and " years old.", which Python pastes together. When the entire string is composed (and can be passed to gettext), the value of x is already inserted. Before inserting x, there is no string. No way around this.

This is also explained in the link you sent. Read it through.

> things like f"{_('foo bar')}"

It's even worse: they use nested f-strings. I suppose this was meant as a hack, not to be actually used, because it defeats the purpose of f-strings; this is more complicated than str.format.

irgolic commented 3 years ago

> > "ft-string", which allows gettext to extract the text, and allow the template string to be replaced before compiling.

> How would that work? In f-strings there is no string to pass. When you say f"John is {x} years old", a string like "John is {x} years old." is never constructed, it never appears in the memory.

I think they were referring to a pre-compilation step, akin to CSS minification. If a solution like that exists, it would apply to our use case, and might also allow for more complex translations in the vein of flipping word order.

irgolic commented 3 years ago

I went down this mailing list https://mail.python.org/pipermail/python-ideas/2018-September/053441.html

One interesting thing that got brought up was PEP 501, which was deferred: https://www.python.org/dev/peps/pep-0501/

But they eventually land on the idea @bigeyex proposed, writing a preprocessor/precompilation step with something like parso https://parso.readthedocs.io/en/latest/index.html.

> How about this: Have a script that runs over your code, looking for "translatable f-strings":
>     _(f'Hi {user}')
> and replaces them with actually-translatable strings:
>     _('Hi %s') % (user,)
>     _('Hi {user}').format(user=user)
> https://mail.python.org/pipermail/python-ideas/2018-September/053552.html
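Such a rewriting step could be sketched with the standard ast module instead of parso. This is only an illustration under stated assumptions: it needs Python 3.9 or later for `ast.unparse`, it handles just the simplest case (plain variable names in the braces, no conversions or format specs), and `FStringToFormat` and `rewrite` are hypothetical names.

```python
import ast


class FStringToFormat(ast.NodeTransformer):
    """Rewrite _(f"...{name}...") into _("...{name}...").format(name=name)."""

    def visit_Call(self, node):
        self.generic_visit(node)
        is_gettext = isinstance(node.func, ast.Name) and node.func.id == "_"
        if not (is_gettext and len(node.args) == 1
                and isinstance(node.args[0], ast.JoinedStr)):
            return node
        template, names = [], []
        for part in node.args[0].values:
            if isinstance(part, ast.Constant):
                # Literal braces must be escaped in a str.format template.
                template.append(part.value.replace("{", "{{").replace("}", "}}"))
            elif (isinstance(part, ast.FormattedValue)
                  and isinstance(part.value, ast.Name)
                  and part.conversion == -1 and part.format_spec is None):
                template.append("{" + part.value.id + "}")
                if part.value.id not in names:
                    names.append(part.value.id)
            else:
                return node  # too complex for this sketch; leave untouched
        lookup = ast.Call(func=node.func,
                          args=[ast.Constant(value="".join(template))],
                          keywords=[])
        return ast.Call(
            func=ast.Attribute(value=lookup, attr="format", ctx=ast.Load()),
            args=[],
            keywords=[ast.keyword(arg=n, value=ast.Name(id=n, ctx=ast.Load()))
                      for n in names],
        )


def rewrite(source):
    tree = FStringToFormat().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


print(rewrite("_(f'Hi {user}')"))
```

Note that round-tripping through `ast.unparse` discards comments and original formatting, so a real tool would more likely patch the source text in place.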

irgolic commented 3 years ago

Also, to reduce clutter, could we apply translations in the orangewidget.gui modules? Most strings/names go through there, don't they?

janezd commented 3 years ago

I haven't understood @bigeyex this way. And the example from the mailing list is vague about how to find such strings. Regular expressions?!

I was thinking about the last moment in which f-strings are still whole, which led me towards a similar idea for translation, but it's one that doesn't require any changes in Orange's current code.

The second part would be a script, which tokenizes a file, extracts all STRING tokens and saves them into a .pot file.
Translators use .pot file as usual (create a .po, translate, merge...)
The last part of the script demonstrates translating source. For this demo, it will "translate" all longer strings into uppercase. In the real world, it would read a .mo or .po file and replace all strings it finds.

Save this file as create_pot.py, run it and see create_pot-trans.py.

import tokenize

# Just some random stuff with strings, which need to be "translated"
print("This is a string")
x = 42
if x > 1:
    print(f"And there could be {x} more.")
print("And this one is {}.".format("formatted"))

# This part of the script is an equivalent of xgettext
fname = "create_pot.py"

with open("messages.pot", "wt") as pot, open(fname, "rb") as source:
    for token in tokenize.tokenize(source.readline):
        if token.type == tokenize.STRING:
            msg = token.string
            if msg[0] == "f":
                msg = msg[1:]
            pot.write(f"""
#: {fname}:{token.start[0]}
msgid {msg}
msgstr ""
""")

# This part demonstrates translation of source file.
# Note: since Python 3.12, f-strings are tokenized as FSTRING_START /
# FSTRING_MIDDLE / FSTRING_END rather than STRING, so f-strings are only
# handled here on older versions.
tokens = []
with open(fname, "rb") as source:
    for token in tokenize.tokenize(source.readline):
        if token.type == tokenize.STRING:
            msg = token.string
            # Remember whether this is an f-string and strip the prefix
            fstring = msg[0] == "f"
            if fstring:
                msg = msg[1:]
            # For a demo, it converts all strings of at least 15 chars
            # (quotes included) into upper case.
            # In practice, it would read translations from .mo or .po file
            if len(msg) >= 15:
                token = token._replace(string="f" * fstring + msg.upper())
        tokens.append(token)

with open(fname[:-3] + "-trans.py", "wb") as trans:
    trans.write(tokenize.untokenize(tokens))

Note that this does not require any _ or anything. No changes in code.
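The "real world" variant mentioned above, which replaces strings from a catalog rather than upper-casing them, might look roughly like this. It is a sketch under assumptions: `translate_source` is a hypothetical helper, the catalog is a plain dict standing in for a parsed .po file, and only plain string literals are handled (f-strings are skipped).

```python
import ast
import io
import tokenize


def translate_source(source, catalog):
    """Return `source` with plain string literals replaced from `catalog`."""
    tokens = []
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.STRING:
            try:
                text = ast.literal_eval(token.string)
            except (ValueError, SyntaxError):
                text = None  # e.g. an f-string token; skipped in this sketch
            if isinstance(text, str) and text in catalog:
                # repr() yields a valid literal; the original start/end
                # positions are kept, which is what untokenize relies on.
                token = token._replace(string=repr(catalog[text]))
        tokens.append(token)
    return tokenize.untokenize(tokens)


code = 'msg = "This is a string"\nn = 1\n'
print(translate_source(code, {"This is a string": "To je niz"}))
```

Strings that are not in the catalog, including identifiers-as-strings, pass through unchanged, so a partial translation still yields runnable code.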

However, gettext has mechanisms for handling plural forms. I can think of a solution here, too, but it's ... somewhat ugly.

> Also, to reduce clutter, could we apply translations in the orangewidget.gui modules? Most strings/names go through there, don't they?

Some, but very far from all. Aleš actually avoids gui, and has his very sensible reasons to.

irgolic commented 3 years ago

Great stuff @janezd. We've had numerous people reach out to us about translations. If we can get something like this going, I think this is something the extended community could really contribute to.

Two thoughts:

I've no experience with gettext, but I looked up their plural handling mechanisms, looks cool and intricate. Could this type of solution be connected with the internals of gettext to use their translation/plural handling system? Or maybe could this be written as an extension of gettext?

Could we write this as an import hook, making it a completely on-the-fly thing? If so, using a different language in Orange would definitely require a restart, which I think is perfectly fine from a UX standpoint.

janezd commented 3 years ago

This changes sources and can only help someone prepare an installation in another language. Changing sources in place and forcing python to recompile them would be a very bad idea.

This solution has nothing to do with gettext, except for using its file format for storing the messages.

Think again what this solution does: it changes hard-coded messages, while gettext translates them on the fly. Gettext can adapt to plural forms (the mechanism is not very intricate, it's quite trivial), while this solution obviously can't, without adding some if's (where I would hesitate to go).
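To make the plural mechanism concrete: a catalog declares a small rule that maps a number to the index of a plural form, and gettext evaluates it at run time. The sketch below transcribes into Python the rule commonly used in Slovenian .po headers; the rule and the four word forms are stated from general knowledge, not taken from this thread (only "12 primerov" appears above), so treat them as assumptions.

```python
def slovenian_plural_index(n):
    # Python transcription of the commonly used Plural-Forms header:
    # nplurals=4; plural=(n%100==1 ? 0 : n%100==2 ? 1
    #                     : n%100==3 || n%100==4 ? 2 : 3)
    rest = n % 100
    if rest == 1:
        return 0
    if rest == 2:
        return 1
    if rest in (3, 4):
        return 2
    return 3


# Hypothetical catalog entry: one msgstr per plural form of "{n} instances".
FORMS = ["{n} primer", "{n} primera", "{n} primeri", "{n} primerov"]


def instances(n):
    return FORMS[slovenian_plural_index(n)].format(n=n)


for n in (1, 2, 3, 5, 12, 102):
    print(instances(n))
```

The form depends on the value of n at run time, which is why a solution that bakes translated strings into the source cannot reproduce it without extra conditionals.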

I posted this as an example of how somebody could translate Orange and relatively easily maintain the translation without core developers being concerned or involved. No import hooks or similar tricks that are bound to cause us headaches.