Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

A Weblate instance for data.table translations? #6370

Open aitap opened 2 months ago

aitap commented 2 months ago

Cc: @tdhock

Dear @Rdatatable/translators,

Would any of the existing translation teams be interested in moving to a Weblate instance?

Many languages supported by the R project itself are now being translated on the translate.rx.studio instance instead of translation team leaders manually submitting archives of src/library/*/po/*.po files to the R core team shortly before a new version is released. Having participated in both processes, I must say that Weblate is more convenient (because someone else manages the repositories and regularly prepares patches), although of course tastes can differ.

It turns out that while Weblate's Markdown mode does not understand R Markdown well enough, it's still possible to translate the vignettes as plain text files. Every vignette will have to become a new translation project, with a file format of Plain text file, the file mask specifying the future translated vignettes as (for example) vignettes/datatable-intro-*.Rmd and a separately specified monolingual base language file vignettes/datatable-intro.Rmd:

[Screenshot: Weblate settings for a "plain text" translation of the data.table intro vignette]

You will have to disallow adding new translations, but as soon as a file matching the mask appears in the repository, Weblate will recognise it as available for translation.
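To illustrate how such a file mask picks out translated vignettes, here is a rough Python sketch. It uses shell-style globbing from the standard library as a stand-in; Weblate's own matching logic may differ in detail, and the helper name `language_of` is made up for this example.

```python
from fnmatch import fnmatchcase

# File mask and base file as described above; "*" stands for the language code.
FILE_MASK = "vignettes/datatable-intro-*.Rmd"
BASE_FILE = "vignettes/datatable-intro.Rmd"


def language_of(path, mask=FILE_MASK):
    """Return the part of `path` matched by "*" in `mask`, or None if there is no match.

    Hypothetical helper: shell-style globbing only approximates what Weblate does.
    """
    prefix, suffix = mask.split("*", 1)
    if not fnmatchcase(path, mask):
        return None
    return path[len(prefix):len(path) - len(suffix)]
```

With this mask, `vignettes/datatable-intro-ru.Rmd` yields the language code `ru`, while the base file itself does not match, because the mask requires the extra `-` before the language code.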

@daroczig, if a data.table language team desires to move to https://translate.rx.studio, would it be possible to create data.table as a project there?

Alternatively, Codeberg Translate is a public Weblate instance hosted by Codeberg, where any free software project can get translated into various languages. We could register https://translate.codeberg.org/projects/data-table/ for the overall project. I was not aware of the option to move to https://translate.rx.studio, so the Russian translation team is currently set up at https://translate.codeberg.org/projects/data-table-ru/, but I could definitely move to https://translate.rx.studio or https://translate.codeberg.org/projects/data-table/ after we are done if that becomes an option.

rffontenelle commented 2 months ago

I find it very difficult to translate Markdown files when the markup syntax is not filtered out, which is what happens when they are read as plain text. One can easily mess up the document formatting, and such a formatting error won't pop up as a warning or error; the result will simply be formatted differently, which is undesirable.

I believe Weblate uses translate-toolkit to recognize files and convert them into PO files. I don't know the difference between .md and .Rmd, but maybe some work could be done based on their Markdown format converter to add support for .Rmd?

aitap commented 2 months ago

The main problem is the initial YAML header that Weblate currently tries to line-wrap as plain text. I think that code blocks are ignored for translation, which is mostly reasonable, except for rare cases where a comment could be translated too.
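As a minimal sketch of how the YAML header could be kept out of the translatable text, the following Python function splits a document into header and body. It assumes the header is the first block delimited by `---` lines; it is not code from Weblate or translate-toolkit, and `split_front_matter` is a name invented here.

```python
def split_front_matter(text):
    """Split an R Markdown document into (yaml_header, body).

    The header is returned verbatim, including both delimiter lines, so that
    a tool could pass it through untouched instead of line-wrapping it.
    Assumption: the header starts on the very first line with "---".
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text  # no header at all
    for i in range(1, len(lines)):
        # YAML allows "..." as well as "---" to close the block
        if lines[i].strip() in ("---", "..."):
            return "".join(lines[: i + 1]), "".join(lines[i + 1:])
    return "", text  # no closing delimiter: treat everything as body
```

Header and body concatenate back to the original text, so only the body would need to go through translation.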

I'll see what can be done for translate-toolkit. Maybe there is a way to make an extension over an existing format.

MichaelChirico commented 2 months ago

cc @eliocamp for vis on discussion of translating vignettes :)

@aitap ICYMI see #6221 for more on vignette translation.

daroczig commented 2 months ago

re using translate.rx.studio -- I'm happy to help with anything needed there if that's what the team decides to use :+1:

phgrosjean commented 2 months ago

@daroczig @eliocamp As we are currently translating the data.table vignettes into French, @ChristianWia and I have written an R package, rmdpo, that handles the YAML header and the chunks correctly. It is here: https://github.com/SciViews/rmdpo. It converts a .Rmd (or .qmd) file into a .po file, and then generates the translated .Rmd/.qmd file from the .po file with the translated strings. We use poEdit to translate these strings, as it is the tool that has been used for R translation from the beginning. rmdpo uses the mdpo Python library, which has to be installed first.

By the way, I have a related question: I see a move towards Weblate. OK, but as the French translator who uses poEdit (it is linked to DeepL, which greatly helps with the translations), how do I merge my .po/.mo files from poEdit with what is done in Weblate? The question is pertinent for vignettes too, since we also use poEdit together with rmdpo to translate these into French.

For vignettes, it is not sufficient to just translate strings in something like poEdit or Weblate. You must also test the translated vignettes, i.e., check that they knit without errors. So a remote platform like Weblate is not enough to do a complete job, I think.

aitap commented 2 months ago

@phgrosjean Thank you for mentioning rmdpo, this could be a much more convenient solution than translating bare .Rmd source! Poedit is a great tool. A team experienced with source control may well be more comfortable with Poedit, especially considering that it takes far fewer resources to run and can work offline.

Weblate keeps a clone of the Git repository containing the text to translate, so changes done outside Weblate can be merged using Git with a bit of care to avoid Weblate-side conflicts.

phgrosjean commented 2 months ago

@aitap Thanks for the link, but it does not help much when translation is done partly in Weblate and partly in poEdit + GitHub, with possible duplicated work by totally unrelated people. One solution would be to decide, for each language, whether Weblate or poEdit + GitHub is to be used... and to think about a better way to synchronise for the older solution. I am not happy to move to Weblate because I see no connection with DeepL (nowadays, all professional translators use something like DeepL to pre-translate and then rework that first draft; this is possible with poEdit Pro, but what about Weblate and DeepL?)

rffontenelle commented 2 months ago

A Weblate component can be set up with a .pot file so that it keeps updating the translation files and pushing them back. There would be no need for Poedit.

Weblate can also be set up with various machine translation engines like Google Translate, Amazon Translate, or DeepL. It is a matter of setting the API key and being aware of the billing that will come. Alternatively, a translator can copy the source string to the clipboard, paste it into a machine translation system/site, and then copy the translation back.

phgrosjean commented 2 months ago

@aitap @daroczig @MichaelChirico @eliocamp @rffontenelle Back to the initial question about the translation of R Markdown vignettes. We now have experience in translating the {data.table} vignettes. There is still the question of where to place translated vignettes, cf. #6221. For now, let's consider placing translated vignettes in a "<lang>" subdirectory (fr, es, ru, zn...), with the same name as the original vignette. For the French translation, we thus end up with:

-- vignettes
     |  datatable-intro.Rmd
     |  ...
     -- fr
     |  datatable-intro.Rmd (French version)
     |
     -- <other_lang>
     |   ...

Vignettes contain code and YAML directives that may refer to other files (images, datasets, CSS files, ...). Obviously, we won't duplicate all these files in each subdirectory, so relative links must be changed. For instance, in datatable-sd-usage.Rmd, the YAML header contains css: [default, css/toc.css], which must become css: [default, ../css/toc.css] in the translated version. Also, a dataset is loaded with load('Teams.RData'), which must become load('../Teams.RData') in an R chunk, plus something similar for a figure.
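This kind of path adjustment could be scripted. The Python sketch below is only an illustration under simple assumptions: the list of shared files is supplied by hand, and each path appears in the vignette exactly as written in that list. The function name `relocate_paths` is invented for this example and is not part of {rmdpo}.

```python
import re


def relocate_paths(text, files, prefix="../"):
    """Prefix each listed relative path with `prefix` wherever it occurs as a whole path.

    The lookarounds skip occurrences that are part of a longer path, so a path
    that already starts with "../" is left alone and the rewrite can be re-run.
    """
    for name in files:
        pattern = r"(?<![\w./-])" + re.escape(name) + r"(?![\w./-])"
        text = re.sub(pattern, lambda m: prefix + m.group(0), text)
    return text


# The two examples from datatable-sd-usage.Rmd mentioned above
shared = ["css/toc.css", "Teams.RData"]
yaml_line = relocate_paths("css: [default, css/toc.css]", shared)
chunk_line = relocate_paths("load('Teams.RData')", shared)
```

Here `yaml_line` becomes `css: [default, ../css/toc.css]` and `chunk_line` becomes `load('../Teams.RData')`. Paths built dynamically in R code would of course not be caught by a textual rewrite like this.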

In Weblate, the translator is disconnected from the context (working in a web browser, they do not even need to have R installed) and it is not easy to test if the translated vignette even compiles without error. With {rmdpo}, you have to work in a context where R is installed. You compile the .po file, translate the strings, create the translated .Rmd file and can knit it immediately to check that it compiles without error. If it does not, you change the .po file, rebuild the translated .Rmd, reknit and check (this could easily be automated). You can also immediately compare the layout of the final document with the original.
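The "knit and check" step could be automated along these lines. This is a hedged sketch: it assumes `Rscript` is on the PATH and that the vignette can be rendered with `rmarkdown::render()`, which may not be the engine a given vignette actually declares. The function names are made up for the example.

```python
import subprocess

# R expression that renders the file given as the first trailing argument.
# Assumption: the rmarkdown package is installed and suits the vignette.
R_EXPR = "rmarkdown::render(commandArgs(trailingOnly = TRUE)[1], quiet = TRUE)"


def render_command(rmd_path):
    """Build the Rscript command line for rendering one vignette."""
    return ["Rscript", "-e", R_EXPR, rmd_path]


def knits(rmd_path):
    """Try to render the vignette; return (ok, stderr_text).

    Raises FileNotFoundError if Rscript is not installed.
    """
    result = subprocess.run(render_command(rmd_path), capture_output=True, text=True)
    return result.returncode == 0, result.stderr
```

A translation workflow could call `knits("vignettes/fr/datatable-intro.Rmd")` after each rebuild of the translated file and show the captured error output to the translator when rendering fails.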

For vignettes, according to our experience, you cannot completely decouple the translation and the compilation/layout of the vignette, which is what you may be tempted to do with Weblate. Not good for vignettes.

aitap commented 2 months ago

@phgrosjean, I actually do agree that Weblate should not be required for a translation team. The workflow you are currently using at the https://github.com/phgrosjean/rfrench repo obviously suits your team better than Weblate. You are also right that when translating vignettes, someone has to ensure that the result is still a valid vignette. Weblate makes it possible for people without version control experience to do the bulk of the work, at the cost of requiring the translation manager to integrate the fruits of their labor. This is also a valid trade-off.

> it does not help much when translation is done partly in Weblate and partly in poEdit + GitHub, with possible duplicated work by totally unrelated people

If you'd like to continue this discussion, could you please elaborate? I think that as long as you merge your Git tree and Weblate's Git tree regularly, you should get the same experience as you usually get when multiple people work on the same project using Git. Accidentally translating the same string in two different ways is a possibility, but conflicts are always possible in decentralised version control.

DeepL is supported in Weblate. https://translate.rx.studio/ has Microsoft Translator suggestions enabled, but not DeepL. https://translate.codeberg.org/ has neither, probably due to their strict adherence to the free software principles.

tdhock commented 2 months ago

I read some weblate docs about support for a glossary https://docs.weblate.org/en/latest/user/glossary.html#glossary "Terms from the glossary containing words from the currently translated string are displayed in the sidebar of the translation editor"

I wonder if that would be useful for us? I see there is a glossary on the base R weblate, https://translate.rx.studio/projects/r-project/glossary/ but I'm not sure how it works / why it is useful.

The base R Brazil team made a glossary on their wiki https://contributor.r-project.org/translations/Conventions_for_Languages/Brazilian%E2%80%90Portugese-specific-translations.html#glossary

phgrosjean commented 2 months ago

OK, thanks for these details. As I understand it, with Weblate, there will be someone responsible for integrating the translated vignettes, who makes sure the translated versions compile and produce the correct results. And that person is probably different from the one who translated the strings. This is a different philosophy from {rmdpo}, which puts that responsibility on the translators themselves.

Hopefully, we will have the choice between the two approaches, but I am not sure, because the R Core Team may decide to go only with Weblate in the future; cf. this quotation from Martin Maechler in an email of 2024-04-10 addressed to various R translators:

"Yes, we have been in a phase of transition for more than a year, and I guess it's still the case currently, but it looks that the new weblate-/community-based approach has become (relatively) stable and is being used for most languages/translations now, so the traditional approach is no longer more efficient (than the new one) for R-core ... given that we we have the whole bundle from the weblate community site anyway."

I will probably have to adapt... but I am not happy with this.

phgrosjean commented 2 months ago

@tdhock Glossary is very useful. We have one for French here : https://github.com/phgrosjean/rfrench/blob/main/RFrenchDictionary.txt

ChristianWia commented 2 months ago

Related to https://github.com/Rdatatable/data.table/issues/6370#issuecomment-2298431774 above: the advantage of Weblate for vignettes is that we use translations drawn from a common pool covering R product terminology, with control over strings (syntactic by the tool, validity by the community) and cross-use checks among different files, things that Poedit does not provide. Input .pot files imported into Weblate would be produced by a script (script1) analysing the English .Rmd, and the translated .po files would be exported from Weblate and recombined by a second script (script2) to produce the translated .Rmd file.
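To make the script1/script2 idea concrete, here is a deliberately simplified Python sketch. It is not how {rmdpo} or mdpo work internally: it treats every paragraph outside a fenced chunk as one translatable string, ignores the YAML header and inline code, and uses a plain dictionary in place of real .pot/.po files. All function names are invented.

```python
FENCE = "`" * 3  # chunk delimiter, built this way to keep the example fence-safe


def split_blocks(body):
    """Split text into ("text" | "code" | "blank", content) blocks.

    Joining the contents with newlines reproduces the input exactly.
    """
    blocks, current, in_code = [], [], False

    def flush(kind):
        if current:
            blocks.append((kind, "\n".join(current)))
            current.clear()

    for line in body.split("\n"):
        if line.lstrip().startswith(FENCE):
            if in_code:            # closing fence ends the chunk
                current.append(line)
                flush("code")
                in_code = False
            else:                  # opening fence ends any pending paragraph
                flush("text")
                current.append(line)
                in_code = True
        elif in_code:
            current.append(line)
        elif line.strip() == "":
            flush("text")
            blocks.append(("blank", line))
        else:
            current.append(line)
    flush("code" if in_code else "text")
    return blocks


def extract_msgids(body):
    """script1 (sketch): list the paragraphs that would be offered for translation."""
    return [text for kind, text in split_blocks(body) if kind == "text"]


def recombine(body, translations):
    """script2 (sketch): rebuild the document, swapping in available translations."""
    return "\n".join(
        translations.get(text, text) if kind == "text" else text
        for kind, text in split_blocks(body)
    )
```

Code chunks pass through unchanged, and untranslated paragraphs fall back to the English source, so a partially translated vignette still has the same structure as the original.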

rffontenelle commented 2 months ago

> I read some weblate docs about support for a glossary https://docs.weblate.org/en/latest/user/glossary.html#glossary "Terms from the glossary containing words from the currently translated string are displayed in the sidebar of the translation editor"
>
> I wonder if that would be useful for us? I see there is a glossary on the base R weblate, https://translate.rx.studio/projects/r-project/glossary/ but I'm not sure how it works / why it is useful.
>
> The base R Brazil team made a glossary on their wiki https://contributor.r-project.org/translations/Conventions_for_Languages/Brazilian%E2%80%90Portugese-specific-translations.html#glossary

@tdhock The advantage of the in-app glossary is that translation flows better when you can see the glossary terms while navigating and translating strings. In Weblate, the terms that have a glossary entry are highlighted in the source string, which is great for drawing the translator's attention and keeping the project consistent.

I don't know the history of the glossary on the Brazilian Portuguese wiki page, but I certainly used it to populate my glossary when translating data.table, and I certainly added entries to R's Weblate instance as well.

FWIW, there is an effort in R to improve the consistency of glossary entries across languages.

daroczig commented 2 months ago

Sorry for the delay with my replies, and also for any duplication in my responses of points that were already covered -- I was just reading back through this busy thread :)


> https://github.com/SciViews/rmdpo

This looks awesome; thanks for sharing, @phgrosjean! It's using standard PO files (as defined by GNU gettext), right? The GitHub repo description says it converts to "poEdit files".

> as the French translator who uses poEdit (it is linked to DeepL, which greatly helps with the translations)

We have configured MS Translator for optional auto-translations in Weblate, which is free for up to 2M characters/month.

There are many other ML tools that we could also integrate if there's interest: https://docs.weblate.org/en/latest/admin/machine.html (including DeepL)

> how do I merge my .po/.mo files from poEdit with what is done in Weblate

Weblate stores all the PO files in a git repo, as @aitap pointed out, but I think if you want to work totally independent from weblate, that's fine too: feel free to create PRs in the data.table repo, and once that's merged to main, weblate will automatically pull that in.

> I am not happy to move to Weblate because I see no connection with DeepL (nowadays, all professional translators use something like DeepL to pre-translate and then rework that first draft; this is possible with poEdit Pro, but what about Weblate and DeepL?)

@phgrosjean if that's your only concern, see above -- there are many AI tools that can be integrated in weblate, as @rffontenelle also pointed out.

> translation is partly done in Weblate and partly in poEdit

@phgrosjean again, please see above -- IMO folks can do whatever workflow they wish, as weblate is not enforcing anything on you .. if you submit direct PRs to the data.table repo, and the maintainers are happy with that, weblate will silently pull that in

> In Weblate, the translator is disconnected from the context

@phgrosjean weblate (actually PO files) has pretty good support for sharing context, see e.g. how it's set up for the strings extracted from the C files of base R:

[Screenshot]

(link to the above screenshot: https://translate.rx.studio/translate/r-project/base-r-gui/hu/?q=state:empty#machinery)

weblate users can click on the link and check the sources if needed.

> and it is not easy to test if the translated vignette even compiles without error

I agree, although we can set up extra/custom checks in weblate .. I am not sure if we have the resources to do full Rmd compilation with preview etc


IMO the great advantages of weblate are

  1. the ease of use for users,
  2. standardization for the language team leads (e.g. glossary, automated quality checks, optional review system),
  3. and less work for the core dev team.

But I don't want anyone to feel pressured to use weblate -- the site was set up as an optional tool to help folks translate, while not interfering with or limiting others who prefer an alternative tool.

Regarding the original question: I think agreeing on and writing up the requirements (e.g. auto translate, collaboration, review system, offline usage etc) for the ideal translation tool might be the best first step here .. and then checking which software/workflow gets closest.

rffontenelle commented 2 months ago

To add to the awesome message above from @daroczig, it is possible to upload a translation file to Weblate. So even if the use of Weblate is enforced by a project's maintainers, one can use one's preferred translation editor instead of Weblate's web interface.

phgrosjean commented 2 months ago

Awesome, thanks to @daroczig and @rffontenelle for all these explanations. Obviously, I have to learn a little more about the details of this system.

MichaelChirico commented 2 months ago

Adding again to the above with my position as {data.table} Committer:

All that matters to me is the .po file(s).

Weblate is just one way of producing those. There are many others (I'd be remiss to omit {potools}!). If one language's translator(s) prefer a different system, that's great.

Shifting to my position as R-devel Weblate admin, I see advantages to it, mainly stemming from the existing buy-in in the R ecosystem. There is already some investment in glossary/wiki pages for a broad selection of languages that can be re-used to benefit translators of package messages. Many of the translators working on {data.table} now have also contributed to R-devel translation through Weblate, so making that available will also reduce overhead to providing package translations -- less context switching.

rikivillalba commented 2 months ago

> To add to the awesome message above from @daroczig, it is possible to upload a translation file to Weblate. So even if the use of Weblate is enforced by a project's maintainers, one can use one's preferred translation editor instead of Weblate's web interface.

As I do with some Spanish translations for R! Weblate will even ask you whether you want to combine the uploaded translations with pre-existing translations, add them as fuzzy (flagged as needing review), etc.

To spare Michael's modesty, I must say potools is a great package too. I've used potools:::get_po_messages() with success to generate a data.table out of .po/.pot files, translated them in an external tool, then used potools::write_po_file() to generate a translation. (get_po_messages is not exported; perhaps there is a better method)

MichaelChirico commented 2 months ago

> get_po_messages is not exported; perhaps there is a better method

I really need to export that... https://github.com/MichaelChirico/potools/issues/315

tdhock commented 1 month ago

a bit off topic but related: there is an effort to translate torch-related R package man pages to French, and the approach is making separate packages, such as https://github.com/cregouby/torchvision.fr

phgrosjean commented 1 month ago

This is the translation of the man pages. This is different than the translation of errors, warnings and other messages with gettext(). It uses the still experimental rhelpi18n package. Also, it is not a solution yet for vignettes.

aitap commented 1 month ago

In my opinion, translated man pages could be even more important than messages. An un-translated error message can at least be searched on the Web; a translated one might still require insider knowledge to understand ("I thought plonking has to do with USENET? Is there a woodchuck in data.table too?") and is much less likely to be found.

A "data.table." package could indeed just import(data.table); export(every_function_from_data_table) in its NAMESPACE and provide translated man/ and vignettes/ without an R directory at all, but the infrastructure to produce these translated files and keep them up to date is currently an open question. It would be much more convenient if R could look up translated help pages by itself. I wish @eliocamp's translation modules every bit of success and hope they get integrated into base R in one way or another.

phgrosjean commented 1 month ago

@aitap I agree!