Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

File names for translated vignettes? #6221

Open tdhock opened 2 months ago

tdhock commented 2 months ago

Hi! @Rdatatable/french are planning to translate vignettes to French. What should the file names for the translated vignettes be?

In the existing directory:

datatable-intro.Rmd
datatable-intro-fr.Rmd
datatable-intro-de.Rmd
datatable-intro-pl.Rmd
...

or:

vignettes/fr/datatable-intro.Rmd ...
vignettes/de/datatable-intro.Rmd ...
vignettes/pl/datatable-intro.Rmd ...

or:

vignettes/po/fr/datatable-intro.Rmd ...
vignettes/po/de/datatable-intro.Rmd ...
vignettes/po/pl/datatable-intro.Rmd ...

or: other ???

My tendency would be to do it in the existing directory. I'm not sure if sub-directories are possible?
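
To make the options above easier to compare, here is a minimal sketch of how the language could be recovered from a vignette path under either the suffix layout or a subdirectory layout. The helper name `vignette_lang()`, the restriction to two-letter codes and the default of "en" for untagged files are assumptions for illustration, not anything data.table defines.

```r
# Hypothetical helper: guess the language of a vignette from its path.
# Assumes two-letter codes, given either as a "-xx" file name suffix
# or as the name of the parent directory.
vignette_lang <- function(path, default = "en") {
  stem <- sub("\\.Rmd$", "", basename(path))
  dir <- basename(dirname(path))
  if (grepl("-[a-z]{2}$", stem)) {
    # suffix layout: datatable-intro-fr.Rmd
    sub(".*-([a-z]{2})$", "\\1", stem)
  } else if (grepl("^[a-z]{2}$", dir)) {
    # subdirectory layout: vignettes/fr/datatable-intro.Rmd
    dir
  } else {
    default
  }
}

vignette_lang("vignettes/datatable-intro-fr.Rmd")
vignette_lang("vignettes/po/fr/datatable-intro.Rmd")
vignette_lang("vignettes/datatable-intro.Rmd")
```

One caveat with the suffix layout: an English vignette whose name happens to end in a hyphen plus two letters would be misread as a translation, which is less likely with subdirectories.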

ChristianWia commented 2 months ago

The advantage would be having the locale directly in the file name.

MichaelChirico commented 2 months ago

@eliocamp is this something in scope for the documentation working group? Are there any preliminary recommendations you can suggest here?

eliocamp commented 2 months ago

Yes, it is very much in scope. No, I have no recommendations yet 🥲.

phgrosjean commented 2 months ago

Perhaps we should also test the various options with {pkgdown} to see which one presents better. With everything in the existing directory, I am afraid we would end up with a long list of vignettes.

leofontenelle commented 2 months ago

On Linux, it's usually something like help/appname/(C|fr|zh_CN|pt_BR)/the actual documentation. Examples: LibreOffice documentation, GNOME user documentation, a KDE app complete with its own documentation, GNU/Linux man pages. For the C-locale man pages, there's no C, en or en-US directory.

If there were some way to make .pot files out of vignettes, I guess each vignette would be its own domain, all .po files would be in po/ in the source code, and the directory structure above would be created at compile time.

ChristianWia commented 2 months ago

Maybe a solution could be built around the Python package mdpo (pip install mdpo), which provides md2po and the reverse po2md to recover the translated vignette. I still have to investigate how much information we lose.

Commands:

md2po datatable-intro-fr.Rmd --quiet --save --po-filepath e:/datatable-intro-fr.po
po2md e:/datatable-intro-fr.Rmd --pofiles e:/datatable-intro-fr.po --save e:/datatable-intro-fr2.Rmd

Issue: https://github.com/mondeja/md-ulb-pwrap/issues/7

My ongoing tests: https://github.com/ChristianWia/vignettes
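
One rough way to quantify "how much information we lose" is to compare the original vignette with the round-tripped one line by line. The sketch below is only an illustration: `roundtrip_loss()` is a hypothetical helper, and because po2md may rewrap paragraphs, a line-based comparison like this one will overstate the real loss.

```r
# Hypothetical check: which lines of the original vignette did not survive
# an md2po -> po2md round trip? Leading/trailing whitespace and empty lines
# are ignored.
roundtrip_loss <- function(original, roundtrip) {
  norm <- function(x) {
    x <- trimws(x)
    x[nzchar(x)]
  }
  orig <- norm(original)
  rt <- norm(roundtrip)
  lost <- setdiff(orig, rt)  # distinct original lines missing afterwards
  list(lost = lost,
       share_lost = length(lost) / max(1, length(unique(orig))))
}

# Usage with the files from the commands above (paths as in this comment):
# res <- roundtrip_loss(readLines("e:/datatable-intro-fr.Rmd"),
#                       readLines("e:/datatable-intro-fr2.Rmd"))
# res$lost
```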

eliocamp commented 2 months ago

In our work to get multilingual documentation, we are thinking that the translated documentation would live in its own translation module. So the French helpfiles for data.table would be in a package called data.table.fr (or whatever, the name is not important) and any user who would like to see the documentation in that language would install it. The idea is that this would decouple translations from the original package and users won't need to get translations in a language they don't need.

I think that model might also work for vignettes.

phgrosjean commented 2 months ago

OK, I see. Then, in the meantime, we could place the French vignette translations in an 'fr' subdirectory. Once that new mechanism is available, we could easily create the corresponding repository and transfer these files. Two questions to @eliocamp:

  1. {data.table.fr} should be versioned, right? There should be a version for each version of the original {data.table} package. Otherwise, there is a risk of a wrong man page or vignette. This multiplies the maintenance work on many packages, but it is currently the case for the translation teams anyway. What happens when there is no version concordance between {data.table} and {data.table.fr}? A fallback to the closest available version and a warning on the top of the translated man page, or what?

  2. Should the end user call library(data.table.fr) instead of library(data.table)? (This could be a problem with code shared in a multinational context.) Or is {data.table.fr} detected and used automatically by {data.table}, depending on something like Sys.getenv("LANG")?

eliocamp commented 2 months ago

  1. For documentation, there's no need to version the translation module with the original package because the string replacement is done based on strings and the structure of the Rd file. So as long as the documentation doesn't change, the translation stays current and useful. Still, we might want to state which version is being translated (with a special field in the DESCRIPTION). I don't know if vignettes can be translated using this system exactly, but the translation module version shouldn't be tied to the original package version (the package version could change without meaningful changes in documentation, and the translation module might be updated independently).

  2. The latter. The user never has to load the translation module directly.
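
To illustrate what automatic detection based on something like Sys.getenv("LANG") could look like, here is a minimal sketch. It is not how the working group's tooling is implemented: `translation_module()` is a hypothetical helper, the `pkg.lang` naming follows the data.table.fr example from this thread, and treating English as the original language is an assumption.

```r
# Hypothetical: map a LANG value such as "fr_FR.UTF-8" to the name of a
# translation module, or NULL when no translation is requested.
translation_module <- function(pkg, lang = Sys.getenv("LANG")) {
  # strip encoding and modifier: "fr_FR.UTF-8@euro" -> "fr_FR"
  locale <- sub("[.@].*$", "", lang)
  if (locale %in% c("", "C", "POSIX")) return(NULL)
  # keep only the language part: "fr_FR" -> "fr"
  code <- tolower(sub("_.*$", "", locale))
  if (code == "en") return(NULL)  # assumed: the original docs are English
  paste(pkg, code, sep = ".")
}

# The user never loads the module; the caller only checks whether it exists:
# mod <- translation_module("data.table")
# if (!is.null(mod) && requireNamespace(mod, quietly = TRUE)) { ... }
```

LANG may be unset on some platforms, in which case this sketch simply falls back to the untranslated documentation.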

MichaelChirico commented 1 month ago

As long as it actually works, I agree that for now putting it in subdirectories is the way to go. Please pass along any learnings from this process to the R documentation working group team. @eliocamp, would https://github.com/RConsortium/multilingual-documentation-wg or https://github.com/eliocamp/rhelpi18n be the better place? (I assume the former.)

eliocamp commented 1 month ago

Yes, I think these high-level discussions are better had in https://github.com/RConsortium/multilingual-documentation-wg

ChristianWia commented 1 month ago

Impact of LOCALE on vignette translation

If we agree the translated vignette is a clone of the EN one (same YAML, same skeleton), several elements should be considered during translation.

1 Vignettes not using common resources: no problem. Only the .Rmd needs translating, and the path to the vignette is free (once the directories are defined).

2 Vignettes using common resources, e.g. datatable-sd-usage.Rmd, which uses the directories ./css and ./plots. In this case the translated vignette should sit alongside the others to benefit from the same directory structure.

2.1 Access to CSS: the CSS is shared between vignettes to get the menu and section numbering (today). It is independent of the locale, but perhaps not always, if we consider LTR and RTL scripts.

2.2 Access to images and other media:

2.2.1 Media not relying on the locale: this is the case for images without English text. No problem; keep the existing English transclusions.

2.2.2 Media depending on the locale: this is the case for schemes, architectures, interfaces, flows, spreadsheets... containing English text. Moreover, if the .Rmd describes what is in the image, the two must be coherent.

2.2.2.1 The .Rmd does not describe the image: no problem, keep the English image.

2.2.2.2 The .Rmd describes the image:

2.2.2.2.1 Either we keep the English image: in this case the translated text should use the English terms for coherence.

2.2.2.2.2 Or we create a locale image (*): in this case the .Rmd should use the locale terms of the translated image for coherence.

(*) I think it is possible to modify the contents of an .svg to translate the displayed text (to be investigated).
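
As a starting point for that investigation, here is a minimal sketch of translating the displayed text of an .svg with base R only. `translate_svg_text()` and the dictionary are hypothetical, and the approach assumes each label sits as plain text directly between a ">" and a "<" on one line; labels split across tspan elements or containing XML entities would need a proper XML parser.

```r
# Hypothetical: replace English labels in the lines of an SVG file.
# `dictionary` is a named character vector: names are the English labels,
# values are the translations.
translate_svg_text <- function(svg_lines, dictionary) {
  for (en in names(dictionary)) {
    # only touch element content (">label<"), never attribute values
    svg_lines <- gsub(paste0(">", en, "<"),
                      paste0(">", dictionary[[en]], "<"),
                      svg_lines, fixed = TRUE)
  }
  svg_lines
}

svg <- c('<svg xmlns="http://www.w3.org/2000/svg">',
         '<text x="10" y="20">Group</text>',
         '</svg>')
translate_svg_text(svg, c(Group = "Groupe"))
```

The translated lines could then be written next to the original image with writeLines(), under whatever file naming the directory discussion settles on.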

aitap commented 2 days ago

Part of the appeal of vignettes is that they are already part of the package and accessible offline.

Some very approximate testing shows that enabling the French vignettes (#6455) to render adds more than 30% to the R CMD build time (which may not be that much of a problem because make build skips the vignettes) and less than 15% to the R CMD check time (which takes away from the gains for #6400). The absolute increase for R CMD check is larger because it both weaves the vignettes and re-runs the tangled scripts, but not twice as much because not all vignettes have code in them. The relative increase for R CMD build will be much more dramatic with MAKEFLAGS=-j$(nproc).

[Figure: french-vignettes]

Caching the R results in the English vignettes for reuse in the translations would be hard to implement and is unlikely to help much: time is also spent inside knitr, rmarkdown and pandoc.

If the translated vignettes are included in the data.table package, the sorting order of the vignettes may also become a problem. (Should the translation be sorted near the original? Should the vignettes in the same language be sorted together?) Without \VignetteIndex, vignette files would have to be renamed to achieve the desired order.
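
The two sorting orders in question can be sketched with plain file names. This is only an illustration of the choice, assuming the "-fr" suffix scheme; it does not reproduce how R itself orders the vignette index.

```r
# File names under the "-fr" suffix scheme (the -fr names are hypothetical)
files <- c("datatable-intro.Rmd", "datatable-sd-usage.Rmd",
           "datatable-intro-fr.Rmd", "datatable-sd-usage-fr.Rmd")
stem <- sub("\\.Rmd$", "", files)
lang <- ifelse(grepl("-fr$", stem), "fr", "en")
base <- sub("-fr$", "", stem)

# Option A: each translation sorted next to its original
by_topic <- files[order(base, lang, method = "radix")]

# Option B: all vignettes of one language sorted together
by_lang <- files[order(lang, base, method = "radix")]

by_topic
by_lang
```

`method = "radix"` is used so that the result does not depend on the collation rules of the current locale.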

Raw script:

```r
# this was without OPENBLAS_NUM_THREADS=1, so the CPU load is above normal
# most of the process doesn't use linear algebra anyway
elapsed <- \(s) {
  s <- regmatches(s, gregexec('(\\d+):([\\d.]+)elapsed', s, perl = TRUE))[[1]][2:3,]
  as.numeric(s[1,])*60 + as.numeric(s[2,])
}
d <- rbind(
  cbind(kind = 'vignettes/fr/*', rbind(
    data.frame(process = 'build', time = elapsed('
52.84user 14.42system 0:38.99elapsed 172%CPU (0avgtext+0avgdata 1270424maxresident)k
53.32user 14.27system 0:38.61elapsed 175%CPU (0avgtext+0avgdata 1270228maxresident)k
53.36user 14.05system 0:38.53elapsed 174%CPU (0avgtext+0avgdata 1270548maxresident)k
')),
    data.frame(process = 'check', time = elapsed('
138.89user 44.84system 2:06.32elapsed 145%CPU (0avgtext+0avgdata 1270412maxresident)k
137.14user 44.95system 2:04.29elapsed 146%CPU (0avgtext+0avgdata 1270192maxresident)k
143.14user 48.11system 2:10.45elapsed 146%CPU (0avgtext+0avgdata 1270584maxresident)k
'))
  )),
  cbind(kind = 'vignettes/*-fr*', rbind(
    data.frame(process = 'build', time = elapsed('
82.31user 23.16system 0:51.61elapsed 204%CPU (0avgtext+0avgdata 1313952maxresident)k
82.02user 23.47system 0:51.31elapsed 205%CPU (0avgtext+0avgdata 1314172maxresident)k
81.63user 23.42system 0:51.24elapsed 205%CPU (0avgtext+0avgdata 1313888maxresident)k
')),
    data.frame(process = 'check', time = elapsed('
171.63user 59.06system 2:21.92elapsed 162%CPU (0avgtext+0avgdata 1313928maxresident)k
173.07user 63.13system 2:23.48elapsed 164%CPU (0avgtext+0avgdata 1313184maxresident)k
175.47user 58.57system 2:26.24elapsed 160%CPU (0avgtext+0avgdata 1313056maxresident)k
'))
  ))
)
lattice::barchart(time ~ kind | process, aggregate(time ~ kind + process, d, mean), ylim = c(0, max(d$time)))
```

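
The percentages quoted earlier in this comment can be re-derived from the elapsed times in the raw script. The short check below copies those elapsed seconds by hand and presumes that 'vignettes/fr/*' is the layout where the French vignettes are not rendered and 'vignettes/*-fr*' the one where they are; that reading is an assumption.

```r
# Elapsed seconds copied from the timings in the raw script
build <- c(subdir = mean(c(38.99, 38.61, 38.53)),
           suffix = mean(c(51.61, 51.31, 51.24)))
check <- c(subdir = mean(c(126.32, 124.29, 130.45)),
           suffix = mean(c(141.92, 143.48, 146.24)))

# Relative increase of the "-fr suffix" layout over the subdirectory layout
rel <- c(build = build[["suffix"]] / build[["subdir"]] - 1,
         check = check[["suffix"]] / check[["subdir"]] - 1)
round(100 * rel, 1)  # roughly 33% for build and 13% for check
```
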
phgrosjean commented 2 days ago

It would be great if translations had minimal impact on package check/compilation/installation. If we adopt a convention that the -lang- translation of the vignettes lives in a Rdatatable/data.table.-lang- GitHub repo, and that these repos mainly serve to compile a localized {pkgdown} site, it is relatively easy to compute links to the corresponding pages in the original vignettes. Also, a link back to the original English vignettes can be added in the translations.

Installing the {data.table.-lang-} package is not necessary, unless users want offline versions of the vignettes.

It seems to me to be a relatively simple solution to this problem for now.
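
Computing the link back to the original vignette could be as small as the sketch below. `original_url()` is a hypothetical helper; it assumes that the translated file keeps the name of the original and that the English {pkgdown} site serves vignettes under articles/, with the base URL passed in as a parameter.

```r
# Hypothetical: URL of the original English article for a translated vignette.
# Assumes identical file names and the usual pkgdown articles/ directory.
original_url <- function(vignette,
                         base = "https://rdatatable.gitlab.io/data.table") {
  page <- sub("\\.Rmd$", "", basename(vignette))
  paste0(base, "/articles/", page, ".html")
}

original_url("vignettes/datatable-intro.Rmd")
```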