Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Make plural alternatives as plural-forms messages #6194

Closed rffontenelle closed 2 days ago

rffontenelle commented 2 months ago

Some messages embed plural alternatives, e.g. "column(s)" or "column%s" (where %s can be empty "" or "s"), and this makes it harder for translators to translate the message properly.

Some languages have plural-form rules far more complex than English, e.g. while English alternates between singular and plural, Slovenian has 4 plural forms. Even for languages that have the same plural-form rule as English, appending multiple "(s)" to denote plural alternatives makes the message harder to read, especially for screen readers used by people with visual impairments.

A better solution is to split such messages into plural forms using ngettext.
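To illustrate the idea at C level, here is a minimal sketch, not code taken from the data.table sources. The `Pl_` macro, the `format_dropped_columns` helper, and the message text are hypothetical names chosen for the example; `ENABLE_NLS` and the "data.table" domain are assumptions about how the build is configured. Without NLS the macro falls back to the English rule so the snippet compiles on its own.

```c
#include <stdio.h>

/* Assumption: ENABLE_NLS and the "data.table" domain are illustrative.
   Without NLS, fall back to the English rule so the sketch runs anywhere. */
#ifdef ENABLE_NLS
#  include <libintl.h>
#  define Pl_(n, singular, plural) \
       dngettext("data.table", (singular), (plural), (unsigned long)(n))
#else
#  define Pl_(n, singular, plural) ((n) == 1 ? (singular) : (plural))
#endif

/* Hypothetical helper: writes a message about n dropped columns into buf. */
int format_dropped_columns(char *buf, size_t size, int n) {
    /* Two complete msgids instead of a single "Dropped %d column(s)". */
    return snprintf(buf, size,
                    Pl_(n, "Dropped %d column", "Dropped %d columns"), n);
}
```

With this shape, each translation catalog can supply as many forms as its language needs, and the English source no longer carries "(s)".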

Some references: https://michaelchirico.github.io/potools/articles/developers.html#plurals and (a personal favorite) https://wiki.gnome.org/TranslationProject/DevGuidelines/Plurals

For me, as a Brazilian Portuguese translator, there are cases where I can work around the situation, but there are times when I need to add far more "(s)" than the original message has. Hence, using plural-forms messages would be much cleaner.

Here is a non-exhaustive list of messages I found in the source code with this problem:

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/src/assign.c#L394

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/src/rbindlist.c#L57

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/src/utils.c#L289

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/R/data.table.R#L721

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/R/data.table.R#L735

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/R/foverlaps.R#L157

https://github.com/Rdatatable/data.table/blob/f5a1e09328fef5f8a4c5e225bc5b75ca09700f76/R/print.data.table.R#L270-L273

MichaelChirico commented 2 months ago

Thanks for tracking these down! Yes please switch to ngettext(). So far they haven't been sussed out because Mandarin doesn't really have plural forms, so translating there is easy without using ngettext().

MichaelChirico commented 2 months ago

For the C messages, we don't have any ngettext() usage there yet, so please ensure we are properly set up to start doing plural translations there:

  1. Make sure the package compiles and can pass tests, obviously :)
  2. Make sure the recommended translation workflow is picking these new strings up

https://github.com/Rdatatable/data.table/blob/de0cf94b0ffb004cb3d6e21187b449f803809931/.dev/CRAN_Release.cmd#L11-L26

At a glance, I don't think it will -- I don't see ngettext() mentioned there.

aitap commented 4 weeks ago

A couple more in C code:

https://github.com/Rdatatable/data.table/blob/d2f6e1d97c880f97962763179a2b7c727f96cd59/src/fread.c#L1595

https://github.com/Rdatatable/data.table/blob/d2f6e1d97c880f97962763179a2b7c727f96cd59/src/fwrite.c#L955-L958

MichaelChirico commented 3 weeks ago

This is particularly important for ru translation, which has 3 plural forms (Romance languages can usually match English style pluralization, i.e. 2 plural forms, and Chinese has no pluralization).
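For readers unfamiliar with how the form is chosen, the rule lives in the catalog's Plural-Forms header and maps a count to a form index. A sketch of the two rules in plain C follows; the function names are hypothetical, and the Russian expression is the one commonly seen in ru.po headers rather than anything specific to data.table.

```c
/* English-style rule: index 0 for exactly one item, 1 otherwise. */
int plural_index_en(unsigned long n) {
    return n != 1;
}

/* Russian-style rule with three forms, as commonly written in the
   Plural-Forms header of ru.po files:
     0: 1, 21, 31, ...            (but not 11)
     1: 2-4, 22-24, ...           (but not 12-14)
     2: everything else (0, 5-20, 25-30, ...) */
int plural_index_ru(unsigned long n) {
    if (n % 10 == 1 && n % 100 != 11)
        return 0;
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1;
    return 2;
}
```

So a single "column%s" trick cannot cover Russian: 1, 2 and 5 columns each need a different word form, and 21 goes back to the form used for 1.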

rikivillalba commented 3 weeks ago

Another case here:

autoFirstColName = (tt==ncol-1);
DTWARN(_("Detected %d column names but the data has %d columns (i.e. invalid file). Added %d extra default column name%s\n"), tt, ncol, ncol-tt,
        autoFirstColName ? _(" for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.") : _("s at the end."));

Depending on whether autoFirstColName is nonzero, either the message is extended with the singular explanation, or "s at the end." is appended to "name".
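One possible way to restructure this, sketched under assumptions: the `_` and `Pl_` macros, the `format_colname_warning` helper, and the use of snprintf in place of DTWARN are all hypothetical stand-ins so the snippet compiles alone. The point is only that each branch becomes a complete sentence and the count-dependent branch goes through a plural lookup.

```c
#include <stdio.h>

/* Assumption: ENABLE_NLS and the "data.table" domain are illustrative. */
#ifdef ENABLE_NLS
#  include <libintl.h>
#  define _(s) dgettext("data.table", (s))
#  define Pl_(n, singular, plural) \
       dngettext("data.table", (singular), (plural), (unsigned long)(n))
#else
#  define _(s) (s)
#  define Pl_(n, singular, plural) ((n) == 1 ? (singular) : (plural))
#endif

/* Hypothetical helper standing in for the DTWARN call: tt is the number of
   detected column names, ncol the number of data columns. */
int format_colname_warning(char *buf, size_t size, int tt, int ncol) {
    int added = ncol - tt;
    if (tt == ncol - 1) {
        /* Exactly one name is missing: one full sentence, no suffix. */
        return snprintf(buf, size,
            _("Detected %d column names but the data has %d columns (i.e. invalid file). "
              "Added %d extra default column name for the first column which is guessed "
              "to be row names or an index."),
            tt, ncol, added);
    }
    /* Otherwise the translator picks the form that matches 'added'. */
    return snprintf(buf, size,
        Pl_(added,
            "Detected %d column names but the data has %d columns (i.e. invalid file). "
            "Added %d extra default column name at the end.",
            "Detected %d column names but the data has %d columns (i.e. invalid file). "
            "Added %d extra default column names at the end."),
        tt, ncol, added);
}
```

In English the second branch only ever shows the plural text, but languages with several plural forms still need the lookup. The leading "Detected %d column names" has its own count and is left as is in this sketch.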

MichaelChirico commented 1 week ago

Thanks for flagging these. I also find more complicated cases like:

https://github.com/Rdatatable/data.table/blob/97980d9a0795095e8b32d5bbf315da7e087e39fd/src/frollR.c#L178

I believe allowing fully proper translation of those will require nested ngettext(), with a pretty bad impact on code readability. Shall we just leave those as is for now? What do the translators think?
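To make the readability cost concrete, here is a sketch for a made-up message with two independent counts. It does not reproduce the frollR.c line; the message text, the `_`/`Pl_` macros and the helper name are hypothetical. It sidesteps nesting by pluralizing each noun phrase separately and then splicing the pieces into a frame, which is itself a form of sentence fragmentation and so may not translate cleanly everywhere.

```c
#include <stdio.h>

/* Assumption: ENABLE_NLS and the "data.table" domain are illustrative. */
#ifdef ENABLE_NLS
#  include <libintl.h>
#  define _(s) dgettext("data.table", (s))
#  define Pl_(n, singular, plural) \
       dngettext("data.table", (singular), (plural), (unsigned long)(n))
#else
#  define _(s) (s)
#  define Pl_(n, singular, plural) ((n) == 1 ? (singular) : (plural))
#endif

/* Hypothetical helper: one message, two counts that pluralize independently. */
int format_cols_windows(char *buf, size_t size, int ncols, int nwindows) {
    char cols[64], wins[64];
    /* Each counted phrase gets its own plural lookup... */
    snprintf(cols, sizeof cols, Pl_(ncols, "%d column", "%d columns"), ncols);
    snprintf(wins, sizeof wins, Pl_(nwindows, "%d window", "%d windows"), nwindows);
    /* ...and the frame sentence is translated separately. */
    return snprintf(buf, size, _("Processing %s and %s"), cols, wins);
}
```

Three translatable units and two scratch buffers for one short line of output gives a sense of why leaving such messages alone is tempting.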

rikivillalba commented 1 week ago

https://github.com/Rdatatable/data.table/blob/97980d9a0795095e8b32d5bbf315da7e087e39fd/src/dogroups.c#L449

There are two issues here:

1 - gettext tools like msgmerge complain about \r as "non portable"
2 - %d groups can be 1 group or n groups?

MichaelChirico commented 1 week ago

> 1 - gettext tools like msgmerge complain about \r as "non portable"

I thought that might be the case, but that message is in .pot/.po already with no reported issue*:

grep -Er "[^\\][\\]r" po/*.pot
# po/data.table.pot:"\rProcessed %d groups out of %d. %.0f%% done. Time elapsed: %ds. ETA: %ds."
# po/data.table.pot:"\rProcessed %d groups out of %d. %.0f%% done. Time elapsed: %ds. ETA: %ds.\n"

Should combine those messages too...

*that might not be true, this message is pretty new...

aitap commented 1 week ago

A related problem (should I create a new issue?) is applying grammatical case to nouns. For example, here "target vector" would be "целевой вектор", but later, in "Assigning factor numbers to %s", "target vector" has to be inflected: "Присваиваю фактор целевому вектору". Not that big of a deal; I think that all uses of targetDesc in assign.c are in the dative case and so can be translated pre-inflected, but I'm not sure this would remain the case for Arabic and Hindi.

Also a problem might be concatenation of messages in fread.c [1, 2], but it doesn't pose a problem for me. Are there languages with very strict word order that would break this sentence? This, on the other hand, is already slightly broken because msg is a full sentence.

rikivillalba commented 1 week ago

> I believe allowing fully proper translation of those will require nested ngettext(), with a pretty bad impact on code readability. Shall we just leave those as is for now? What do the translators think?

I agree. IMHO such a pursuit of grammatical correctness must be secondary to simplicity. Also, it complicates translation. On the other hand, some messages could be put in a form that does not depend on grammatical number, i.e. instead of "1 item / %d items" one could say "(affected items: %d)", when possible. Sometimes that could be done in the translations directly.
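A small sketch of that number-independent wording, for comparison. The `_` macro, the helper name and the exact message are hypothetical; `ENABLE_NLS` and the "data.table" domain are assumptions about the build.

```c
#include <stdio.h>

/* Assumption: ENABLE_NLS and the "data.table" domain are illustrative. */
#ifdef ENABLE_NLS
#  include <libintl.h>
#  define _(s) dgettext("data.table", (s))
#else
#  define _(s) (s)
#endif

/* Hypothetical helper: the count follows a label, so the noun never has to
   agree with it and a single msgid is enough. */
int format_affected_items(char *buf, size_t size, int n) {
    return snprintf(buf, size, _("Affected items: %d"), n);
}
```

The source keeps one plain gettext call, and the label can be worded in each language so that it reads naturally with any count.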

MichaelChirico commented 1 week ago

@aitap let's open a separate issue for the fragmented sentences you've identified. indeed languages like Turkish and Japanese often have inverted sentence order that makes translating fragments pretty awkward.

MichaelChirico commented 1 week ago

> > I believe allowing fully proper translation of those will require nested ngettext(), with a pretty bad impact on code readability. Shall we just leave those as is for now? What do the translators think?
>
> I agree. IMHO such a pursuit of grammatical correctness must be secondary to simplicity. Also, it complicates translation. On the other hand, some messages could be put in a form that does not depend on grammatical number, i.e. instead of "1 item / %d items" one could say "(affected items: %d)", when possible. Sometimes that could be done in the translations directly.

Sounds great. Yes, let's do that on an individual basis where seen fit then.