Multiple small issues with messages in the C code

aitap commented 2 months ago

This is a continuation of #6482. I think I won't find any more issues in the C code messages. Some of these are questions, a few indicate real translation hurdles.

[x] This one is actually possible for me to translate in parts (so we can keep it as is if needed), but I don't like translating individual words and relying on the rest of the code not to need a different translation: https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/fread.c#L1933-L1934 Translation in parts could be done reliably with pgettext, which seems to be currently unavailable even to C code in R's built-in copy of gettext on Windows.
[x] This one is also translatable in context, but isn't too painful to recombine into a single sentence: https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/fsort.c#L255-L259
[x] This one is very awkward to translate, and we're again relying on _("no") not to mean something else in a different part of the code: https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/fread.c#L2038 Unfortunately, expanding this sentence will duplicate the full line of the code.

The rest are questions unrelated to translations:

[x] I think we shouldn't be calling it ASCII if its / is not immediately before 0. How about "The character '/' is not just before '0' in the source character set"? https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/init.c#L225 https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/init.c#L227
[x] This one probably used to be correct in the past, but now should refer to a different variable, .Last.updated: https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/init.c#L367
[ ] This message is technically correct because the code does expect only symbols at this point, but data.table:::list2lang converts character strings to symbols. Would it be more helpful to mention character strings in the error message too? https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/programming.c#L16
[x] Should this mention the issue tracker instead? https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/fmelt.c#L805
[x] Is LENGTH(.) < 0 something that's possible in R ≥ 3.3.0? https://github.com/Rdatatable/data.table/blob/944340901d5b37cfdac43c9cc5b8695c5c12f5a0/src/fmelt.c#L66

(Checkboxes indicate either no action needed or a corresponding fix being suggested in #6504)

tdhock commented 2 months ago

Should this mention the issue tracker instead?

yes

MichaelChirico commented 2 months ago

This one is actually possible for me to translate in parts (so we can keep it as is if needed)

I guess you're referring to few/many, not the jump>0 complete sentence added at the end. If so: yes, I think we should branch it.

I think we shouldn't be calling it ASCII if its / is not immediately before 0. How about "The character '/' is not just before '0' in the source character set"?

Hmm, good point, though I worry "source character set" is a highly technical term --> relatively tough to translate. WDYT about "Unlike the very common case, e.g. ASCII, the character '/' is not just before '0'".

Would it be more helpful to mention character strings in the error message too?

Ping @jangorecki who has the best context here

Is LENGTH(.) < 0 something that's possible in R ≥ 3.3.0?

Good spot. Has it ever been possible? My best guess was this was a somewhat half-baked fix here:

https://github.com/Rdatatable/data.table/commit/437dc39290d9010847bac8bee1000f94d01161a1

Author fixed the length=0 case to work and changed the message to adapt without stopping to think "length<0 is not possible" instead of "how do I adapt x<=0 | x>0 to include x=0 valid? x<0 | x>=0".

So I think we can just drop that condition and the corresponding part of the message.

aitap commented 2 months ago

I guess you're referring to few/many, not the jump>0 complete sentence added at the end. If so: yes, I think we should branch it.

Yes, I mean the part about few/many. Could you please clarify what you meant by "branch it"?

I worry "source character set" is a highly technical term --> relatively tough to translate. WDYT about "Unlike the very common case, e.g. ASCII, the character '/' is not just before '0'".

Agreed, this phrasing sounds fine.
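For reference, a minimal sketch of a check that would carry the agreed wording. It assumes a stand-in `_()` macro instead of the real gettext call, and `check_slash_before_zero` is a hypothetical name, not the code in init.c:

```c
#include <stddef.h>

/* Stand-in for gettext's _() so the sketch compiles on its own. */
#define _(String) (String)

/* Returns NULL when '/' sits immediately before '0' (as in ASCII,
   where they are 47 and 48); otherwise returns the message to report.
   The message names ASCII only as the common case, not as a fact
   about the platform that failed the check. */
static const char *check_slash_before_zero(void)
{
    if ('/' + 1 != '0')
        return _("Unlike the very common case, e.g. ASCII, the character '/' is not just before '0'");
    return NULL;
}
```

A caller would pass a non-NULL result on to the error routine as-is.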

My best guess was this was a somewhat half-baked fix here:

437dc39

...which started out as a check for "opposite of length() > 0". Will remove the condition and adjust the message.

MichaelChirico commented 2 months ago

Could you please clarify what you meant by "branch it"?

thisNcol < ncol ? _("A line with too few fields...") : _("A line with too many fields...")
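A minimal sketch of that branching, assuming a stand-in `_()` macro and illustrative message wording (the real text in fread.c is longer). `field_count_format` and `field_count_message` are hypothetical names:

```c
#include <stdio.h>

/* Stand-in for gettext's _() so the sketch compiles on its own. */
#define _(String) (String)

/* Choose one of two complete sentences. Each is its own msgid, so
   neither depends on how a lone "few" or "many" is translated.
   Assumes the caller only gets here when thisNcol != ncol. */
static const char *field_count_format(int thisNcol, int ncol)
{
    return thisNcol < ncol
        ? _("A line with too few fields (%d/%d) was found.")
        : _("A line with too many fields (%d/%d) was found.");
}

/* Format the chosen sentence into out (at most n bytes). */
static void field_count_message(char *out, size_t n, int thisNcol, int ncol)
{
    snprintf(out, n, field_count_format(thisNcol, ncol), thisNcol, ncol);
}
```

The cost is one duplicated sentence in the source; the gain is that translators see each variant whole.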

jangorecki commented 2 months ago

No idea really, but I can imagine some hacks in setting up length to negative value, that could have been in play.

rikivillalba commented 2 months ago

some contribs

[x] should not be translated: https://github.com/Rdatatable/data.table/blob/3b376dab6092c67dead7197c858f450109cb519b/src/fread.c#L1343
[x] messages like this are often hard to translate https://github.com/Rdatatable/data.table/blob/3b376dab6092c67dead7197c858f450109cb519b/src/gsumm.c#L224-L225
[x] Better to write STOP(_("Internal error in %s: %s. Please report to the data.table issues tracker"), __func__, internalErr); https://github.com/Rdatatable/data.table/blob/3b376dab6092c67dead7197c858f450109cb519b/src/fread.c#L2627
[ ] part of above message, perhaps difficult to translate out of context. https://github.com/Rdatatable/data.table/blob/3b376dab6092c67dead7197c858f450109cb519b/src/fread.c#L1711
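
The internal-error suggestion above could look roughly like this. It is a sketch only: `_()` is a stand-in for gettext, and `internal_error_message` is a hypothetical helper that formats into a buffer instead of calling STOP:

```c
#include <stdio.h>

/* Stand-in for gettext's _() so the sketch compiles on its own. */
#define _(String) (String)

/* One translatable template serves every internal error: the function
   name and the detail are passed as arguments, so only the string
   literal goes through _(). */
static void internal_error_message(char *out, size_t n,
                                   const char *func, const char *detail)
{
    snprintf(out, n,
             _("Internal error in %s: %s. Please report to the data.table issues tracker"),
             func, detail);
}
```

A call site would pass `__func__` as `func`, so translators only ever see the single template.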

MichaelChirico commented 2 months ago

[aside/FYI @rikivillalba if your permalink includes column numbers (C10, C23 in your 2nd bullet before my edit), it won't render inline in the issue. I don't know why GitHub sometimes includes the column numbers]

aitap commented 2 months ago

@jangorecki I'm not seeing calls to SETLENGTH in fmelt.c, but I do know that data.table uses negative TRUELENGTH for its internal purposes. Hmm.

@rikivillalba Thank you for reminding me! There's quite a lot of debugging printout that consists of argument/variable names that (arguably) should not be translated:

Which of these should have their _ removed? Which should be translated anyway?

MichaelChirico commented 2 months ago

for those, I think we need a system for marking "notranslate" at the same time, so that CI/release doesn't keep finding these strings that are in translatable calls, just not useful to translate.

In potools, that's // # notranslate.

Let's open another separate issue for those, it's a bit different from the topic at hand.

aitap commented 2 months ago

[x] This one indicates a problem with xgettext: the string eventually given to gettext() will not be translated because xgettext had split it into parts, including "%": https://github.com/Rdatatable/data.table/blob/d7e95d1fc1e102d3fd0c0cb448892e67116ac03a/src/assign.c#L944-L945

Edit: hmm, I don't see "at RHS position" at all in the official *.pot file. Nevertheless, I don't think xgettext handles compile-time string concatenation in general.

MichaelChirico commented 2 months ago

Re: the previous comment, any suggested fix? The first things that come to mind seem messy. Maybe we should functionalize the macro...

aitap commented 2 months ago

The core of the problem is that even if we do follow the recommendation to format the number into a temporary buffer and then use %s for all number formats, we still have a combinatorial explosion of 7 (TO) ⋅ 2 (target vector / column %d named '%s') strings that xgettext wants spelled out in the source code.

Can we cheat a little and split the clauses into two sentences? Then we'll have

_("%s (type '%s') at RHS position %d taken as TRUE")
_("%s (type '%s') at RHS position %d taken as 0")
_("%s (type '%s') at RHS position %d either truncated (precision lost) or taken as 0")
_("%s (type '%s') at RHS position %d out-of-range (NA)")
_("%s (type '%s') at RHS position %d out-of-range (NA) or truncated (precision lost)")
_("%s (type '%s') at RHS position %d had either imaginary part discarded or real part truncated (precision lost)")
_("%s (type '%s') at RHS position %d had imaginary part discarded")

and

_("Problem when assigning to type '%s' (target vector):")
_("Problem when assigning to type '%s' (column %d named '%s'):")

...which is more manageable. The strings could be rephrased further, making them truly separate sentences (The problem happened when assigning to type '%s' (target vector).).
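
A rough sketch of how the two halves could be combined, assuming a stand-in `_()` macro. The names `coerce_problem`, `problem_format` and `coerce_message` are hypothetical, and only three of the seven problem strings are shown:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for gettext's _() so the sketch compiles on its own. */
#define _(String) (String)

enum coerce_problem { TAKEN_AS_TRUE, TAKEN_AS_ZERO, OUT_OF_RANGE_NA };

/* Every msgid is one complete string literal that xgettext can extract. */
static const char *problem_format(enum coerce_problem p)
{
    switch (p) {
    case TAKEN_AS_TRUE: return _("%s (type '%s') at RHS position %d taken as TRUE");
    case TAKEN_AS_ZERO: return _("%s (type '%s') at RHS position %d taken as 0");
    default:            return _("%s (type '%s') at RHS position %d out-of-range (NA)");
    }
}

/* colname == NULL means the target is a plain vector, not a column. */
static void coerce_message(char *out, size_t n, const char *targetType,
                           int colnum, const char *colname, int64_t value,
                           const char *sourceType, int pos,
                           enum coerce_problem p)
{
    /* Format the number first so the translatable strings only need %s. */
    char num[32];
    snprintf(num, sizeof num, "%" PRId64, value);

    char head[160];
    if (colname == NULL)
        snprintf(head, sizeof head,
                 _("Problem when assigning to type '%s' (target vector):"),
                 targetType);
    else
        snprintf(head, sizeof head,
                 _("Problem when assigning to type '%s' (column %d named '%s'):"),
                 targetType, colnum, colname);

    char tail[160];
    snprintf(tail, sizeof tail, problem_format(p), num, sourceType, pos);

    snprintf(out, n, "%s %s", head, tail);
}
```

This way the catalog needs 7 + 2 strings instead of 7 ⋅ 2, and no format specifier is spliced in by the preprocessor.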

MichaelChirico commented 2 months ago

Would it be more helpful to mention character strings in the error message too?

cc @jangorecki. I'm not sure what change you have in mind exactly, but I think the call-out to consult ?substitute2 is the most helpful part. I worry adding to the message will make it over-complicated.

MichaelChirico commented 2 months ago

part of above message, perhaps difficult to translate out of context.

Good spot, there's a similar fragment below. It's a bit hard to pull apart exactly how we'd make this friendlier. I think the best bet is to make each fragment a standalone sentence. I'll want to read more carefully how exactly those show up in the verbose output. Filing as a separate issue so it's not lost, there's already a lot going on here.

Multiple small issues with messages in the C code #6503