Are these numbers appearing in a context (say Babel), where the current language is something that could be determined? If so, we could reasonably have the number-parser adapt.
I was hoping you would say something like this, ... yes, they are in a babel context. So, yes it would be great if the number parser could adapt. I think the main realization is that we need an infrastructure where the parser can adapt to local customs. I find that interesting and hope you do too :-).
Looking at recent developments, the international standardization bodies have made the use of comma and dot as separators in numbers completely interchangeable (as long as it is consistent).
The wiki page refers to a resolution that:
The 22nd General Conference on Weights and Measures declared in 2003 that "the symbol for the decimal marker shall be either the point on the line or the comma on the line". It further reaffirmed that "numbers may be divided in groups of three in order to facilitate reading; neither dots nor commas are ever inserted in the spaces between groups". This usage has therefore been recommended by technical organizations, such as the United States' National Institute of Standards and Technology.
It will probably be the context that holds the decisive cues on whether we are looking at a sequence, a grouping delimiter, or a decimal mark. The grouping delimiters are easy to spot - they always come in groups of 3, are used consistently, and repeat throughout the number. If we see delimiters that don't match this pattern, we are likely looking at a sequence of numbers.
Only the right-most mark is really ambiguous, and could be any of the 3 classes. Is 1,100 the number 1100, the decimal 1.1, or a sequence of the two numbers 1 and 100? I don't think we can safely determine that without understanding the relevant parts of the context; just knowing the language helps only partially.
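To make the heuristic concrete, here is a minimal sketch (in Python, purely illustrative; the function name and the exact rules are my assumptions, not existing LaTeXML code) of the "consistent groups of 3" test described above:

```python
import re

def classify_marks(token):
    """Rough heuristic: decide whether the marks in a numeric token look like
    group separators, a decimal mark, or neither (likely a sequence).
    Illustrative sketch only -- not the LaTeXML algorithm."""
    parts = re.split(r"[.,]", token)
    marks = re.findall(r"[.,]", token)
    if not marks:
        return "plain-number"
    # Group separators: every chunk after the first has exactly 3 digits,
    # and the same mark repeats throughout the number.
    if len(set(marks)) == 1 and all(len(p) == 3 for p in parts[1:]):
        # A single trailing 3-digit group is still ambiguous (the 1,100 case).
        return "ambiguous" if len(marks) == 1 else "group-separators"
    # Two different marks: the rightmost one is plausibly the decimal mark,
    # provided the marks to its left group consistently by 3.
    if len(set(marks)) == 2 and marks[-1] != marks[0] \
            and all(len(p) == 3 for p in parts[1:-1]):
        return "decimal-mark-last"
    return "sequence-or-unknown"

for t in ["1,100", "1,000,000.12", "1.000.000.000", "1, 2, 3"]:
    print(t, "->", classify_marks(t))
```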
Not so fast, young whippersnapper! The recommendation was that groups be in 3's, and that they not be separated by commas or periods. But of course, people still do separate them with the opposite of the decimal marker, and they'll sometimes group them by 5's, for sure (table makers tend to prefer 5); maybe other sizes. Note that the groups can also be in the fractional part! Since the most common case probably has only one group separator, it's hard to guess from just patterns, unless there also is a decimal separator...
So, 1,000.000 isn't necessarily a thousand.
Indeed, it gets harder with more scrutiny. :mag_right:
Contextual cues help. If we have a token soup containing numbers, spaces, dots and commas, then a `,` (comma-followed-by-a-space) or a `-d` (minus-followed-by-digit) is a good point to guess we're looking at a sequence. `1,001` is a hard example for this case. Here the language could help, but it may also help to see whether other numbers had clear delimiters used - if we assume a consistent style in the document, having parsed `1,000,000.12` earlier in the document will disambiguate this case for us. There is definitely interesting grammar work to be done here. Note: I made several updates to the comment, as I brainstormed later.
Yes, there is also some interesting grammar work to be done here, but in the spirit of our recent discussion on underspecified CDs, we should also think about the target representation. I am thinking of some symbol `ambiguous-float`, and would represent Bruce's example of 1,000.000 as
<apply>
  <csymbol cd="latexml" name="ambiguous-float"/>
  <cn>1</cn>
  <cn>000</cn>
  <cn>000</cn>
</apply>
but now that I have written this, the <cn> elements feel quite wrong, since <cn>000</cn> is the same as <cn>0</cn>.
In any case, we may want to have some underspecified representation here; my proposal is probably not right.
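To make explicit what such an underspecified representation would have to keep open, here is a small sketch (Python, illustrative only; not a proposal for LaTeXML's internals) that enumerates the candidate readings of a token like 1,000.000:

```python
def candidate_readings(token, group_sep=",", decimal_sep="."):
    """Enumerate plausible interpretations of a numeric token whose marks
    could be group separators, a decimal mark, or sequence punctuation.
    Illustrative sketch only."""
    readings = []
    # Reading 1: group_sep groups digits, decimal_sep is the decimal mark.
    plain = token.replace(group_sep, "")
    if plain.count(decimal_sep) <= 1:
        readings.append(("number", float(plain)))
    # Reading 2: the roles are swapped (e.g. German style), if that still parses.
    swapped = token.replace(decimal_sep, "").replace(group_sep, ".")
    if swapped.count(".") <= 1:
        readings.append(("number-swapped-roles", float(swapped)))
    # Reading 3: group_sep actually separates the items of a sequence.
    readings.append(("sequence", token.split(group_sep)))
    return readings

print(candidate_readings("1,000.000"))
# [('number', 1000.0), ('number-swapped-roles', 1.0), ('sequence', ['1', '000.000'])]
```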
Having recently made some headway dealing with languages, and recording them, I notice that babel itself has no indicators of number formatting. However the numprint package does. It would be conceivable to check for the numprint settings and use the current decimal separator when parsing numbers. Maybe there could be some additional backdoor method to specify the separator.
We should also consider Brazilian Portuguese among the languages where the comma is used as the decimal separator, as indicated by the now-closed #573.
Has there been any progress on this?
As much as I appreciate the general complexity of the problem, I do think we can pick some low-hanging fruit here easily. For instance, just make the current number parsing algorithm parametric in the language (which we pick up from babel, if present and otherwise just assume English), and set the separator accordingly. This "making parametric" in the language would probably make subsequent improvements easier.
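A minimal sketch of that "parametric in the language" idea, assuming a hand-maintained table of separators keyed on the babel/ISO language code (the table entries and function names are illustrative, not what LaTeXML ships):

```python
# Decimal/group separators per language code; "en" is the fallback.
# Entries are illustrative -- the real list would need checking per locale.
SEPARATORS = {
    "en": {"decimal": ".", "group": ","},
    "de": {"decimal": ",", "group": "."},
    "fr": {"decimal": ",", "group": " "},
    "pt": {"decimal": ",", "group": "."},
}

def parse_number(token, lang="en"):
    """Parse a numeric token under the conventions of `lang`,
    falling back to English when the language is unknown."""
    seps = SEPARATORS.get(lang, SEPARATORS["en"])
    cleaned = token.replace(seps["group"], "").replace(seps["decimal"], ".")
    return float(cleaned)

print(parse_number("1,000.5"))          # 1000.5 (English)
print(parse_number("0,717", lang="de")) # 0.717 (German decimal comma)
```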
I would really like to know about the ETA of the first level of support, otherwise we will have to work around this for the glossary.
It isn't inherently complex, just very awkward to find the right place to associate the language with its decimal and thousands separators, and how to effect the control. The numprint package is the only place I've seen in TeX that attempts to recognize the association, but it doesn't seem quite right to support the feature only with numprint.
Let me see if I can get some initial support built in, this weekend.
actually, it's a bit of a mess, but there's initial support for this now. It isn't really customizable, but I've copied the languages for which numprint gives special decimal and thousands separators. Hopefully that will work for you for the time being.
cool thanks.
@brucemiller What was the commit? Did you use https://en.wikipedia.org/wiki/Decimal_mark#Hindu.E2.80.93Arabic_numeral_system ?
Mainly that was in be42fbdfd1a5497cda3e31c339a1e51fba9d7c05. It's pretty clumsy.
No, I didn't use that page, but it's a good resource. The current code keys the switch on the ISO language code (with a short, hand-coded list: en, de, fr, nl, pt). So I'd have to convert that page's list into country codes. That's kinda tedious and I haven't got time now, but if you did... :>
I guess there are also cases where the region suffix matters, so I'd have to figure out a way to deal with that.
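One hedged way to deal with the region suffix (a sketch only; the locale strings, table values, and fallback policy here are assumptions, not decided behaviour): look up the full code like pt-BR first, then fall back to the bare language code.

```python
# Hypothetical table: region-qualified codes override the bare language code.
SEPARATORS = {
    "en":    {"decimal": ".", "group": ","},
    "pt":    {"decimal": ",", "group": "."},
    "pt-BR": {"decimal": ",", "group": "."},
    "es":    {"decimal": ",", "group": "."},
    "es-MX": {"decimal": ".", "group": ","},  # illustrative: some regions differ
}

def lookup_separators(locale):
    """Try the region-qualified code first (e.g. es-MX), then the bare
    language code (es), then fall back to English."""
    if locale in SEPARATORS:
        return SEPARATORS[locale]
    lang = locale.split("-")[0]
    return SEPARATORS.get(lang, SEPARATORS["en"])

print(lookup_separators("es-MX"))  # region-specific entry
print(lookup_separators("es-AR"))  # falls back to the bare "es" entry
```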
Apparently there are, since I noticed that Spanish countries are in both lists.
Spanish-speaking I meant... ;-)
ok; almost had to send you to diversity training!
Here is something I stumbled over today that I would like to discuss with you (@dginev may also find this interesting): When I use $0.717$ in LaTeX, I get [rendered output omitted], which is very nice. But when I am in German and I use $0,717$, I get [rendered output omitted], which is not so nice. There is a language discrepancy between the two. I am not sure that we want to parse $0,717$ as a float in English, especially since you can use dots as a divider for natural numbers in German, e.g. 1.000.000.000 for a billion. This is *not* a float.
We have been talking about language-specific markers for hyphenation recently, so becoming more language-aware may have a good side here.
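For concreteness, a small sketch (again Python, illustrative only, building on the earlier parametric-parser sketch) of how a language-aware parser could treat these two German examples while keeping the integer/float distinction:

```python
SEPARATORS = {
    "en": {"decimal": ".", "group": ","},
    "de": {"decimal": ",", "group": "."},
}

def parse_number(token, lang):
    """Interpret `token` under the conventions of `lang`; return an int when
    there is no decimal part, a float otherwise (illustrative sketch)."""
    seps = SEPARATORS[lang]
    cleaned = token.replace(seps["group"], "").replace(seps["decimal"], ".")
    return float(cleaned) if "." in cleaned else int(cleaned)

print(parse_number("0,717", "de"))          # 0.717, a float
print(parse_number("1.000.000.000", "de"))  # 1000000000, an integer, not a float
print(parse_number("0.717", "en"))          # 0.717, a float
```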