brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Problem handling German floats. #548

Closed kohlhase closed 9 years ago

kohlhase commented 9 years ago

Here is something I stumbled over today, and I would like to discuss it with you (@dginev may also find this interesting): when I use $0.717$ in LaTeX, I get

<Math mode="inline" xml:id="p1.m1" tex="0.717" text="0.717">
  <XMath>
    <XMTok meaning="0.717" role="NUMBER">
      0.717
    </XMTok>
  </XMath>
</Math>

which is very nice. But when I am in German and use $0,717$, I get

<Math mode="inline" xml:id="p1.m1" tex="0,717" text="list@(0, 717)">
  <XMath>
    <XMDual>
      <XMApp>
        <XMTok meaning="list"/>
        <XMRef idref="XM1"/>
        <XMRef idref="XM2"/>
      </XMApp>
      <XMWrap>
        <XMTok meaning="0" role="NUMBER" xml:id="XM1">0</XMTok>
        <XMTok role="PUNCT">,</XMTok>
        <XMTok meaning="717" role="NUMBER" xml:id="XM2">717</XMTok>
      </XMWrap>
    </XMDual>
  </XMath>
</Math>

not so nice. There is a language discrepancy between the two. I am not sure that we want to parse $0,717$ as a float in English, especially since in German you can use dots as a divider in natural numbers, e.g. 1.000.000.000 for a billion. This is ''not'' a float.

We have been talking about language-specific markers for hyphenation recently, so becoming more language-aware may have a good side here.

brucemiller commented 9 years ago

Are these numbers appearing in a context (say Babel), where the current language is something that could be determined? If so, we could reasonably have the number-parser adapt.

kohlhase commented 9 years ago

I was hoping you would say something like this, ... yes, they are in a babel context. So, yes it would be great if the number parser could adapt. I think the main realization is that we need an infrastructure where the parser can adapt to local customs. I find that interesting and hope you do too :-).

dginev commented 9 years ago

Looking at recent developments, the international standardization bodies have made the use of comma and dot as separators in numbers completely interchangeable (as long as it is consistent).

The wiki page refers to a resolution that:

The 22nd General Conference on Weights and Measures declared in 2003 that 
"the symbol for the decimal marker shall be either the point on the line or the
 comma on the line". It further reaffirmed that "numbers may be divided in groups
 of three in order to facilitate reading; neither dots nor commas are ever inserted
 in the spaces between groups".[10] This usage has therefore been recommended
 by technical organizations, such as the United States'
 National Institute of Standards and Technology.

It will probably be the context that holds the decisive cues on whether we are looking at a sequence, a grouping delimiter, or a decimal mark. The grouping delimiters are easy to spot - they are always in groups of 3, are used consistently and repeat throughout the number. If we see delimiters that don't match this, we are likely looking at a sequence of numbers.

Only the right-most mark is really ambiguous, and could belong to any of the 3 classes. Is 1,100 the number 1100, the decimal 1.1, or a sequence of two numbers, 1 and 100? I don't think we can safely determine that without understanding the relevant parts of the context; just knowing the language helps only partially.
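To make the ambiguity concrete, here is a minimal Python sketch (not LaTeXML code) that lists which readings the pattern alone cannot rule out for a number with a single mark. The function name `possible_readings`, the default group size of 3, and the rule that only a comma can separate a sequence are all assumptions made for illustration.

```python
import re


def possible_readings(text, group_size=3):
    """Return the readings that the digit pattern alone cannot exclude.

    Hypothetical helper: handles only 'digits, one mark, digits'.
    """
    match = re.fullmatch(r"(\d+)([.,])(\d+)", text)
    if match is None:
        raise ValueError(f"expected digits, one mark, digits: {text!r}")
    left, mark, right = match.groups()

    # A decimal reading is always possible for a single mark.
    readings = {"decimal"}

    # Grouping needs a full group on the right and a short,
    # non-zero-padded group on the left (assumed group size).
    if (len(right) == group_size
            and len(left) <= group_size
            and not left.startswith("0")):
        readings.add("grouping")

    # Assumption: only the comma is used to separate list items.
    if mark == ",":
        readings.add("sequence")
    return readings


print(sorted(possible_readings("1,100")))
print(sorted(possible_readings("0,717")))
```

For 1,100 all three readings survive, which is exactly the case where language or surrounding context would have to decide.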

brucemiller commented 9 years ago

Not so fast, young whippersnapper! The recommendation was that groups be in 3's, not separated by commas or periods. But of course, people still do separate them by the opposite of the decimal marker, and they'll sometimes group them by 5's, for sure ("Table makers" tend to prefer 5); maybe other sizes. Note that the groups can also be in the fractional part! Since the most common case probably has only one group separator, it's hard to guess from just patterns, unless there also is a decimal separator...

So, 1,000.000 isn't necessarily a thousand.

dginev commented 9 years ago

Indeed, it gets harder with more scrutiny. :mag_right:

Contextual cues help. If we have a token soup containing numbers, spaces, dots and commas:

There is definitely interesting grammar work to be done here. Note: I made several updates to the comment, as I brainstormed later.

kohlhase commented 9 years ago

Yes, there is also some interesting grammar work to be done here, but in the spirit of our recent discussion on underspecified CDs, we should also think about the target representation. I am thinking of some symbol ambiguous-float, and of representing Bruce's example of 1,000.000 as

<apply> 
    <csymbol cd="latexml" name="ambiguous-float"/>
    <cn>1</cn>
    <cn>000</cn>
    <cn>000</cn>
</apply>

but now that I have written this, the <cn> elements feel quite wrong, since <cn>000</cn> is the same as <cn>0</cn>. In any case, we may want to have some underspecified representation here; my proposal is probably not right.

brucemiller commented 9 years ago

Having recently made some headway dealing with languages, and recording them, I notice that babel itself has no indicators of number formatting. However, the numprint package does. It would be conceivable to check for the numprint settings and use the current decimal separator when parsing numbers. Maybe there could be some additional backdoor method to specify the separator.

dginev commented 9 years ago

We should also consider Brazilian Portuguese among the languages where the comma is used as the decimal separator, as indicated by the now closed #573.

kohlhase commented 9 years ago

Has there been any progress on this?

As much as I appreciate the general complexity of the problem, I do think we can pick some low-hanging fruit here easily. For instance, just make the current number parsing algorithm parametric in the language (which we pick up from babel, if present and otherwise just assume English), and set the separator accordingly. This "making parametric" in the language would probably make subsequent improvements easier.
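A rough Python sketch of what "parametric in the language" could look like (this is not how LaTeXML is implemented). The table `SEPARATORS` and the function `parse_number` are hypothetical names, and the table holds only the two languages from this thread; anything unknown falls back to English, as suggested above.

```python
from decimal import Decimal

# Hypothetical table: language code -> (decimal mark, grouping mark).
SEPARATORS = {
    "en": (".", ","),
    "de": (",", "."),
}


def parse_number(text, language="en"):
    """Parse a number using the separators of the given language."""
    # Unknown languages fall back to the English convention.
    decimal_mark, group_mark = SEPARATORS.get(language, SEPARATORS["en"])
    # Drop grouping marks first, then normalize the decimal mark to '.'.
    normalized = text.replace(group_mark, "").replace(decimal_mark, ".")
    return Decimal(normalized)


print(parse_number("0,717", "de"))
print(parse_number("1.000.000.000", "de"))
print(parse_number("0.717"))
```

Note that this deliberately ignores the harder cases raised earlier (sequences, groups of other sizes); it only shows where the language parameter would enter.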

kohlhase commented 9 years ago

I would really like to know about the ETA of the first level of support, otherwise we will have to work around this for the glossary.

brucemiller commented 9 years ago

It isn't inherently complex, just very awkward to find the right place to associate the language with its decimal and thousands separators, and how to effect the control. The numprint package is the only place I've seen in TeX that attempts to recognize the association, but it doesn't seem quite right to support the feature only with numprint.

Let me see if I can get some initial support built in, this weekend.

brucemiller commented 9 years ago

actually, it's a bit of a mess, but there's initial support for this now. It isn't really customizable, but I've copied the languages for which numprint gives special decimal and thousands separators. Hopefully that will work for you for the time being.

kohlhase commented 9 years ago

cool thanks.

fred-wang commented 9 years ago

@brucemiller What was the commit? Did you use https://en.wikipedia.org/wiki/Decimal_mark#Hindu.E2.80.93Arabic_numeral_system ?

brucemiller commented 9 years ago

Mainly that was in be42fbdfd1a5497cda3e31c339a1e51fba9d7c05. It's pretty clumsy.

No, I didn't use that page, but it's a good resource. The current code keys the switch on the ISO language code (with a short, hand-coded list: en, de, fr, nl, pt). So I'd have to convert that page's list of countries into language codes. That's kinda tedious and I haven't got time now, but if you did... :>

I guess there are also cases where the region suffix matters, so I'd have to figure out a way to deal with that.
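One conceivable way to deal with it, sketched in Python rather than in LaTeXML's own code: look up the full tag first and strip suffixes until something matches. The function `lookup_separators` and the table entries are assumptions for illustration only (Spanish is used because Spain writes the decimal with a comma while Mexico uses a point).

```python
# Hypothetical table: language tag -> (decimal mark, grouping mark).
TABLE = {
    "en": (".", ","),
    "es": (",", "."),
    "es-MX": (".", ","),
}


def lookup_separators(tag, table=TABLE, default="en"):
    """Find separators for a tag, falling back from region to language."""
    parts = tag.replace("_", "-").split("-")
    while parts:
        key = "-".join(parts)
        if key in table:
            return table[key]
        # No entry for this tag: drop the last suffix and retry.
        parts.pop()
    return table[default]


print(lookup_separators("es-MX"))
print(lookup_separators("es-AR"))
print(lookup_separators("xx"))
```

With this shape, the short language-only list keeps working unchanged, and region-specific entries are added only where a region deviates from its language default.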

fred-wang commented 9 years ago

> I guess there are also cases where the region suffix matters, so I'd have to figure out a way to deal with that.

Apparently there are, since I noticed that Spanish countries are in both lists.

fred-wang commented 9 years ago

> Apparently there are, since I noticed that Spanish countries are in both lists.

Spanish-speaking I meant... ;-)

brucemiller commented 9 years ago

ok; almost had to send you to diversity training!