Unidata / UDUNITS-2

API and utility for arithmetic manipulation of units of physical quantities
http://www.unidata.ucar.edu/software/udunits

udunits2 grammar doesn't reflect the implementation #81

Open pelson opened 5 years ago

I'm working on the udunits2 grammar for a situation where I'd like to produce a LaTeX representation of an un-interpreted, udunits2-valid string (ref). To be clear, I do mean un-interpreted here - km km-1 and km/km should both produce something like \frac{km}{km}, which I believe rules out using the actual ut_parse parser (happy to hear otherwise!).

I've found a number of cases where the documented grammar and the actual behaviour of ut_parse disagree about whether a string produces a valid ut_unit. In most cases the behaviour of udunits-2 is correct, and the documented grammar is simply wrong.

Cases of incorrect grammar identified:

  1. ~Shift spec words must have leading spaces. For example m from2 is valid, but mfrom2 is not, yet m@2 is fine.~

     <shift_op>: one of
            "@"
            "after"
            "from"
            "since"
            "ref"

    ~should be~

     <shift_op>: one of
             "@"
             " after"
             " from"
             " since"
             " ref"

    ~(same is true for per and PER).~ EDIT: I was wrong about this. I got my identifiers wrong.

  1. The grammar states that "ISO-8859-1 alphabetic characters" may be part of <id> (via <alpha>), but it doesn't make clear that other characters may also work (e.g. π). (I think I'm right in saying that π isn't in ISO-8859-1, but Unicode has never been my strong suit.)

  2. CLOCK is documented as <hour> ":" <minute> (":" <second>)? but it looks like it is really <hour> (":" <minute> (":" <second>)?)?. (Does this happen because of the packed_clock format?)

  3. There is no mention of the special cases of UTC, Z and GMT for the case DATE CLOCK ID seen in https://github.com/Unidata/UDUNITS-2/blob/v2.2.27.6/lib/parser.y#L447-L451.

  4. TIMSTAMP -> TIMESTAMP (typo)
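The CLOCK discrepancy in item 2 can be made concrete with a pair of regular expressions. This is only a sketch: the <hour> and <minute> patterns are my assumption of what the documented grammar intends (sign prefix omitted), <second> follows the rule quoted further down in this issue, and none of it is copied from the udunits-2 lexer.

```python
import re

# Assumed component patterns (simplified; not copied from the udunits-2 lexer).
HOUR = r"(?:[01]?[0-9]|2[0-3])"
MINUTE = r"[0-5]?[0-9]"
SECOND = rf"(?:{MINUTE}|60)(?:\.[0-9]*)?"

# CLOCK as documented: <hour> ":" <minute> (":" <second>)?
CLOCK_DOCUMENTED = re.compile(rf"{HOUR}:{MINUTE}(?::{SECOND})?")

# CLOCK as ut_parse appears to behave: <hour> (":" <minute> (":" <second>)?)?
CLOCK_OBSERVED = re.compile(rf"{HOUR}(?::{MINUTE}(?::{SECOND})?)?")


def accepted_by(text):
    """Return which of the two CLOCK variants fully match `text`."""
    return {
        "documented": CLOCK_DOCUMENTED.fullmatch(text) is not None,
        "observed": CLOCK_OBSERVED.fullmatch(text) is not None,
    }


for sample in ("12:30:15.5", "12:30", "12"):
    print(sample, accepted_by(sample))
```

A bare hour such as 12 is the only one of these samples where the two variants differ, which is the gap between the documentation and what I see from ut_parse.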

Cases that udunits might be doing the wrong thing:

  1. It seems that ut_parse can't handle Unicode exponents greater than 3 for non-numeric values. m³ is fine but m⁴ is not. Interestingly, ut_format produces m⁴ for an input of m+4 (as expected). 2⁴ works just fine though (as does 2⁻⁴²).

  2. ~The grammar states that:~

    <second>:
              (<minute>|60) (\.[0-9]*)?

    ~But I can't see that udunits is actually enforcing this:~

    $ udunits2 -H 's since 1990-1-1 0:0:61' -W 's since 1990-1-1 0:0:0'
    1 s since 1990-1-1 0:0:61 = -3593 (s since 1990-1-1 0:0:0)
    x/(s since 1990-1-1 0:0:0) = (x/(s since 1990-1-1 0:0:61)) - 3594

    ~The same appears to be true for all other clamped timestamp components.~

    UPDATE: It seems that s since 1990-1-1 0:0:62 is actually identified as s since 1990-1-1 0:0:06 +2(hours), which is definitely valid as part of the grammar (but is that the behaviour that was intended?)

  1. ut_parse reads s since 199022T1 as s @ 19911003T010000.00000000 UTC (that's s @ 1991-10-03). Given the definition of <month> ("0"?[1-9]|1[0-2]) I was expecting this to be 1990-02-02, though to be honest I would have preferred it to fail.
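For the Unicode exponent case in the list above, one possible workaround on the caller's side is to rewrite superscripts as ASCII exponents before calling ut_parse. The sketch below is plain Python and does not touch udunits at all; the helper name ascii_exponents is made up for illustration, and I am assuming the ^ exponent form is acceptable to ut_parse without having run this output through the library. As a side note, ¹, ² and ³ are ISO-8859-1 characters (U+00B9, U+00B2, U+00B3) whereas ⁴ and above sit in the U+2070 block; whether that is related to the failure I don't know.

```python
import re

# Superscript digits and signs, mapped position by position to ASCII.
_SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻"
_TO_ASCII = str.maketrans(_SUPERSCRIPTS, "0123456789+-")
_SUPERSCRIPT_RUN = re.compile(f"[{_SUPERSCRIPTS}]+")


def ascii_exponents(unit):
    """Rewrite each run of superscript characters as an explicit ^exponent."""
    return _SUPERSCRIPT_RUN.sub(
        lambda m: "^" + m.group(0).translate(_TO_ASCII), unit
    )


print(ascii_exponents("m⁴"))
print(ascii_exponents("kg m² s⁻²"))
```

This only sidesteps the problem for callers; it says nothing about why the parser treats m⁴ and 2⁴ differently.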

I'm raising this issue so that I can keep track of what I found here, and so that I can start the ball rolling on having a machine- and human-readable grammar that can be tested systematically (either here or upstream in a project like cf-units). My intention is to re-create a grammar based on the ANTLR specification - the choice is somewhat arbitrary, but ANTLR does allow a number of useful tools, including multi-language support (pretty useful for testing!) and debugging/visualisation of the grammar (the latter I've not yet gotten working on my machine though 😞). Naturally I'm aware of the Lex-Yacc content of the udunits-2 codebase, but have found very few tools other than bison for working with the format.
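As a rough idea of what "tested systematically" could look like, here is a minimal table-driven check. Everything in it is hypothetical apart from the three cases, which are the Unicode exponent observations reported above; the recognizer is any callable that says whether a string parses, and accept_all is only a stand-in so the sketch runs on its own.

```python
# (input string, whether ut_parse accepted it as reported in this issue)
REPORTED_CASES = [
    ("m⁴", False),
    ("2⁴", True),
    ("2⁻⁴²", True),
]


def mismatches(recognizer, cases=REPORTED_CASES):
    """List (text, expected, actual) for every case the recognizer gets wrong."""
    found = []
    for text, expected in cases:
        actual = bool(recognizer(text))
        if actual != expected:
            found.append((text, expected, actual))
    return found


# Hypothetical stand-in recognizer that accepts everything; a real one would
# wrap ut_parse or a parser generated from the new grammar.
def accept_all(text):
    return True


print(mismatches(accept_all))
```

Running the same table against both ut_parse and a generated parser would show exactly where the documented grammar and the implementation diverge.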

I hope you don't find this issue to be pernickety - that is definitely not my intention! My main question is: Do you support me updating the documented grammar to be a readable AND machine/testable ANTLR grammar (subject to readability, of course)?