I'm working on the udunits2 grammar for a situation where I'd like to produce a LaTeX representation of an un-interpreted, udunits2-valid string (ref). To be clear, I do mean un-interpreted here: km km-1 and km/km should both produce something like \frac{km, km}, which I believe rules out using the actual ut_parse parser (happy to hear otherwise!).
I've found a number of cases that, according to the documented grammar, should fail to produce a successful ut_unit. In most of these cases the behaviour of udunits-2 is the correct thing, and the documented grammar is just wrong.
Cases of incorrect grammar identified:
~Shift spec words must have leading spaces. For example m from2 is valid, but mfrom2 is not, yet m@2 is fine.~
<shift_op>: one of
"@"
"after"
"from"
"since"
"ref"
~should be~
<shift_op>: one of
"@"
" after"
" from"
" since"
" ref"
~(same is true for per and PER).~
EDIT: I was wrong about this. I got my identifiers wrong.
The grammar states that "ISO-8859-1 alphabetic characters" may be part of <id> (via <alpha>), but it isn't clear that other characters may also work (e.g. π) (I think I'm right in saying that π isn't in ISO-8859-1, but unicode has never been my strong suit).
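For what it's worth, the claim about π is easy to check from Python. This is only a quick sketch using the standard codecs (the helper name `in_latin1` is mine, not anything from udunits2): it asks whether a character can be encoded as ISO-8859-1, using the micro sign as a character that is in the set and π as one that is not.

```python
import unicodedata


def in_latin1(ch: str) -> bool:
    """Return True if the single character `ch` is encodable as ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# U+00B5 is MICRO SIGN, U+03C0 is GREEK SMALL LETTER PI.
for ch in ("\u00b5", "\u03c0"):
    print(ch, unicodedata.name(ch), in_latin1(ch))
```

So π really is outside ISO-8859-1, which means the documented <alpha> rule does not cover it.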
CLOCK is documented as <hour> ":" <minute> (":" <second>)? but it looks like it is really <hour> (":" <minute> (":" <second>)?)?. (Does this happen because of the packed_clock format?)
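To make the difference between the two CLOCK forms concrete, here is a rough sketch of both as Python regular expressions. The <second> pattern follows the documented definition quoted in this issue; the HOUR and MINUTE patterns are simplified assumptions of mine (no sign handling, for instance) and are not copied from the udunits2 sources.

```python
import re

# Simplified, assumed sub-patterns (not taken from the udunits2 sources).
HOUR = r"(?:[01]?[0-9]|2[0-3])"
MINUTE = r"[0-5]?[0-9]"
# Documented form: (<minute>|60) (\.[0-9]*)?
SECOND = rf"(?:{MINUTE}|60)(?:\.[0-9]*)?"

# CLOCK as documented: <hour> ":" <minute> (":" <second>)?
DOCUMENTED_CLOCK = re.compile(rf"{HOUR}:{MINUTE}(?::{SECOND})?")
# CLOCK as it appears to behave: <hour> (":" <minute> (":" <second>)?)?
OBSERVED_CLOCK = re.compile(rf"{HOUR}(?::{MINUTE}(?::{SECOND})?)?")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


for clock in ("1", "0:0", "0:0:60"):
    print(clock, matches(DOCUMENTED_CLOCK, clock), matches(OBSERVED_CLOCK, clock))
```

The only difference is that the observed form lets the hour stand alone, as in the T1 of s since 199022T1.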
Cases that udunits might be doing the wrong thing:
It seems that ut_parse can't handle Unicode exponents greater than 3 on non-numeric values. m³ is fine but m⁴ is not. Interestingly, ut_format produces m⁴ for an input of m+4 (as expected). 2⁴ works just fine though (as does 2⁻⁴²).
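As a possible stop-gap on the calling side, the superscript exponents could be rewritten to ASCII before the string reaches ut_parse. The sketch below is hypothetical: the function name `ascii_exponents` is mine, and it assumes the "^" exponent form is acceptable to the parser, which I have not verified against these particular inputs.

```python
import re

# Superscript digits 0-9 followed by superscript plus and minus, written as
# escapes so the exact code points are unambiguous.
SUPERSCRIPTS = (
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079\u207a\u207b"
)
_TO_ASCII = str.maketrans(SUPERSCRIPTS, "0123456789+-")
_SUPERSCRIPT_RUN = re.compile("[" + re.escape(SUPERSCRIPTS) + "]+")


def ascii_exponents(unit: str) -> str:
    """Replace each run of superscript characters with '^' plus its ASCII form."""
    return _SUPERSCRIPT_RUN.sub(
        lambda m: "^" + m.group(0).translate(_TO_ASCII), unit
    )


print(ascii_exponents("m\u2074"))              # m to the superscript 4
print(ascii_exponents("2\u207b\u2074\u00b2"))  # 2 to the superscript -42
```

This only changes the spelling of the exponent; it says nothing about why ut_parse rejects m⁴ in the first place.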
~The grammar states that:~
<second>:
(<minute>|60) (\.[0-9]*)?
~But I can't see that udunits is actually enforcing this:~
$ udunits2 -H 's since 1990-1-1 0:0:61' -W 's since 1990-1-1 0:0:0'
1 s since 1990-1-1 0:0:61 = -3593 (s since 1990-1-1 0:0:0)
x/(s since 1990-1-1 0:0:0) = (x/(s since 1990-1-1 0:0:61)) - 3594
~The same appears to be true for all other clamped timestamp components.~
UPDATE: It seems that s since 1990-1-1 0:0:62 is actually identified as s since 1990-1-1 0:0:06 +2(hours), which is definitely valid as part of the grammar (but is that the behaviour that was intended?)
ut_parse reads s since 199022T1 as s @ 19911003T010000.00000000 UTC (that's s @ 1991-10-03). Given the definition of <month> ("0"?[1-9]|1[0-2]) I was expecting this to be 1990-02-02, though to be honest I would have preferred it to fail.
I'm raising this issue so that I can keep track of what I found here, and so that I can start the ball rolling on having a machine- and human-readable grammar that can be tested systematically (either here or upstream in a project like cf-units). My intention is to re-create the grammar based on the ANTLR specification. The choice is somewhat arbitrary, but ANTLR does come with a number of useful tools, including multi-language support (pretty useful for testing!) and debugging/visualisation of the grammar (the latter I've not yet gotten working on my machine though 😞). Naturally I'm aware of the Lex-Yacc content of the udunits-2 codebase, but have found very few tools other than bison for working with the format.
I hope you don't find this issue to be pernickety - that is definitely not my intention!
My main question is: Do you support me updating the documented grammar to be a readable AND machine-testable ANTLR grammar (subject to readability, of course)?
Further cases of incorrect grammar identified:
There is no mention of the special cases of UTC, Z and GMT for the case DATE CLOCK ID seen in https://github.com/Unidata/UDUNITS-2/blob/v2.2.27.6/lib/parser.y#L447-L451.
TIMSTAMP -> TIMESTAMP (typo)