Open durack1 opened 2 years ago
A PR that catches this, and adds more leeway in the formatting (consistent with CF) would be welcome.
@jswhit I am a little confused about how the regex following the CF standards at https://github.com/Unidata/cftime/blob/master/src/cftime/_cftime.pyx#L45-L47 isn't catching this; it's an out-of-bounds character.
I'm unfamiliar with this library, so would need some pointers to get started
Sorry, can't be of much help - as the comment says this regex was lifted from http://delete.me.uk/2005/03/iso8601.html but apparently that link no longer exists. I know almost nothing about regexes.
Hi - I've looked at this regex in the (dim and distant) past - happy to take a look here, if that's OK.
The quick test below shows the regex passing in the "letter O" case, but with too many Nones in its matched groups. I'll have a look at whether the regex could/should be updated, or whether a check in cpdef _parse_date(datestring) (https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L750), which is where the regex is used, might be an alternative.
>>> import re
>>> ISO8601_REGEX = re.compile(
...     r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
...     r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
...     r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
... )  # From https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L45-L48
>>> ISO8601_REGEX.match('2001-01-01').groups() # All numbers
('2001',
'-01-01',
'01',
'-01',
'01',
'',
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None)
>>> ISO8601_REGEX.match('20O1-01-01').groups() # Letter O
('20',
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None)
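The session above suggests why the "letter O" string is not rejected: `re.match` only anchors at the start of the string, and everything after the year group is optional, so the regex happily matches just `20` and stops. A minimal sketch of one possible check follows, assuming it would be enough to look at how much of the string was consumed. The helper name `unmatched_tail` is hypothetical and is not part of cftime.

```python
import re

# Same pattern as in the session above (copied from cftime v1.6.2rel).
ISO8601_REGEX = re.compile(
    r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
    r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
    r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
)


def unmatched_tail(datestring):
    """Hypothetical helper: return whatever the regex did NOT consume.

    An empty string means the whole datestring was matched.
    """
    m = ISO8601_REGEX.match(datestring)
    if m is None:
        # Nothing matched at all, so the whole string is "left over".
        return datestring
    return datestring[m.end():]


print(repr(unmatched_tail("2001-01-01")))  # ''
print(repr(unmatched_tail("20O1-01-01")))  # 'O1-01-01'
```

A non-empty tail could then be turned into an error that names the offending characters. Whether that belongs in the regex itself or in `_parse_date` is the open question from the comment above.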
Thanks @davidhassell! I hit a weird duplication issue in a simpler regex I was using: it validated correctly, but then seemed to cycle over things again and again. https://regex101.com/ was really helpful in pointing that out to me.
I've found a delightful edge case that is a little hard to believe. It involves a netCDF time:units attribute that includes a character outside of the [0-9,-] range. If it's not obvious from the below, the issue is that the file has time:units = "days since 20O1-1-1" whereas this should be time:units = "days since 2001-1-1" (so replacing the rogue letter "O" (oooh) with the numeral "0" (zero)).
The file is a 297MiB file downloadable from here
Below is the example reproducing the error:
I wonder if a regex check would be useful to implement? This problem tripped me up for a while, and it was not at all obvious that an incorrect character (which looks almost identical, depending on fonts) was the root cause. Testing for a datestring that matches the regex r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]" could be a useful test to catch such a fringe case - and point out the issue obviously in the error message. It seems in the CF Conventions docs that there is little leeway in this format, so using "/" or alternative MM-DD-YYYY formats to the standard [YYY]Y-[M]M-[D]D HH:MM:SS.ss [-]0:00
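For illustration, the check suggested above could look roughly like the sketch below. It uses the regex exactly as proposed in this issue. The function name `check_units_date` and the wording of the error message are hypothetical, and only the leading date of the reference time is tested (anything after it, such as a time of day, is ignored).

```python
import re

# Regex as proposed in this issue: a 2- or 4-digit year, then month and day.
DATE_CHECK = re.compile(r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]")


def check_units_date(units):
    """Hypothetical pre-check for a units string such as 'days since 2001-1-1'.

    Returns the leading date text if it looks like a date, otherwise raises
    ValueError with a message that points at the reference date.
    """
    _, sep, datestring = units.partition(" since ")
    if not sep:
        raise ValueError(f"no 'since' found in units string {units!r}")
    datestring = datestring.strip()
    m = DATE_CHECK.match(datestring)  # anchored at the start of the date only
    if m is None:
        raise ValueError(
            f"reference date {datestring!r} in units {units!r} does not look "
            "like YYYY-MM-DD; check for non-numeric characters such as the "
            "letter 'O' in place of the numeral '0'"
        )
    return m.group(0)


print(check_units_date("days since 2001-1-1"))  # 2001-1-1
try:
    check_units_date("days since 20O1-1-1")
except ValueError as err:
    print(err)
```

Note that this pattern is stricter than the [YYY]Y form mentioned above: it only accepts 2- or 4-digit years and no sign, so it would need loosening before being used as a general CF check.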
And just because https://github.com/pydata/xarray/discussions/7144