97jaz / gregor

Date and time library for Racket

parse-date fails to parse "20191115" with pattern "yyyyMMdd" #41

Closed evdubs closed 4 years ago

evdubs commented 4 years ago

Similar to #35, I see the following with the newest code:

> (parse-date "20191115" "yyyyMMdd")
; Unable to match pattern [MM] against input "" [,bt for context]

When I do the following, it works:

> (parse-date "20191115" "20yyMMdd")
#<date 2019-11-15>

97jaz commented 4 years ago

Yes, but as far as I can tell, CLDR doesn't have a pattern for a year with exactly four digits. yyyy means at least four digits. So what's happening is that the yyyy pattern correctly matches the entire input, leaving nothing for MM.
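To make the failure mode concrete, here is a minimal Java sketch of parsing each field independently and greedily. It is not gregor's code: `GreedyFieldDemo` and `matchAtLeast` are hypothetical names, and treating MM as "at least two digits" is an assumption made only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyFieldDemo {
    // Hypothetical helper: match one numeric field at the start of `input`,
    // requiring at least `minDigits` digits and taking as many as are available.
    // Returns the matched text, or null if the field does not match.
    public static String matchAtLeast(String input, int minDigits) {
        Matcher m = Pattern.compile("^\\d{" + minDigits + ",}").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "20191115";
        String year = matchAtLeast(input, 4);          // yyyy: at least four digits
        String rest = input.substring(year.length());  // what is left over for MM
        System.out.println("year=" + year + " rest=\"" + rest + "\"");
        // prints: year=20191115 rest=""
        System.out.println("month=" + matchAtLeast(rest, 2));
        // prints: month=null
    }
}
```

Because the year field never gives digits back, the month field sees an empty string, which matches the error message in the report.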

Unfortunately, on the parsing side, CLDR leans heavily on the notion of lenient parsing, which I think is a bad default.

Apparently JodaTime has the same behavior, while java.time, which was based on Joda, does not, so it would be interesting to see if there was an explicit discussion about this somewhere. If that library has a principled approach to this, then I could adapt it.
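For comparison, the java.time side can be checked with a few lines. This is only a quick sketch: the class name `JavaTimeCheck` and the method `parseBasic` are made up for the example, while `DateTimeFormatter.ofPattern` and `LocalDate.parse` are the standard java.time calls.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class JavaTimeCheck {
    // Parse an eight-digit date using the same pattern letters as in the report.
    public static LocalDate parseBasic(String input) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMMdd");
        return LocalDate.parse(input, formatter);
    }

    public static void main(String[] args) {
        System.out.println(parseBasic("20191115")); // prints: 2019-11-15
    }
}
```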

I'll take a look, but the current behavior is intentional, believe it or not.

evdubs commented 4 years ago

At least for this particular example, here's what I see using JodaTime

$ jshell --class-path joda-time-2.10.5.jar
jshell> import org.joda.time.LocalDate;
jshell> import org.joda.time.format.DateTimeFormat;
jshell> LocalDate.parse("20191115", DateTimeFormat.forPattern("yyyyMMdd"));
$3 ==> 2019-11-15

97jaz commented 4 years ago

Well, that's interesting. I was just going by what I read in that post. Maybe that code was changed at some point? I was really hoping to find a mailing list discussion about this very point. I'll keep looking a bit more.

97jaz commented 4 years ago

Oh, well, now looking at the post again, the poster was using the MMM pattern for the month, which is not a numeric pattern, so it's not surprising that the pattern didn't match. The fact that it parsed a 5-digit year, though, is interesting.

But I now have an idea of how I might deal with this problem specifically. The general problem, though, isn't really solvable without fixed-length patterns.

evdubs commented 4 years ago

I see the behavior pasted above going back to joda-time-2.0.jar, which was released in July 2011. The joda-time-1.x releases don't have static parse methods attached to the date implementation, and I am having trouble finding Javadocs for them, so I am unsure how they would work.

evdubs commented 4 years ago

Maybe some kind of user control is acceptable, where the user indicates "I know I am giving you a year format that can match more than 4 digits, but don't be greedy parsing the year." Kind of like how regular expressions offer the lazy .*? as an alternative to the greedy .*

97jaz commented 4 years ago

The main weakness of the current implementation is that it parses each pattern independently. If it built a single regexp for the whole pattern, it would be a lot better. Localization concerns make this a non-trivial change, but I'm definitely going to look into it.

97jaz commented 4 years ago

But now I'm reminded that not all fields are parsed by regexp. Still, backtracking could be used to achieve the same goal (if rather less efficiently).
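As a rough illustration of the single-regexp idea, the sketch below hand-translates yyyyMMdd into one Java regexp and lets the regexp engine do the backtracking. The translation is an assumption for this one numeric-only pattern (`WholePatternDemo` and `parse` are hypothetical names); it does not address localized or non-regexp fields.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholePatternDemo {
    // Assumed hand translation of "yyyyMMdd": a year of at least four digits,
    // then a two-digit month and a two-digit day, anchored at both ends.
    static final Pattern YYYYMMDD = Pattern.compile("^(\\d{4,})(\\d{2})(\\d{2})$");

    // Returns {year, month, day}, or null if the input does not match.
    public static int[] parse(String input) {
        Matcher m = YYYYMMDD.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    public static void main(String[] args) {
        // The year group first takes all eight digits, then gives digits back
        // until the month and day groups can match as well.
        System.out.println(Arrays.toString(parse("20191115")));  // prints: [2019, 11, 15]
        System.out.println(Arrays.toString(parse("120191115"))); // prints: [12019, 11, 15]
    }
}
```

The year still means "at least four digits" here, so a five-digit year keeps working. The cost is that the whole pattern has to be expressible as one regexp.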

97jaz commented 4 years ago

Looks to me like Joda handles this issue as a special case. If the pattern variable following the year one is also looking for a number, then it treats the year pattern as requiring a fixed number of digits. So in yyyyMMdd, yyyy would mean "exactly four digits," whereas in yyyy-MMM-dd, it would mean "at least four digits."

This doesn't seem like a bad idea.
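The rule described above can be sketched in Java as follows. This is neither Joda's nor gregor's implementation, and every name in it is hypothetical. It only understands y, M, and d, and it reads a numeric month or day as exactly as many digits as there are pattern letters, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AdjacentNumericDemo {
    // Split a pattern such as "yyyy-MMM-dd" into runs of identical characters:
    // ["yyyy", "-", "MMM", "-", "dd"].
    static List<String> tokenize(String pattern) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < pattern.length()) {
            int j = i;
            while (j < pattern.length() && pattern.charAt(j) == pattern.charAt(i)) {
                j++;
            }
            tokens.add(pattern.substring(i, j));
            i = j;
        }
        return tokens;
    }

    // In this sketch a field is numeric if it is a run of y or d, or a run of
    // one or two M (three or more M stand for a month name).
    static boolean isNumeric(String token) {
        char c = token.charAt(0);
        return c == 'y' || c == 'd' || (c == 'M' && token.length() <= 2);
    }

    // Build a regexp for the whole pattern, applying the adjacency rule to years.
    public static String toRegex(String pattern) {
        List<String> tokens = tokenize(pattern);
        StringBuilder regex = new StringBuilder();
        for (int k = 0; k < tokens.size(); k++) {
            String token = tokens.get(k);
            char c = token.charAt(0);
            boolean nextIsNumeric = k + 1 < tokens.size() && isNumeric(tokens.get(k + 1));
            if (c == 'y') {
                // The special case: a year followed directly by another number
                // means exactly N digits; otherwise it means at least N digits.
                regex.append("(\\d{").append(token.length())
                     .append(nextIsNumeric ? "" : ",").append("})");
            } else if (isNumeric(token)) {
                regex.append("(\\d{").append(token.length()).append("})");
            } else if (c == 'M') {
                regex.append("(\\p{L}+)");           // month name: letters only
            } else {
                regex.append(Pattern.quote(token));  // literal separator text
            }
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches(toRegex("yyyyMMdd"), "20191115"));        // prints: true
        System.out.println(Pattern.matches(toRegex("yyyyMMdd"), "120191115"));       // prints: false
        System.out.println(Pattern.matches(toRegex("yyyy-MMM-dd"), "12019-Nov-15")); // prints: true
    }
}
```

With this rule, yyyyMMdd rejects a nine-digit input, while yyyy-MMM-dd still accepts a five-digit year because the separator makes the field boundary unambiguous.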

evdubs commented 4 years ago

Seems reasonable to me.

97jaz commented 4 years ago

Fixed by #42