AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

Valid ISO8601 date not parsed #279

Closed ansell closed 5 years ago

ansell commented 5 years ago

This date is not recognised as a valid ISO8601 date on a record, and the record is then labelled as not having a date:

2013-11-06T19:59:14.961

Replacing the ad-hoc parsing currently used with a full, standard ISO8601 date parser, as already provided with JVMs, would fix these issues in general for all data sources that follow the specification.
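As a rough illustration of the JVM-provided parsing mentioned above, here is a minimal sketch using `java.time` (Java 8+). The class and method names (`IsoLocalDateTimeSketch`, `parseLocal`) are hypothetical and not taken from biocache-store; the sketch only shows that the example date, which has no zone offset, is accepted by `LocalDateTime.parse`.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of biocache-store.
final class IsoLocalDateTimeSketch {

    // Parses an ISO 8601 local date-time (no offset), returning empty on failure
    // instead of silently labelling the record as having no date.
    static Optional<LocalDateTime> parseLocal(String raw) {
        try {
            // LocalDateTime.parse uses DateTimeFormatter.ISO_LOCAL_DATE_TIME,
            // which allows optional seconds and fractional seconds.
            return Optional.of(LocalDateTime.parse(raw.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLocal("2013-11-06T19:59:14.961"));
        System.out.println(parseLocal("not a date"));
    }
}
```

Returning an `Optional` is just one design choice; a real implementation would probably also want to record why a value was rejected.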

ansell commented 5 years ago

Another example of an ISO8601 date that isn't being parsed currently is:

2018-09-19T08:50+1000

Using the standard Java 8 JVM date parser should fix all of the cases where the date is actually valid. Other corner cases, where there are slight deviations from the specification, can then be identified and parsed separately.
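One caveat worth checking: as far as the Javadoc describes it, the built-in `DateTimeFormatter.ISO_OFFSET_DATE_TIME` expects a colon in the offset (`+10:00`), so an offset written as `+1000` may need an explicit pattern. The sketch below builds a formatter that accepts both offset styles, or no offset at all. The names `FlexibleOffsetSketch`, `FLEXIBLE` and `parse` are hypothetical, and this is only one possible approach, not the fix that was applied to biocache-store.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.TemporalAccessor;

// Hypothetical helper, not part of biocache-store.
final class FlexibleOffsetSketch {

    // ISO local date-time followed by an optional offset, with or without a colon.
    static final DateTimeFormatter FLEXIBLE = new DateTimeFormatterBuilder()
            .append(DateTimeFormatter.ISO_LOCAL_DATE_TIME)
            .optionalStart().appendOffset("+HH:MM", "Z").optionalEnd()  // e.g. +10:00 or Z
            .optionalStart().appendOffset("+HHMM", "Z").optionalEnd()   // e.g. +1000
            .toFormatter();

    // Returns an OffsetDateTime when an offset is present, otherwise a LocalDateTime.
    // Throws DateTimeParseException if the text matches neither form.
    static TemporalAccessor parse(String raw) {
        return FLEXIBLE.parseBest(raw.trim(), OffsetDateTime::from, LocalDateTime::from);
    }

    public static void main(String[] args) {
        System.out.println(parse("2018-09-19T08:50+1000"));
        System.out.println(parse("2018-09-19T08:50+10:00"));
        System.out.println(parse("2013-11-06T19:59:14.961"));
    }
}
```

Using `parseBest` keeps the offset when the source supplies one, rather than discarding it or guessing a zone for values that have none.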

ansell commented 5 years ago

Instead of using an actual ISO 8601 Date parser, someone has taken the non-exhaustive list of example formats from the Darwin Core Terms specification and hardcoded support for those exact format strings using a custom parser.

This needs to be rewritten to support arbitrary valid ISO 8601 formats.

djtfmartin commented 5 years ago

I think it's possible the current parser pre-dates the inclusion of the better ISO parsing support in the JDK (think Java 6 or earlier). Either way, I'm all for replacing it with either standard Java or a separate dedicated open-source library. GBIF of course have their own parsing, perhaps we can just re-use this.

Mesibov commented 5 years ago

"GBIF of course have their own parsing, perhaps we can just re-use this."

Umm...

"Losses of date information were common and evidently due to processing rules written to deal with various date formats. In the modified field in the NZAC dataset, for example, GBIF successfully parsed 4765 entries in YYYY-MM-DDTHH:MM:SS+12:00 format, but deleted 97,327 entries in YYYY-MM-DDTHH:MM:SS.sss+12:00 format (95% data loss). This failure may explain why GBIF did not delete the earlier versions of the 1186 duplicated records (see Methods), as both the earlier and later versions of these records have modified entries in YYYY-MM-DDTHH:MM:SS.sss+12:00 format."

https://doi.org/10.3897/zookeys.751.24791
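For comparison, both of the formats named in the quoted passage should be accepted by the standard `java.time` parser. The sketch below is a small check of that; the timestamp values are made up for illustration (they are not from the NZAC dataset), and `ModifiedFieldSketch` and `parseModified` are hypothetical names.

```java
import java.time.OffsetDateTime;

// Hypothetical check, not part of biocache-store or GBIF code.
final class ModifiedFieldSketch {

    // OffsetDateTime.parse uses DateTimeFormatter.ISO_OFFSET_DATE_TIME,
    // where the fractional-second part is optional.
    static OffsetDateTime parseModified(String raw) {
        return OffsetDateTime.parse(raw.trim());
    }

    public static void main(String[] args) {
        // Illustrative values only: one without and one with milliseconds.
        OffsetDateTime plain = parseModified("2015-03-04T10:15:30+12:00");
        OffsetDateTime withMillis = parseModified("2015-03-04T10:15:30.123+12:00");
        System.out.println(plain.toInstant());
        System.out.println(withMillis.toInstant());
    }
}
```

Pairs like these could serve as regression tests, so that a value with fractional seconds is never dropped while its whole-second equivalent is kept.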

ansell commented 5 years ago

There is a pull request for this: https://github.com/AtlasOfLivingAustralia/biocache-store/pull/290 . I added regression tests for cases that the code previously allowed for but that didn't have tests, and added tests for the two ISO date forms that were issues in my two recent data resource loads.