biolink / ontobio

python library for working with ontologies and ontology associations
https://ontobio.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
119 stars 30 forks source link

Pass normalized date to GoAssoc for #493 #502

Closed dustine32 closed 3 years ago

dustine32 commented 3 years ago

Uses the date variable, normalized with date format (YYYYMMDD), for internal GORULE tests rather than just passing through raw column value gpad_line[8]

dustine32 commented 3 years ago

Actually, this doesn't fix for GPAD 2.0 since it's not changed in from_2_0. Also, normalizing this format (YYYY-MM-DD) will transform to incorrect format (YYYYMMDD) according to 2.0 spec.

More work needed to fix.

dustine32 commented 3 years ago

I believe https://github.com/biolink/ontobio/pull/502/commits/62327aed7835af72775ac0a6809037ebca7f22df should fix #493. I also fleshed out the test to trigger the GPAD 2.0 parse code.

Even though it doesn't affect this issue, I'm leaving the gpadparser.py change in since I assume it was intended to set the normalized date in the GoAssociation object. @dougli1sqrd What do you think?

dougli1sqrd commented 3 years ago

So the best way to solve this will be to figure out what the "truest" representation of the Date should be, and always parse from an annotation file into the real Date, in the GoAssociation. The gorule should be able to trust and rely completely on the correctness of the date format.

We could use the builtin datetime formats for this, but I fear those could be slow. In the past parsing dates was slow, and why we deferred to this semi-hard-coded string-as-array method.

Maybe for speed and expressiveness we could make a collections.namedtuple with fields like day, month, year. Although we may have to support some sub spec of ISO 8601. Maybe there's some optimization we can make that is easy to convert from a local domain type into the builtin datetime side? They definitely support ISO 8601, so probably that's where we ultimately want to go.

@kltm Where was the ticket (was there one?) describing the ISO 8601 date possible requirements for the future?

dougli1sqrd commented 3 years ago

Talking with Seth yesterday, I'll record a summary here of what we talked about.

When parsing the date from an incoming annotation file, we need to validate it, and perhaps upgrade it to the right format (YYYY-MM-DD). But during the imports, an optional THH:MM may be appended. Essentially this should be passed through for use by making GO-CAMs.

The key requirement here is that after parse, the date must be in the standardized YYYY-MM-DD[THH:MM] format. For parsing purposes though, we only care about the date portion.

So, since each annotation format essentially has its own date format (or rather, since we have more than one now, we can't assume the same format anymore), we should have a each to_association associated with a format do the correct conversion, with the assumption first that it's correct, and then falling into the builtin datetime parsing (like we do currently with gaf dates) if the simple way doesn't work.

I see us as having two choices as to what to parse into 1) retain the raw string and just pass it through 2) parse into a lightweight type that gives us some type documentation.

I think using 2 would be pretty straightforward, and would enhance readability and express intent. namedtuple is a good choice here since it really is just a tuple, which are fast in python and isn't the same as creating a full class. We have had issues in the past with slow parsing of dates in the parser before. namedtuple is really just a thin wrapper around a regular tuple so has builtin speed this way.

We could make a namedtuple AssociationDate with fields year, month, day, time. In the ideal cases where the incoming date strings are well formed, we can just use array indexing to find these fields (essentially like already happens). Then falling through to the builtin datetime parsers, to grab the right fields. The time element should just be sliced off the end, like date[10:0] which will grab the 10 index and after.

dougli1sqrd commented 3 years ago

I think this is good with my update. Unless there's anything else you want to add @dustine32, I think we should merge this.

dustine32 commented 3 years ago

@dougli1sqrd Agreed! I just now pulled, tested and everything seems to work fine. Thanks again for fixing this!