biolink / ontobio

python library for working with ontologies and ontology associations
https://ontobio.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
120 stars 30 forks source link

Annotation properties should allow multiple values of the same key #513

Closed dustine32 closed 3 years ago

dustine32 commented 3 years ago

For this GPAD 2.0 line:

MGI:MGI:97591           RO:0001025      GO:0005929      MGI:MGI:3654822|PMID:16687649   ECO:0000314        2013-02-06      MGI     BFO:0000050(CL:0000019) creation-date=2008-09-25|modification-date=2013-02-06|contributor-id=http://orcid.org/0000-0003-2689-5511|contributor-id=http://orcid.org/0000-0002-9796-7693

Where annotation properties column (col 12) contains two contributor-id key-value pairs:

contributor-id=http://orcid.org/0000-0003-2689-5511|contributor-id=http://orcid.org/0000-0002-9796-7693

Parsers, writers, and the GoAssociation model should support both of these contributor-id values. But currently we just construct a dictionary by iterating through the properties, overwriting any previously encountered keys (e.g. contributor-id, comment): https://github.com/biolink/ontobio/blob/de97ea7a59ff3d7d3305e941742b5790b264c25f/ontobio/io/gpadparser.py#L462 This is kind of like:

properties["contributor-id"] = "http://orcid.org/0000-0003-2689-5511"
properties["contributor-id"] = "http://orcid.org/0000-0002-9796-7693"

As @dougli1sqrd suggested, we could switch this dictionary to just be a list of key-value tuples, similar to a few lines before this: https://github.com/biolink/ontobio/blob/de97ea7a59ff3d7d3305e941742b5790b264c25f/ontobio/io/gpadparser.py#L446 But now like:

properties_list = [tuple(prop.split("=", maxsplit=1)) for prop in gpad_line[11].split("|") if prop]

And this would retain the multiple contributer-id data:

[
   ('creation-date', '2008-09-25'), 
   ('modification-date', '2013-02-06'), 
   ('contributor-id', 'http://orcid.org/0000-0003-2689-5511'), 
   ('contributor-id', 'http://orcid.org/0000-0002-9796-7693')
]

Of course, any code that uses annotation properties (e.g. GPAD writers, gocamgen) will have to adjust to the new DS.

dougli1sqrd commented 3 years ago

This is great!

I was initially thinking we could even just use the original line you linked of mine with the list comprehension, but using tuple explicitly and and ensuring we only split once are great improvements.

Turning into a string again will also still be super easy. The only thing that has to change in GoAssociation is the type signature I think?