biolink / ontobio

python library for working with ontologies and ontology associations
https://ontobio.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
119 stars 30 forks source link

Taxon Curie objects are being leaked into the GPAD output #489

Closed dustine32 closed 3 years ago

dustine32 commented 3 years ago

Noticed this in goa_human.gpad from 2020-11-17 release:

UniProtKB       A0A087WT01      involved_in     GO:0050830      PMID:12682246   ECO:0000270             [Curie(namespace='NCBITaxon', identity='1280')] 20200113        UniProt

The interacting_taxon column value is the Curie object ([Curie(namespace='NCBITaxon', identity='1280')]) instead of either NCBITaxon:1280 or taxon:1280. Fortunately, this same line in the GAF looks to be rendered correctly (taxon:9606|taxon:1280):

UniProtKB       A0A087WT01      TRAV27          GO:0050830      PMID:12682246   IEP             P       T cell receptor alpha variable 27       TRAV27  protein taxon:9606|taxon:1280   20200113        UniProt

Related to #477. I think it's due to this line string-ifying a list of Curie's rather than a single Curie.

@pgaudet @kltm @cmungall This will need to be fixed and tested before the next release.

dougli1sqrd commented 3 years ago

Oh interesting. Yeah, somewhere we must be placing a list of one item into a GoAssociation.interacting_taxon rather than a single item or nothing (None).

dougli1sqrd commented 3 years ago

Downloading the release goa_human-src.gaf, I tried to duplicate the bug:

$ ontobio-parse-assocs.py -r go-ontology.json --report-md rep.md -f goa_human-src.gaf -o goa_human.gpad -v convert -t gpad
$ cat goa_human.gpad | grep Curie  | wc -l
0

And this is using master and v2.1.0 (which is the current ontobio release)

Whereas taking snapshot or release goa_human.gpad:

$ cat goa_human.snap.gpad | grep Curie | wc -l
526
dougli1sqrd commented 3 years ago

So, tracing a messed up line in snapshot, locally I see:

$ cat target/groups/goa/goa_human.gpad | grep 'PMID:16888103' | grep '33892'
UniProtKB   A1A4Y4  involved_in GO:0042742  PMID:16888103   ECO:0000315     NCBITaxon:33892 20170426    UniProt occurs_in(CL:0000235)   
UniProtKB   A1A4Y4  involved_in GO:0075044  PMID:16888103   ECO:0000315     NCBITaxon:33892 20170426    UniProt occurs_in(CL:0000235)

And on most recent completed snapshot:

UniProtKB   A1A4Y4  involved_in GO:0042742  PMID:16888103   ECO:0000315     [Curie(namespace='NCBITaxon', identity='33892')]    20170426    UniProt occurs_in(CL:0000235)   
UniProtKB   A1A4Y4  involved_in GO:0075044  PMID:16888103   ECO:0000315     [Curie(namespace='NCBITaxon', identity='33892')]    20170426    UniProt occurs_in(CL:0000235)

So these are the same lines, being passed through correctly locally on ontobio 2.1.0, but messing up in http://snapshot.geneontology.org/annotations/goa_human.gpad.gz

dougli1sqrd commented 3 years ago

We have a mechanism!

So in the pipeline after the Mega Step (where do do our standard production of GAFs and GPADs, etc) we have a step that re-downloads gaf and gpad sources produced in the mega step ("Temporary post filter" in the pipeline) and runs a gaferencer + GO Rule set over the lot to properly get gorule-0000013 set for paint files, etc. These are reuploaded to skyhook, which then later become the snapshot or release.

This step, contrary to the Mega Step, uses both the GPAD and GAF parser rather than just the GAF parser.

So this is being introduced into GPAD by the gpad parser incorrectly assigning a list of values to the GoAssociation objects. This list is then string-ified, getting our bad data.

================================================================ In the weeds explanation:

GoAssociation has a field for interacting_taxon which takes a Curie value. When parsing a GAF file, we split up the taxon field into the primary taxon and the interacting_taxon. The primary gets set onto the object field representing the GO term used in the annotation. If there is a second taxon, then it is set on the interacting_taxon field, otherwise it is set to None.

When converting a GAF line into a GoAssociation and validating the incoming taxon column we had to take care to validate each taxon element coming in, and since there were up to two, this was stored as a list once valided and parsed.

The GAF parser properly takes the second element of the list if it has one and then sets it as the interacting_taxon field in the final GoAssociation object.

The mistake in GPAD was not taking care that this was a list of values. In GPAD the confusion likely arose because there can be only up to one value in the taxon field, representing only the interacting taxon. So after confirming that the taxon field parses and validates the entire value of the parsed output is set as the GoAssociation interacting_taxon. But as this validation function is generalized for both GPAD and GAF, the output is still a list. So a valid interacting taxon is output as a list of a single value.

The fix will be extremely simple: just ensure we're grabbing the one valid item in the list, instead of the whole list itself.

kltm commented 3 years ago

Okay, confirming with @dustine32 and @dougli1sqrd that it looks like the fix has gone through (skyhook check) and this can be closed.

dougli1sqrd commented 3 years ago
$ curl -L http://snapshot.geneontology.org/annotations/goa_human.gpad.gz | gzip -dcf | grep Curie | wc -l
0
dustine32 commented 3 years ago

@pgaudet @kltm To test if the [Curie(namespace='NCBITaxon', identity='33892')] problem is fixed, I just ran @dougli1sqrd's cmd on the release candidate:

$ curl -L http://skyhook.berkeleybop.org/release/annotations/goa_human.gpad.gz | gzip -dcf | grep Curie | wc -l
0

Zero lines means there surely wouldn't be any instances of [Curie(namespace='NCBITaxon', identity='33892')] and spot-checking our example confirms the interacting taxon field is good:

UniProtKB   A1A4Y4  involved_in GO:0042742  PMID:16888103   ECO:0000315     NCBITaxon:33892 20170426    UniProt occurs_in(CL:0000235)

So this issue is fixed on release.