Closed dustine32 closed 3 years ago
Oh interesting. Yeah, somewhere we must be placing a list of one item into a GoAssociation.interacting_taxon
rather than a single item or nothing (None
).
Downloading the release goa_human-src.gaf
, I tried to duplicate the bug:
$ ontobio-parse-assocs.py -r go-ontology.json --report-md rep.md -f goa_human-src.gaf -o goa_human.gpad -v convert -t gpad
$ cat goa_human.gpad | grep Curie | wc -l
0
And this is using master and v2.1.0 (which is the current ontobio release)
Whereas taking snapshot or release goa_human.gpad:
$ cat goa_human.snap.gpad | grep Curie | wc -l
526
So, tracing a messed up line in snapshot, locally I see:
$ cat target/groups/goa/goa_human.gpad | grep 'PMID:16888103' | grep '33892'
UniProtKB A1A4Y4 involved_in GO:0042742 PMID:16888103 ECO:0000315 NCBITaxon:33892 20170426 UniProt occurs_in(CL:0000235)
UniProtKB A1A4Y4 involved_in GO:0075044 PMID:16888103 ECO:0000315 NCBITaxon:33892 20170426 UniProt occurs_in(CL:0000235)
And on most recent completed snapshot:
UniProtKB A1A4Y4 involved_in GO:0042742 PMID:16888103 ECO:0000315 [Curie(namespace='NCBITaxon', identity='33892')] 20170426 UniProt occurs_in(CL:0000235)
UniProtKB A1A4Y4 involved_in GO:0075044 PMID:16888103 ECO:0000315 [Curie(namespace='NCBITaxon', identity='33892')] 20170426 UniProt occurs_in(CL:0000235)
So these are the same lines, being passed through correctly locally on ontobio 2.1.0, but messing up in http://snapshot.geneontology.org/annotations/goa_human.gpad.gz
We have a mechanism!
So in the pipeline after the Mega Step (where do do our standard production of GAFs and GPADs, etc) we have a step that re-downloads gaf and gpad sources produced in the mega step ("Temporary post filter" in the pipeline) and runs a gaferencer + GO Rule set over the lot to properly get gorule-0000013 set for paint files, etc. These are reuploaded to skyhook, which then later become the snapshot or release.
This step, contrary to the Mega Step, uses both the GPAD and GAF parser rather than just the GAF parser.
So this is being introduced into GPAD by the gpad parser incorrectly assigning a list of values to the GoAssociation
objects. This list is then string-ified, getting our bad data.
================================================================ In the weeds explanation:
GoAssociation
has a field for interacting_taxon
which takes a Curie
value. When parsing a GAF file, we split up the taxon field into the primary taxon and the interacting_taxon
. The primary gets set onto the object
field representing the GO term used in the annotation. If there is a second taxon, then it is set on the interacting_taxon
field, otherwise it is set to None.
When converting a GAF line into a GoAssociation and validating the incoming taxon column we had to take care to validate each taxon element coming in, and since there were up to two, this was stored as a list once valided and parsed.
The GAF parser properly takes the second element of the list if it has one and then sets it as the interacting_taxon
field in the final GoAssociation
object.
The mistake in GPAD was not taking care that this was a list of values. In GPAD the confusion likely arose because there can be only up to one value in the taxon field, representing only the interacting taxon. So after confirming that the taxon field parses and validates the entire value of the parsed output is set as the GoAssociation
interacting_taxon
. But as this validation function is generalized for both GPAD and GAF, the output is still a list. So a valid interacting taxon is output as a list of a single value.
The fix will be extremely simple: just ensure we're grabbing the one valid item in the list, instead of the whole list itself.
Okay, confirming with @dustine32 and @dougli1sqrd that it looks like the fix has gone through (skyhook check) and this can be closed.
$ curl -L http://snapshot.geneontology.org/annotations/goa_human.gpad.gz | gzip -dcf | grep Curie | wc -l
0
@pgaudet @kltm To test if the [Curie(namespace='NCBITaxon', identity='33892')]
problem is fixed, I just ran @dougli1sqrd's cmd on the release candidate:
$ curl -L http://skyhook.berkeleybop.org/release/annotations/goa_human.gpad.gz | gzip -dcf | grep Curie | wc -l
0
Zero lines means there surely wouldn't be any instances of [Curie(namespace='NCBITaxon', identity='33892')]
and spot-checking our example confirms the interacting taxon field is good:
UniProtKB A1A4Y4 involved_in GO:0042742 PMID:16888103 ECO:0000315 NCBITaxon:33892 20170426 UniProt occurs_in(CL:0000235)
So this issue is fixed on release.
Noticed this in
goa_human.gpad
from 2020-11-17 release:The
interacting_taxon
column value is the Curie object ([Curie(namespace='NCBITaxon', identity='1280')]
) instead of eitherNCBITaxon:1280
ortaxon:1280
. Fortunately, this same line in the GAF looks to be rendered correctly (taxon:9606|taxon:1280
):Related to #477. I think it's due to this line string-ifying a list of
Curie
's rather than a singleCurie
.@pgaudet @kltm @cmungall This will need to be fixed and tested before the next release.