Closed tonysawfordebi closed 8 months ago
Thanks @tonysawfordebi
@ukemi and I will take a look with @balhoff to work on addressing these issues in the GPAD output.
Note that there is more to the syntax check than just rules for the GPAD output.
If needed, I will make tickets elsewhere to address other issues.
@tonysawfordebi Thanks again for sharing the syntax check for the Noctua GPAD files.
David, Jim, and I went over the report and will work on implementing additional checks to catch some of these errors.
Wrt the 'unsupported qualifier', the list of new qualifiers that will be allowed will be:
acts upstream of or within (RO:0002264) acts upstream of or within, positive effect (RO:0004032) acts upstream of or within, negative effect (RO:0004033) acts upstream of (RO:0002263) acts upstream of, positive effect (RO:0004034) acts upstream of, negative effect (RO:0004035)
One thought we had when discussing the new qualifiers was whether it might actually be better to use RO identifiers in Column 3 (and also in annotation extensions), rather than the term label with underscores. Annotation extensions would also require using some GOREL ids, as well. We thought this might potentially avoid having strings like this in Column 3:
acts_upstream_of_or_within,_positive_effect
Any thoughts on that proposal?
Tagging @cmungall here, too, to get his thoughts.
Thx.
Kinda makes sense - we use GO & ECO IDs rather than labels, so it would at least be consistent with those. It wouldn't, however, be consistent with how annotation extensions are formulated.
Also, would you then have things like "NOT|RO:nnnn"?
Oops - just re-read your comment, @vanaukenk, and saw that you'd already highlighted the extension thing.
We were also wondering if these were not allowed because of the underscores or if they were not incorporated into the rules yet. In MGI we take them exactly as they are output in the GPAD file. I simple added them as allowed values with a little tweak for the inclusion of the comma.
Yes, we also thought it was more consistent with using GO and ECO IDs.
If we did this for extensions then, yes, they would go from:
happens_during(WBls:0000109),occurs_in(WBbt:0005190)
to:
RO:0002092(WBls:0000109),BFO:0000066(WBbt:0005190)
and yes, wrt Column 3 we might then have things like:
"NOT|RO:nnnn"
unless @cmungall has another thought about that.
In GPAD the relations are already strings. We could increment the version?
To a developer, acts_upstream_of_or_within,_positive_effect
looks a bit odd, we have the underscores to make it more tokenizable friendly but then we add a ,
...
The underscore addition is just a dumb replacement on spaces in the label. I can also strip commas; discussed with @ukemi and @vanaukenk and just want to make sure someone isn't already depending on the commas (I think MGI would need to update). But I would much prefer using IDs if we can move to that, especially since the file already uses GO IDs instead of labels.
I've already given our software group that the commas may be going. So as long as I give them the heads-up, we are ok with it.
+1 for incrementing the gpad version and using a numerical relation ID wherever possible
If we're going to go down the numerical ID route (which I think is probably A Good Thing, at least as far as the files are concerned), how does this fit with 553? Should we abandon the minor-minor-version increment and introduce gaf-version: 2.2 alongside gpa-version: 1.2 to indicate that a) additional relations are now valid in the qualifier column, and b) all relations are now identified by IDs rather than labels?
Should we add discussion of this to the agenda for the NY meeting?
Tagging @alexsign so that he's aware of this thread...
@ukemi - (assuming that your comment was directed at me...) the reason that our parser throws the "unsupported qualifier" error is simply because it hasn't yet been updated to recognise the extended set of valid gene product-to-GO term relations.
We've been holding off doing anything until we had an agreed and approved list (and the rules that govern their usage), but it looks like we're pretty much at that stage now, so we can plough on with updating our parser & everything else that this affects.
If we are going to go down the numerical route, I will need to let our software group know ASAP. We have already almost implemented the loads using the current specs with the term strings. I suspect this change will require a bit of work for us translating the numerical identifiers to meaningful strings to display on our web pages, especially for annotation extensions.
Absolutely, @ukemi - it will cause MAHOOSIVE upheaval for us, and is not something that we'd be able to roll out quickly, but I do think that it's probably worth doing.
Based on what has been said above, plus earlier discussions somewhere else (can't remember where), I reckon that this is the set of relations that we might expect to see in the qualifier column in future (GO Class is the class of GO terms with which the relation is valid):
ID | Label | GO Class |
---|---|---|
RO:0002327 | enables | GO:0003674 |
RO:0002331 | involved_in | GO:0008150 |
BFO:0000050 | part_of | GO:0005575 |
RO:0002325 | colocalizes_with | GO:0005575 |
RO:0002326 | contributes_to | GO:0003674 |
RO:0002263 | acts_upstream_of | GO:0008150 |
RO:0004034 | acts_upstream_of_positive_effect | GO:0008150 |
RO:0004035 | acts_upstream_of_negative_effect | GO:0008150 |
RO:0002264 | acts_upstream_of_or_within | GO:0008150 |
RO:0004032 | acts_upstream_of_or_within_positive_effect | GO:0008150 |
RO:0004033 | acts_upstream_of_or_within_negative_effect | GO:0008150 |
RO:0002428 | involved_in_regulation_of | GO:0008150 |
RO:0002429 | involved_in_positive_regulation_of | GO:0008150 |
RO:0002430 | involved_in_negative_regulation_of | GO:0008150 |
RO:0002431 | involved_in_or_involved_in_regulation_of | GO:0008150 |
How does that all look? Any glaring errors or omissions? And, more to the point, is there anything in that list that now shouldn't be?
We will still have 'Not'.
But won't that just be a qualifier that will be prepended to the "real" relation? In other words, no annotation should have a value in this column that is just "NOT"; it will always be "NOT|some_relation" - isn't that the direction we're heading in?
Yes. I believe that is correct.
@tonysawfordebi We reviewed the relations list on Wednesday's GO-CAM/Noctua call. The last four relations wrt regulation will not be included, and there is an additional gp2CC relation, is active in, that will be included. Here's a revised version of your table from above:
ID | Label | GO Class |
---|---|---|
RO:0002327 | enables | GO:0003674 |
RO:0002331 | involved_in | GO:0008150 |
BFO:0000050 | part_of | GO:0005575 |
RO:0002432 | is_active_in | GO:0005575 |
RO:0002325 | colocalizes_with | GO:0005575 |
RO:0002326 | contributes_to | GO:0003674 |
RO:0002263 | acts_upstream_of | GO:0008150 |
RO:0004034 | acts_upstream_of_positive_effect | GO:0008150 |
RO:0004035 | acts_upstream_of_negative_effect | GO:0008150 |
RO:0002264 | acts_upstream_of_or_within | GO:0008150 |
RO:0004032 | acts_upstream_of_or_within_positive_effect | GO:0008150 |
RO:0004033 | acts_upstream_of_or_within_negative_effect | GO:0008150 |
Please let me know if you have any questions/concerns. Thx.
Thanks, @vanaukenk
I just grabbed the latest set of GPADs and ran them through our checker, and there still seem to be quite a few issues; see the attached log file for details.
@tonysawfordebi how are the valid ranges determined? For example the second error is about a tissue development
that has output
a white adipose tissue
. That sounds good to me.
@balhoff - in gorel there are property values that specify domain and range. For example, has_output has these property values:
property_value: local_domain BFO:0000015 xsd:string property_value: local_range "CHEBI:24431 GO:0032991 MI:0315 PR:000000001 SO:0000673 SO:0000704 CHEBI:33697" xsd:string
So UBERON or CL terms are not included in the local_range for has_output.
These are checks that I think we should be trying to catch upstream of Tony retrieving the file, either within Noctua or as part of the QA/QC on the overall annotation file pipeline.
Perhaps in some of these cases, the correct relation to use would be 'results in development of'. @ukemi would need to take a look at these models again to see if that's the intended meaning.
@vanaukenk is there any "how to use this relation" documentation for the new set of qualifier relations? I'd like to include some documentation links in P2G, but all I've found so far is the purls (e.g., http://purl.obolibrary.org/obo/RO_0004034) which lead to the formal definitions of the relations, but aren't exactly curator-friendly...
@tonysawfordebi I think the safest place to link right now for documentation is this category page on the GO wiki: http://wiki.geneontology.org/index.php/Category:Gene_Product_to_Term_Relations
Thanks @vanaukenk
Has something change in the past few days regarding the way that GPADs are exported from Noctua? For example, uniprotkb.gpad has shrunk from around 4MB to about 300K, and the number of annotations attributed to SynGO has plummeted from c. 15000 to c. 1000.
Jul 18 https://build.berkeleybop.org/job/export-lego-to-gpad-sparql/634/artifact/legacy/
Jul 20 https://build.berkeleybop.org/job/export-lego-to-gpad-sparql/643/artifact/legacy/
This is consistent with changes @dustine32 made to map SynGO mouse to MGI IDs. MGI has gone up considerably (cc @ukemi), reflecting the fact that mouse syngo annotations go here. Similar for RGD.
Side note: we need to coordinate with RGD to have these RGD annotations picked up
Thanks @cmungall !
@hdrabkin noticed that there are annotations coming through in gafs that have the new qualifiers. I am tagging him here to explain their source etc.
So we load gafs fro GOA (GOA mouse or GOA-human); any annotations with the new qualifiers are being rejected, as they are NOT expected in the gafs at this time. At the moment all of them are SynGO annotations. Our mgi.gpad we get from Noctua has not gone up, but I see that we haven't had a new load file since 2018-07-19 (thursday).
@hdrabkin There's an issue with our species-specific export pipelines at the moment that is causing GPAD-specific qualifier relations to appear in GAFs. We'll squash that bug as soon as we find it.
@cmungall Ta for the explanation. I'll modify our SynGO import pipelines so that it consumes mgi.gpad as well as uniprotkb.gpad.
cc @alexsign
@hdrabkin I've now fixed the issue regarding the rogue qualifiers and generated a new set of species-specific files, which are all available from the GOA ftp site (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa)
@tonysawfordebi and @hdrabkin, If you aren't consuming the MGI GPAD already, I'm not sure you need to change your pipeline. If I understand correctly, the murine SynGO annotations that are coming through GO-CAM models are being added to the MGI GPAD file, which we pick up and load at MGI. Those annotations will be exported in our gaf. The thing that @hdrabkin needs to check is that the attributions are coming through correctly.
The thing is, @ukemi, because of the way that our pipelines (and database) work, when we load the MGI file, we only process annotations that are attributed to MGI. I've also subsequently discovered that there are SynGO annotations in the RGD, FB, and WB Noctua GPADs, so pretty much the easiest thing all round is to slurp up all of the Noctua GPADs and extract all the SynGO goodness from all of them.
BTW, I haven't yet switched to using the MGI GPAD rather than the GAF, because we see fewer errors (and hence end up with more annotations) from the GAF than the GPAD. Would you be interested in seeing the reports from our checker for both files?
I'd be particularly interested in seeing what annotations are in the gaf that are being filtered out of the gpad file. Based on the way we created our gpad, the files should be basically the same unless there is a systematic error we missed.
OK - give me a few minutes and I'll generate some reports
I'm having problems running the checker over the GAF at the moment, but in the meantime here's the log file from the GPAD check: syntax_check-GPAD.log
Thanks Tony Now I wonder if the syngo that are now appearing in our Noctua GPAD are duplicated in your gaf? We will see on the next GOA gaf loads this weekend.
Harold
From: Tony Sawford notifications@github.com Reply-To: geneontology/go-site reply@reply.github.com Date: Tuesday, July 24, 2018 at 4:52 AM To: geneontology/go-site go-site@noreply.github.com Cc: me Harold.Drabkin@jax.org, Mention mention@noreply.github.com Subject: Re: [geneontology/go-site] Issues with GPADs generated from Noctua models (#613)
@hdrabkinhttps://github.com/hdrabkin I've now fixed the issue regarding the rogue qualifiers and generated a new set of species-specific files, which are all available from the GOA ftp site (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa)
The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.
OK, the (EBI load balancer-related) problems have been resolved, which means I've been able to run the MGI GAF through our checker. Here's what it had to say: syntax_check-GAF.log
@hdrabkin, we need to identify which annotations are failing the checker in GPAD that are not failing in GAF and why.
I don't think we need to worry about the NDs that give 'unmapped identifiers' ; These have no UniProt mappings at MGI either; mostly refseqs. annotations using secondary ids will usually go away on the next GO load, unless they are in a qualifier, which we have (??) a check for.
I'll have a crack at working out which annotations we load from the GAF but not from the GPAD, but it's not going to be a 5-minute job. I'll probably try loading the GPAD into our test database and then do a comparison with what's in our production db (which is still populated from the GAF). Will take a while to set up, so probably won't get round to it until tomorrow.
@hdrabkin I'm not concerned about unmapped identifier errors - it's more about why is a GAF-format annotation OK, and its GPAD counterpart not?
In fact, so far I can't find any of these eg, GO:0005605; not in our system as used in the annotation or in inferred_from field. What we don't have a trigger for is if it's in an extension; these need to be hunted down example GO:0005605 in two genes somewhere; but not the gene that is flagged in your report (Hspg2); I get a hit for an extension for Grpr and Oprm1; however, although my search says these two, I can't find the id in any extension! ARGH
(btw: the GAF we consume is this one: http://www.geneontology.org/gene-associations/gene_association.mgi.gz)
The GO:0005605 is a recent merge (last week??); it is possible that has NOT made it to the public GO site I see by its header it is 06/29/2018; VERY old; supposedly a gaf is picked up daily from our FTP site, but the release to geneontology.org might be monthly?
Found them: they are coming in through our GOA mouse load as an occurs_in extension. These need to be fixed at GOA_IntAct or we keep re-loading
I just verified that the SynGO annotations are now in our Noctua gpad; we now have 4800+ annotations to 600 genes via Noctua
I've just grabbed the latest batch of Noctua-generated GPADs from https://build.berkeleybop.org/job/export-lego-to-gpad-sparql/lastSuccessfulBuild/artifact/legacy/ and run them through our syntax checker.
The resulting log file is attached to this ticket.
syntax_check.log