geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License
46 stars 89 forks source link

Issues with GPADs generated from Noctua models #613

Closed tonysawfordebi closed 8 months ago

tonysawfordebi commented 6 years ago

I've just grabbed the latest batch of Noctua-generated GPADs from https://build.berkeleybop.org/job/export-lego-to-gpad-sparql/lastSuccessfulBuild/artifact/legacy/ and run them through our syntax checker.

The resulting log file is attached to this ticket.

syntax_check.log

vanaukenk commented 6 years ago

Thanks @tonysawfordebi

@ukemi and I will take a look with @balhoff to work on addressing these issues in the GPAD output.

vanaukenk commented 6 years ago

Note that there is more to the syntax check than just rules for the GPAD output.
If needed, I will make tickets elsewhere to address other issues.

vanaukenk commented 6 years ago

@tonysawfordebi Thanks again for sharing the syntax check for the Noctua GPAD files.

David, Jim, and I went over the report and will work on implementing additional checks to catch some of these errors.

Wrt the 'unsupported qualifier', the list of new qualifiers that will be allowed will be:

acts upstream of or within (RO:0002264) acts upstream of or within, positive effect (RO:0004032) acts upstream of or within, negative effect (RO:0004033) acts upstream of (RO:0002263) acts upstream of, positive effect (RO:0004034) acts upstream of, negative effect (RO:0004035)

One thought we had when discussing the new qualifiers was whether it might actually be better to use RO identifiers in Column 3 (and also in annotation extensions), rather than the term label with underscores. Annotation extensions would also require using some GOREL ids, as well. We thought this might potentially avoid having strings like this in Column 3:

acts_upstream_of_or_within,_positive_effect

Any thoughts on that proposal?

Tagging @cmungall here, too, to get his thoughts.

Thx.

tonysawfordebi commented 6 years ago

Kinda makes sense - we use GO & ECO IDs rather than labels, so it would at least be consistent with those. It wouldn't, however, be consistent with how annotation extensions are formulated.

Also, would you then have things like "NOT|RO:nnnn"?

tonysawfordebi commented 6 years ago

Oops - just re-read your comment, @vanaukenk, and saw that you'd already highlighted the extension thing.

ukemi commented 6 years ago

We were also wondering if these were not allowed because of the underscores or if they were not incorporated into the rules yet. In MGI we take them exactly as they are output in the GPAD file. I simple added them as allowed values with a little tweak for the inclusion of the comma.

vanaukenk commented 6 years ago

Yes, we also thought it was more consistent with using GO and ECO IDs.

If we did this for extensions then, yes, they would go from:

happens_during(WBls:0000109),occurs_in(WBbt:0005190)

to:

RO:0002092(WBls:0000109),BFO:0000066(WBbt:0005190)

and yes, wrt Column 3 we might then have things like:

"NOT|RO:nnnn"

unless @cmungall has another thought about that.

cmungall commented 6 years ago

In GPAD the relations are already strings. We could increment the version?

To a developer, acts_upstream_of_or_within,_positive_effect looks a bit odd, we have the underscores to make it more tokenizable friendly but then we add a ,...

balhoff commented 6 years ago

The underscore addition is just a dumb replacement on spaces in the label. I can also strip commas; discussed with @ukemi and @vanaukenk and just want to make sure someone isn't already depending on the commas (I think MGI would need to update). But I would much prefer using IDs if we can move to that, especially since the file already uses GO IDs instead of labels.

ukemi commented 6 years ago

I've already given our software group that the commas may be going. So as long as I give them the heads-up, we are ok with it.

vanaukenk commented 6 years ago

+1 for incrementing the gpad version and using a numerical relation ID wherever possible

tonysawfordebi commented 6 years ago

If we're going to go down the numerical ID route (which I think is probably A Good Thing, at least as far as the files are concerned), how does this fit with 553? Should we abandon the minor-minor-version increment and introduce gaf-version: 2.2 alongside gpa-version: 1.2 to indicate that a) additional relations are now valid in the qualifier column, and b) all relations are now identified by IDs rather than labels?

Should we add discussion of this to the agenda for the NY meeting?

Tagging @alexsign so that he's aware of this thread...

tonysawfordebi commented 6 years ago

@ukemi - (assuming that your comment was directed at me...) the reason that our parser throws the "unsupported qualifier" error is simply because it hasn't yet been updated to recognise the extended set of valid gene product-to-GO term relations.

We've been holding off doing anything until we had an agreed and approved list (and the rules that govern their usage), but it looks like we're pretty much at that stage now, so we can plough on with updating our parser & everything else that this affects.

ukemi commented 6 years ago

If we are going to go down the numerical route, I will need to let our software group know ASAP. We have already almost implemented the loads using the current specs with the term strings. I suspect this change will require a bit of work for us translating the numerical identifiers to meaningful strings to display on our web pages, especially for annotation extensions.

tonysawfordebi commented 6 years ago

Absolutely, @ukemi - it will cause MAHOOSIVE upheaval for us, and is not something that we'd be able to roll out quickly, but I do think that it's probably worth doing.

tonysawfordebi commented 6 years ago

Based on what has been said above, plus earlier discussions somewhere else (can't remember where), I reckon that this is the set of relations that we might expect to see in the qualifier column in future (GO Class is the class of GO terms with which the relation is valid):

ID Label GO Class
RO:0002327 enables GO:0003674
RO:0002331 involved_in GO:0008150
BFO:0000050 part_of GO:0005575
RO:0002325 colocalizes_with GO:0005575
RO:0002326 contributes_to GO:0003674
RO:0002263 acts_upstream_of GO:0008150
RO:0004034 acts_upstream_of_positive_effect GO:0008150
RO:0004035 acts_upstream_of_negative_effect GO:0008150
RO:0002264 acts_upstream_of_or_within GO:0008150
RO:0004032 acts_upstream_of_or_within_positive_effect GO:0008150
RO:0004033 acts_upstream_of_or_within_negative_effect GO:0008150
RO:0002428 involved_in_regulation_of GO:0008150
RO:0002429 involved_in_positive_regulation_of GO:0008150
RO:0002430 involved_in_negative_regulation_of GO:0008150
RO:0002431 involved_in_or_involved_in_regulation_of GO:0008150

How does that all look? Any glaring errors or omissions? And, more to the point, is there anything in that list that now shouldn't be?

ukemi commented 6 years ago

We will still have 'Not'.

tonysawfordebi commented 6 years ago

But won't that just be a qualifier that will be prepended to the "real" relation? In other words, no annotation should have a value in this column that is just "NOT"; it will always be "NOT|some_relation" - isn't that the direction we're heading in?

ukemi commented 6 years ago

Yes. I believe that is correct.

vanaukenk commented 6 years ago

@tonysawfordebi We reviewed the relations list on Wednesday's GO-CAM/Noctua call. The last four relations wrt regulation will not be included, and there is an additional gp2CC relation, is active in, that will be included. Here's a revised version of your table from above:

ID Label GO Class
RO:0002327 enables GO:0003674
RO:0002331 involved_in GO:0008150
BFO:0000050 part_of GO:0005575
RO:0002432 is_active_in GO:0005575
RO:0002325 colocalizes_with GO:0005575
RO:0002326 contributes_to GO:0003674
RO:0002263 acts_upstream_of GO:0008150
RO:0004034 acts_upstream_of_positive_effect GO:0008150
RO:0004035 acts_upstream_of_negative_effect GO:0008150
RO:0002264 acts_upstream_of_or_within GO:0008150
RO:0004032 acts_upstream_of_or_within_positive_effect GO:0008150
RO:0004033 acts_upstream_of_or_within_negative_effect GO:0008150

Please let me know if you have any questions/concerns. Thx.

tonysawfordebi commented 6 years ago

Thanks, @vanaukenk

tonysawfordebi commented 6 years ago

I just grabbed the latest set of GPADs and ran them through our checker, and there still seem to be quite a few issues; see the attached log file for details.

syntax_check_noctua_all.log

balhoff commented 6 years ago

@tonysawfordebi how are the valid ranges determined? For example the second error is about a tissue development that has output a white adipose tissue. That sounds good to me.

vanaukenk commented 6 years ago

@balhoff - in gorel there are property values that specify domain and range. For example, has_output has these property values:

property_value: local_domain BFO:0000015 xsd:string property_value: local_range "CHEBI:24431 GO:0032991 MI:0315 PR:000000001 SO:0000673 SO:0000704 CHEBI:33697" xsd:string

So UBERON or CL terms are not included in the local_range for has_output.

These are checks that I think we should be trying to catch upstream of Tony retrieving the file, either within Noctua or as part of the QA/QC on the overall annotation file pipeline.

vanaukenk commented 6 years ago

Perhaps in some of these cases, the correct relation to use would be 'results in development of'. @ukemi would need to take a look at these models again to see if that's the intended meaning.

tonysawfordebi commented 6 years ago

@vanaukenk is there any "how to use this relation" documentation for the new set of qualifier relations? I'd like to include some documentation links in P2G, but all I've found so far is the purls (e.g., http://purl.obolibrary.org/obo/RO_0004034) which lead to the formal definitions of the relations, but aren't exactly curator-friendly...

vanaukenk commented 6 years ago

@tonysawfordebi I think the safest place to link right now for documentation is this category page on the GO wiki: http://wiki.geneontology.org/index.php/Category:Gene_Product_to_Term_Relations

tonysawfordebi commented 6 years ago

Thanks @vanaukenk

tonysawfordebi commented 6 years ago

Has something change in the past few days regarding the way that GPADs are exported from Noctua? For example, uniprotkb.gpad has shrunk from around 4MB to about 300K, and the number of annotations attributed to SynGO has plummeted from c. 15000 to c. 1000.

cmungall commented 6 years ago

Jul 18 https://build.berkeleybop.org/job/export-lego-to-gpad-sparql/634/artifact/legacy/

image

Jul 20 https://build.berkeleybop.org/job/export-lego-to-gpad-sparql/643/artifact/legacy/

image

This is consistent with changes @dustine32 made to map SynGO mouse to MGI IDs. MGI has gone up considerably (cc @ukemi), reflecting the fact that mouse syngo annotations go here. Similar for RGD.

Side note: we need to coordinate with RGD to have these RGD annotations picked up

dustine32 commented 6 years ago

Thanks @cmungall !

ukemi commented 6 years ago

@hdrabkin noticed that there are annotations coming through in gafs that have the new qualifiers. I am tagging him here to explain their source etc.

hdrabkin commented 6 years ago

So we load gafs fro GOA (GOA mouse or GOA-human); any annotations with the new qualifiers are being rejected, as they are NOT expected in the gafs at this time. At the moment all of them are SynGO annotations. Our mgi.gpad we get from Noctua has not gone up, but I see that we haven't had a new load file since 2018-07-19 (thursday).

tonysawfordebi commented 6 years ago

@hdrabkin There's an issue with our species-specific export pipelines at the moment that is causing GPAD-specific qualifier relations to appear in GAFs. We'll squash that bug as soon as we find it.

@cmungall Ta for the explanation. I'll modify our SynGO import pipelines so that it consumes mgi.gpad as well as uniprotkb.gpad.

cc @alexsign

tonysawfordebi commented 6 years ago

@hdrabkin I've now fixed the issue regarding the rogue qualifiers and generated a new set of species-specific files, which are all available from the GOA ftp site (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa)

ukemi commented 6 years ago

@tonysawfordebi and @hdrabkin, If you aren't consuming the MGI GPAD already, I'm not sure you need to change your pipeline. If I understand correctly, the murine SynGO annotations that are coming through GO-CAM models are being added to the MGI GPAD file, which we pick up and load at MGI. Those annotations will be exported in our gaf. The thing that @hdrabkin needs to check is that the attributions are coming through correctly.

tonysawfordebi commented 6 years ago

The thing is, @ukemi, because of the way that our pipelines (and database) work, when we load the MGI file, we only process annotations that are attributed to MGI. I've also subsequently discovered that there are SynGO annotations in the RGD, FB, and WB Noctua GPADs, so pretty much the easiest thing all round is to slurp up all of the Noctua GPADs and extract all the SynGO goodness from all of them.

BTW, I haven't yet switched to using the MGI GPAD rather than the GAF, because we see fewer errors (and hence end up with more annotations) from the GAF than the GPAD. Would you be interested in seeing the reports from our checker for both files?

ukemi commented 6 years ago

I'd be particularly interested in seeing what annotations are in the gaf that are being filtered out of the gpad file. Based on the way we created our gpad, the files should be basically the same unless there is a systematic error we missed.

tonysawfordebi commented 6 years ago

OK - give me a few minutes and I'll generate some reports

tonysawfordebi commented 6 years ago

I'm having problems running the checker over the GAF at the moment, but in the meantime here's the log file from the GPAD check: syntax_check-GPAD.log

hdrabkin commented 6 years ago

Thanks Tony Now I wonder if the syngo that are now appearing in our Noctua GPAD are duplicated in your gaf? We will see on the next GOA gaf loads this weekend.

Harold

From: Tony Sawford notifications@github.com Reply-To: geneontology/go-site reply@reply.github.com Date: Tuesday, July 24, 2018 at 4:52 AM To: geneontology/go-site go-site@noreply.github.com Cc: me Harold.Drabkin@jax.org, Mention mention@noreply.github.com Subject: Re: [geneontology/go-site] Issues with GPADs generated from Noctua models (#613)

@hdrabkinhttps://github.com/hdrabkin I've now fixed the issue regarding the rogue qualifiers and generated a new set of species-specific files, which are all available from the GOA ftp site (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/geneontology/go-site/issues/613#issuecomment-407331977, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJ9NkPYtGkY_81E2L632QVRwmrfpfHs1ks5uJuBIgaJpZM4TYYTp.

The information in this email, including attachments, may be confidential and is intended solely for the addressee(s). If you believe you received this email by mistake, please notify the sender by return email as soon as possible.

tonysawfordebi commented 6 years ago

OK, the (EBI load balancer-related) problems have been resolved, which means I've been able to run the MGI GAF through our checker. Here's what it had to say: syntax_check-GAF.log

ukemi commented 6 years ago

@hdrabkin, we need to identify which annotations are failing the checker in GPAD that are not failing in GAF and why.

hdrabkin commented 6 years ago

I don't think we need to worry about the NDs that give 'unmapped identifiers' ; These have no UniProt mappings at MGI either; mostly refseqs. annotations using secondary ids will usually go away on the next GO load, unless they are in a qualifier, which we have (??) a check for.

tonysawfordebi commented 6 years ago

I'll have a crack at working out which annotations we load from the GAF but not from the GPAD, but it's not going to be a 5-minute job. I'll probably try loading the GPAD into our test database and then do a comparison with what's in our production db (which is still populated from the GAF). Will take a while to set up, so probably won't get round to it until tomorrow.

@hdrabkin I'm not concerned about unmapped identifier errors - it's more about why is a GAF-format annotation OK, and its GPAD counterpart not?

hdrabkin commented 6 years ago

In fact, so far I can't find any of these eg, GO:0005605; not in our system as used in the annotation or in inferred_from field. What we don't have a trigger for is if it's in an extension; these need to be hunted down example GO:0005605 in two genes somewhere; but not the gene that is flagged in your report (Hspg2); I get a hit for an extension for Grpr and Oprm1; however, although my search says these two, I can't find the id in any extension! ARGH

tonysawfordebi commented 6 years ago

(btw: the GAF we consume is this one: http://www.geneontology.org/gene-associations/gene_association.mgi.gz)

hdrabkin commented 6 years ago

The GO:0005605 is a recent merge (last week??); it is possible that has NOT made it to the public GO site I see by its header it is 06/29/2018; VERY old; supposedly a gaf is picked up daily from our FTP site, but the release to geneontology.org might be monthly?

hdrabkin commented 6 years ago

Found them: they are coming in through our GOA mouse load as an occurs_in extension. These need to be fixed at GOA_IntAct or we keep re-loading

hdrabkin commented 6 years ago

I just verified that the SynGO annotations are now in our Noctua gpad; we now have 4800+ annotations to 600 genes via Noctua