geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

gorule-0000059 Upgrade 2.1 GAF files with GAF 2.2 default formula #1558

Closed kltm closed 1 year ago

kltm commented 4 years ago

Proposed action as GORULE:0000059: upgrade 2.1 GAF files with GAF 2.2 default formula

If an annotation line does not have a gp2term qualifier, the above formula should be used to add the information.

This is specifically for dealing with GAF 2.1 files; GAF 2.2 files without a gp2term relation will still be treated as an error.

dougli1sqrd commented 4 years ago

This will be implemented in the GAF 2.1 parser, and the rule message will be placed in the report for gorule-0000059. While normally gorule-0000001 is used during parsing, we will do this just to let people know we have upgraded their relations/qualifiers. This will be a warning.
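As a rough illustration of the behavior described above, here is a minimal Python sketch of a parser step that adds a default relation and records a warning under gorule-0000059. It is not the actual parser code: the function name `upgrade_qualifier`, the list-of-dicts report, and the `default_relation` argument are all assumptions made for this example. It only relies on the GAF convention that the qualifier is column 4 and that multiple qualifiers are pipe-separated.

```python
RULE_ID = "gorule-0000059"
QUALIFIER_COLUMN = 3  # GAF column 4 (Qualifier), as a 0-based index

# Qualifiers allowed in GAF 2.1 that already act as a gp2term relation.
GAF21_RELATIONS = {"contributes_to", "colocalizes_with"}


def upgrade_qualifier(line, default_relation, report):
    """Return the GAF line with a gp2term relation in the qualifier column.

    `default_relation` is supplied by the caller (hypothetical design: the
    choice of default depends on the GO aspect and term). `report` is a plain
    list that collects warning entries; the real report format may differ.
    """
    fields = line.rstrip("\n").split("\t")
    qualifiers = [q for q in fields[QUALIFIER_COLUMN].split("|") if q]

    # A relation is already present: leave the line untouched.
    if any(q in GAF21_RELATIONS for q in qualifiers):
        return "\t".join(fields)

    # Keep "NOT" if present and append the default relation after it.
    qualifiers.append(default_relation)
    fields[QUALIFIER_COLUMN] = "|".join(qualifiers)

    # Reported as a warning so submitters know their qualifier was upgraded.
    report.append({
        "rule": RULE_ID,
        "level": "warning",
        "message": "added default relation '%s'" % default_relation,
    })
    return "\t".join(fields)
```

Called on a line with an empty qualifier column, the sketch fills in the default and appends one warning to the report; a line that already carries `contributes_to` or `colocalizes_with` passes through unchanged and produces no report entry.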

pgaudet commented 4 years ago

You will be doing a repair, won't you? Then it's not a warning?

kltm commented 4 years ago

Noting here that we'll want to look at gaferencer as well once the rules are finalized. This is not an issue in the current flow of the pipeline, but it's good to keep things aligned. @balhoff

kltm commented 4 years ago

Noting that this will not technically be "necessary" until March, but must be implemented and tested before then.

dougli1sqrd commented 3 years ago

So just to clarify the above rule for CC: If the GO Term is subClassOf GO:0110165 "cellular anatomical entity" then we use relation "located in". If the GO Term is subClassOf GO:0032991 "protein-containing complex" then we use "part of".

In Ontobee, it looks like GO:0019012 "virion", GO:0044217 "other organism part", GO:0044423 "virion part" are also direct children of Cellular Component. What should the relation be if it's not in the above two subClassOf closures?

kltm commented 3 years ago

@dougli1sqrd I believe a better way of formulating the instruction is:
CC protein-containing complexes and children: part_of
Everything else: located_in
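The "is-complex / everything else" formulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_PARENTS` is a hand-written, heavily simplified parent map (the real ontology has intermediate terms between these), and the function names are hypothetical, not taken from any GO tool. A real implementation would query the subClassOf closure of the loaded ontology.

```python
PROTEIN_CONTAINING_COMPLEX = "GO:0032991"

# Toy, simplified is_a links for illustration only (assumption: intermediate
# terms are omitted; a real parser would use the full ontology).
TOY_PARENTS = {
    "GO:0032991": ["GO:0005575"],  # protein-containing complex -> cellular_component
    "GO:0110165": ["GO:0005575"],  # cellular anatomical entity -> cellular_component
    "GO:0019012": ["GO:0005575"],  # virion, as a direct child of the CC root
    "GO:0005634": ["GO:0110165"],  # nucleus (simplified path)
    "GO:0000502": ["GO:0032991"],  # proteasome complex (simplified path)
}


def ancestors(term, parents):
    """Collect all ancestors of `term` by walking the parent map."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def default_cc_relation(term, parents):
    """part_of for protein-containing complexes and children, else located_in."""
    if term == PROTEIN_CONTAINING_COMPLEX:
        return "part_of"
    if PROTEIN_CONTAINING_COMPLEX in ancestors(term, parents):
        return "part_of"
    return "located_in"
```

With this formulation, terms such as GO:0019012 "virion" that fall outside both closures need no special case: they simply take `located_in`. The sketch does not treat direct annotations to the CC root specially.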

dustine32 commented 3 years ago

Yeah, that @kltm "is-complex/is-not-complex" logic is pretty much what I do for the PAINT GAF 2.2 files.

dougli1sqrd commented 3 years ago

That makes sense

kltm commented 3 years ago

It looks like with the GAF 2.2 switchover, most sources now fail our sanity checks due to severe output line reduction. Somebody might want to look a little more deeply into that, but I'm assuming that this is an expression of this ticket? https://github.com/geneontology/pipeline/issues/212

pgaudet commented 1 year ago

I need to make a separate GAF 2.1 error file to test this, and remove the tests for gorule-0000059 from the GAF 2.2 test file.

pgaudet commented 1 year ago

We also need rules for direct annotations to the root

pgaudet commented 1 year ago

This is currently working correctly, as tested with the CGD file, which comes in as GAF 2.1 and is correctly fixed in the GO products; see http://release.geneontology.org/2023-07-27/annotations/cgd.gaf.gz

So no more action needed, since that format is not much used anymore.