geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Switch pipeline to read and output gaf 2.2 #211

Closed dougli1sqrd closed 1 month ago

dougli1sqrd commented 3 years ago

Once the Feb release goes out, switch the pipeline to consume and produce gaf 2.2 instead of 2.1.

This ticket will contain any updates and comments concerning tweaks, code updates, tests, etc. that verify the pipeline is working in gaf 2.2.

Checklist

Once we have:

Then:

Current outstanding blocking issues:

Test:

kltm commented 3 years ago

Tagging @dustine32 @vanaukenk

kltm commented 3 years ago

@dustine32 Once the Feb release goes out, we should also switch our PAINT upstreams to point to your 2.2 files.

kltm commented 3 years ago

@dougli1sqrd Alright if I edit a rolling checklist into your top comment?

dougli1sqrd commented 3 years ago

Sure thing

dougli1sqrd commented 3 years ago

From a quick look at the ontobio validate script (the main part of the pipeline parsing and "megastep", or as Seth calls it, the "kernel"), I think all we would need to do is tell the GafWriter to be version 2.2, and then we'd be outputting all gaf 2.2 from the pipeline.

At this point the pipeline's GAF parsing logic is agnostic to GAF version. As the file is read, it looks for the gaf-version string and works out what type of line to expect. If there is no version, it currently falls back to a default version (presently 2.1) and attempts to proceed. This behavior can be changed, as can the default version.
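A minimal sketch of the version sniffing described above, in plain Python. This is not ontobio's actual code; the function name, the regular expression, and the default constant are assumptions made for illustration.

```python
import re

# Assumption: mirrors the default version mentioned in the comment above.
DEFAULT_GAF_VERSION = "2.1"

# Matches header lines such as "!gaf-version: 2.2".
_VERSION_RE = re.compile(r"^!\s*gaf-version:\s*(\S+)")


def detect_gaf_version(lines, default=DEFAULT_GAF_VERSION):
    """Return the version declared in the header, or `default` if none is found."""
    for line in lines:
        if not line.startswith("!"):
            break  # the header ends at the first non-comment line
        match = _VERSION_RE.match(line)
        if match:
            return match.group(1)
    return default
```

A caller could then pick the line parser from the returned string and, if desired, treat a missing header as an error instead of falling back.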

But I believe at first glance that if we set (or we could parameterize with a command line arg) the writer version to gaf 2.2, we'll be done, for some definition of done.

dustine32 commented 3 years ago

@kltm I am so excited to start pointing to those PAINT GAF 2.2 files!

kltm commented 3 years ago

Excitement mounts. I've created a basic checklist at the top. Please add items there as you think of them.

kltm commented 3 years ago

@dougli1sqrd @dustine32 (@vanaukenk) While I have not "finalized" the release (in process), it is now done and frozen, so there is almost no possibility that we'll need to use the current code base for a redo. To give our QC/QA and downstreams as much time as possible to see what we're doing this month, please go ahead and update things to a GAF 2.2 stance.

dougli1sqrd commented 3 years ago

On the way!

dustine32 commented 3 years ago

@kltm I just switched the symlink on the PAINT server to point to the GAF 2.2 files. No changes to the datasets/paint.yaml file should be needed.

dougli1sqrd commented 3 years ago

The changes to output GAF 2.2 by default are present in the newly released ontobio 2.3.0.

kltm commented 3 years ago

I'm now updating the checklist at the top with current blocking tickets.

kltm commented 3 years ago

@dougli1sqrd Clarifying question from @ukemi :

Will http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf.gz and http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf also be GAF 2.2?

I would assume "yes" for the paint one, as that is coming in from internal processes, not sure about the prediction file. Although, technically speaking, neither of these are "products"...

dougli1sqrd commented 3 years ago

Ooh, the predictions are still generated by owltools. Owltools doesn't speak gaf 2.2 does it? Since it's a change in requirements on the qualifier, maybe owltools won't notice?

ukemi commented 3 years ago

Hi @dougli1sqrd. These are two of the three files we pick up from the GOC in our weekly loads. If they are moving to gaf2.2, we need to change how we parse them by the end of this week. ping @loricorbani @hdrabkin

ukemi commented 3 years ago

Just double checked this. We actually get the PAINT annotations from http://snapshot.geneontology.org/annotations/mgi.gaf.gz

That file will be changing, correct?

dougli1sqrd commented 3 years ago

Yeah that file will be changing.

Here's the GAF 2.2 spec: http://geneontology.org/docs/go-annotation-file-gaf-format-2.2/

It's pretty simple, really. It's just the qualifier field that is changing.

ukemi commented 3 years ago

Thanks @dougli1sqrd. It is a minor change, but it's important for us to allow for it and be able to parse the new qualifiers correctly. So we have to know if it will actually change. What about the prediction file? http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf.gz Do you know when the 2.2 files will show up in snapshot?

dougli1sqrd commented 3 years ago

They'll show up today, hopefully.

The standard predictions look okay I think? Here's a sample:

GO:0018215  protein phosphopantetheinylation    AnnotationPropagation   P   GO:0070737  protein-glycine ligase activity, elongating IDA
GO:0018215  protein phosphopantetheinylation    AnnotationPropagation   P   GO:0032452  histone demethylase activity    IBA
GO:0016577  histone demethylation   AnnotationPropagation   P   GO:0032452  histone demethylase activity    IBA
GO:0007186  G protein-coupled receptor signaling pathway    AnnotationPropagation   P   GO:0004930  G protein-coupled receptor activity IBA
GO:0050911  detection of chemical stimulus involved in sensory perception of smell  AnnotationPropagation   P   GO:0004984  olfactory receptor activity IBA
GO:0006357  regulation of transcription by RNA polymerase II    AnnotationPropagation   P   GO:0000981  DNA-binding transcription factor activity, RNA polymerase II-specific   IBA
GO:0050911  detection of chemical stimulus involved in sensory perception of smell  AnnotationPropagation   P   GO:0004984  olfactory receptor activity IBA
GO:0018215  protein phosphopantetheinylation    AnnotationPropagation   P   GO:0004222  metalloendopeptidase activity   IBA
GO:0006508  proteolysis AnnotationPropagation   P   GO:0004222  metalloendopeptidase activity   IBA

ukemi commented 3 years ago

This doesn't look like a gaf. It looks like the prediction mapping file. Here is the top of the current file. It's not zipped. My bad.

!gaf-version: 2.0
!
! Date: 2021/02/02
!
! Used ontologies and versions (optional)
! go/extensions/go-gaf go/releases/2021-01-30/extensions/go-gaf.owl
!
! Generated predictions
!
MGI MGI:101761 Hmga2 GO:0006355 PMID:21873635 IBA PANTHER:PTN001155469|UniProtKB:P17096|MGI:MGI:96160|UniProtKB:P52926 P high mobility group AT-hook 2 9430083A20Rik|Hmgic|HMGI-C protein taxon:10090 20201025 GOC
MGI MGI:101833 Elk1 GO:0006357 PMID:21873635 IBA PANTHER:PTN000218930|UniProtKB:Q06546|UniProtKB:P50548|MGI:MGI:99253|UniProtKB:Q99607|MGI:MGI:1101781|FB:FBgn0003118|UniProtKB:P41161|MGI:MGI:109336|UniProtKB:P15036|FB:FBgn0000567|MGI:MGI:95554|UniProtKB:P11308|UniProtKB:P19419|UniProtKB:Q15723|UniProtKB:P32519|UniProtKB:Q9NZC4|UniProtKB:P41212|RGD:628860|UniProtKB:P50549|MGI:MGI:1341168|UniProtKB:P78545|MGI:MGI:1335079|MGI:MGI:107180|MGI:MGI:98282|UniProtKB:P28324|UniProtKB:P43268|UniProtKB:Q9Y603|FB:FBgn0000097|UniProtKB:P41970|UniProtKB:P14921|UniProtKB:Q01892|MGI:MGI:1350926 P ELK1, member of ETS oncogene family Elk-1 protein taxon:10090 20170228 GOC
MGI MGI:101877 Tcf12 GO:0006357 PMID:21873635 IBA PANTHER:PTN000927455|UniProtKB:P15923|FB:FBgn0267821|UniProtKB:Q99081|WB:WBGene00001949|MGI:MGI:98510|MGI:MGI:98506 P transcription factor 12 ALF1|bHLHb20|HEB|HEBAlt|HTF4|HTF-4|ME1|REB protein taxon:10090 20200911 GOC

dougli1sqrd commented 3 years ago

Ah I was looking in the wrong place. Here's what the currently running snapshot made (the top):

!gaf-version: 2.0
! 
! Date: 2021/02/09
! 
!  Used ontologies and versions (optional)
!   go/extensions/go-gaf    go/releases/2021-02-02/extensions/go-gaf.owl
! 
!  Generated predictions
! 
MGI MGI:101762  Elk3        GO:0006357  PMID:21873635   IBA PANTHER:PTN000218930|UniProtKB:P41161|UniProtKB:P41970|UniProtKB:P19419|MGI:MGI:1350926|UniProtKB:Q15723|MGI:MGI:107180|UniProtKB:P15036|MGI:MGI:95554|MGI:MGI:99253|UniProtKB:Q06546|UniProtKB:Q9NZC4|UniProtKB:P28324|UniProtKB:P11308|UniProtKB:P43268|MGI:MGI:1341168|UniProtKB:P41212|MGI:MGI:1101781|FB:FBgn0000567|FB:FBgn0003118|UniProtKB:P50548|MGI:MGI:98282|UniProtKB:Q9Y603|MGI:MGI:109336|MGI:MGI:1335079|FB:FBgn0000097|UniProtKB:P50549|RGD:628860|UniProtKB:Q01892|UniProtKB:P32519|UniProtKB:P78545|UniProtKB:P14921|UniProtKB:Q99607 P   ELK3, member of ETS oncogene family D430049E23Rik|Erp|Net|Sap-2 protein taxon:10090 20170228    GOC     
MGI MGI:101765  Cdk5        GO:0006468  PMID:21873635   IBA PANTHER:PTN000623091|dictyBase:DDB_G0272813|FB:FBgn0013762|RGD:70486|UniProtKB:P06493|PomBase:SPAC2F3.15|PomBase:SPAC23H4.17c|TAIR:locus:2011761|UniProtKB:O94921|SGD:S000006365|RGD:2319|UniProtKB:Q00534|RGD:621124|MGI:MGI:88351|WB:WBGene00000405|ZFIN:ZDB-GENE-081022-110|MGI:MGI:88357|ZFIN:ZDB-GENE-010131-2|PomBase:SPCC16C4.11|PomBase:SPBC11B10.09|UniProtKB:P11802|dictyBase:DDB_G0288677|SGD:S000001622|PomBase:SPBC19F8.07|MGI:MGI:104772|UniProtKB:P24941|UniProtKB:P50750|SGD:S000005963|UniProtKB:P61075|UniProtKB:Q8IJQ1|FB:FBgn0019949|CGD:CAL0000191263|PomBase:SPBC32H8.10|SGD:S000005952|FB:FBgn0005640|RGD:621120|UniProtKB:A0A1D8PDA6|UniProtKB:Q00646|FB:FBgn0263237|SGD:S000000364|UniProtKB:C9K505|RGD:70514|FB:FBgn0004106   P   cyclin-dependent kinase 5   Crk6    protein taxon:10090 20201206    GOC     
MGI MGI:101765  Cdk5        GO:0018215  PMID:21873635   IBA PANTHER:PTN000623091|dictyBase:DDB_G0272813|FB:FBgn0013762|RGD:70486|UniProtKB:P06493|PomBase:SPAC2F3.15|PomBase:SPAC23H4.17c|TAIR:locus:2011761|UniProtKB:O94921|SGD:S000006365|RGD:2319|UniProtKB:Q00534|RGD:621124|MGI:MGI:88351|WB:WBGene00000405|ZFIN:ZDB-GENE-081022-110|MGI:MGI:88357|ZFIN:ZDB-GENE-010131-2|PomBase:SPCC16C4.11|PomBase:SPBC11B10.09|UniProtKB:P11802|dictyBase:DDB_G0288677|SGD:S000001622|PomBase:SPBC19F8.07|MGI:MGI:104772|UniProtKB:P24941|UniProtKB:P50750|SGD:S000005963|UniProtKB:P61075|UniProtKB:Q8IJQ1|FB:FBgn0019949|CGD:CAL0000191263|PomBase:SPBC32H8.10|SGD:S000005952|FB:FBgn0005640|RGD:621120|UniProtKB:A0A1D8PDA6|UniProtKB:Q00646|FB:FBgn0263237|SGD:S000000364|UniProtKB:C9K505|RGD:70514|FB:FBgn0004106   P   cyclin-dependent kinase 5   Crk6    protein taxon:10090 20201206    GOC     
MGI MGI:101765  Cdk5        GO:0051726  PMID:21873635   IBA 

ukemi commented 3 years ago

Thanks Eric! So this one is still in gaf2.0. We will not change our load.

vanaukenk commented 3 years ago

Checking the WB files (input GAF2.2, output GAF2.2, output GPAD1.2), annotations look okay except for the ones with annotation extensions which are missing in the output files.

I assume that issue is being fixed with this PR: https://github.com/geneontology/go-site/pull/1618

so will continue to check other annotations until that fix percolates through.

kltm commented 3 years ago

@vanaukenk It looks like a snapshot has passed through the pipeline.

vanaukenk commented 3 years ago

Thanks @kltm I'll do some more QC checks later today.

vanaukenk commented 3 years ago

@kltm @dougli1sqrd @dustine32

I've come across two other issues, one of which may be outside the scope of GAF2.2, but I'll put them both here, just in case.

1) If groups submit annotations to root node that don't use the default root relations as defined in the spec, i.e. 'involved_in' for BP; 'enables' for MF; and 'is_active_in' for CC, it doesn't look like we're repairing those annotations. Can we do that?

2) It looks like some information originating from the PAINT GAF2.2 source file is not carried forward to the production GAF2.2 or is transformed in a way that I'm not sure makes sense. See columns L, M, and N in lines 21 and 22; 42 and 43; and 46 and 47, of my test spreadsheet. This might be outside of the GAF2.2 testing, but I wasn't sure why the information in column L doesn't go into the production file, why a PTN is used as a synonym in column M, and why protein gets transformed to gene_product in column N. I can put this into a separate ticket, if need be.
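For point 1, a minimal sketch of what such a repair could look like at the row level. It assumes a GAF 2.2 row already split into a list of columns, with the qualifier in column 4 and the GO ID in column 5; the function name is hypothetical and this is not how ontobio implements it. The default relations are the ones quoted above.

```python
# Root terms and the default relations quoted in the comment above.
ROOT_DEFAULTS = {
    "GO:0008150": "involved_in",   # biological_process root
    "GO:0003674": "enables",       # molecular_function root
    "GO:0005575": "is_active_in",  # cellular_component root
}


def repair_root_qualifier(row):
    """Return a copy of `row` with the default relation if it annotates a root term."""
    default = ROOT_DEFAULTS.get(row[4])
    if default is None:
        return row  # not a root-node annotation, leave untouched
    parts = row[3].split("|") if row[3] else []
    negated = [p for p in parts if p == "NOT"]  # keep a NOT prefix if present
    repaired = list(row)
    repaired[3] = "|".join(negated + [default])
    return repaired
```

Whether the pipeline should repair silently or also emit a report line is a separate decision that this sketch does not address.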

Update: after talking with @ukemi , I'd like to confirm exactly what the PAINT source file is that is used to go into production. Maybe the changes I noted above are because I'm looking at the wrong source file.

Thx; I'll continue testing.....

vanaukenk commented 3 years ago

Almost finished testing the WB files. Right now, I think the only other thing we'll need to discuss is whether we also want to repair any relations for IEA annotations. I'd like to discuss this with @pgaudet as GOA is a major source of IEAs for many groups and we need to make sure they're okay with whatever we decide to do.

ukemi commented 3 years ago

Testing MGI files I find this annotation in the src file: MGI MGI:2137630 Pkmyt1 acts_upstream_of_or_within GO:0018215 MGI:MGI:6201960|PMID:21873635 IBA PANTHER:PTN000113601|UniProtKB:C6KTB8|UniProtKB:Q9P2K8|FB:FBgn0040298|MGI:MGI:1353448|ZFIN:ZDB-GENE-050301-2|SGD:S000003723|MGI:MGI:103075|UniProtKB:Q9BQI3|UniProtKB:Q9LX30|PomBase:SPAC222.07c|ZFIN:ZDB-GENE-080422-1|MGI:MGI:1353449|PomBase:SPBC36B7.09|FB:FBgn0037327|TAIR:locus:2024780|PomBase:SPCC18B5.03|RGD:70883|FB:FBgn0011737|WB:WBGene00003970|RGD:70884|UniProtKB:Q8IL26|PomBase:SPAC20G4.03c|PomBase:SPBC660.14|UniProtKB:Q9NZJ5|dictyBase:DDB_G0272837|MGI:MGI:1353427|MGI:MGI:1341830|WB:WBGene00006988|UniProtKB:A0A0B4KHX7|SGD:S000002691|UniProtKB:P19525|UniProtKB:A0A1D8PQT9 P protein kinase, membrane associated tyrosine/threonine 1 Myt1 protein taxon:10090 2020-08-07 GOC

But I cannot find this in the mgi gaf in the annotations file.

ukemi commented 3 years ago

Note that this annotation originates in the mgi_predictions file:

MGI MGI:2137630 Pkmyt1 GO:0018215 PMID:21873635 IBA PANTHER:PTN000113601|ZFIN:ZDB-GENE-050301-2|PomBase:SPBC36B7.09|MGI:MGI:1341830|TAIR:locus:2024780|dictyBase:DDB_G0272837|UniProtKB:Q9LX30|MGI:MGI:1353449|PomBase:SPAC20G4.03c|PomBase:SPAC222.07c|PomBase:SPCC18B5.03|RGD:70883|UniProtKB:C6KTB8|WB:WBGene00006988|FB:FBgn0037327|MGI:MGI:1353448|ZFIN:ZDB-GENE-080422-1|FB:FBgn0040298|RGD:70884|UniProtKB:A0A1D8PQT9|PomBase:SPBC660.14|FB:FBgn0011737|UniProtKB:A0A0B4KHX7|UniProtKB:Q9BQI3|SGD:S000003723|SGD:S000002691|MGI:MGI:1353427|UniProtKB:Q9P2K8|UniProtKB:Q8IL26|UniProtKB:P19525|WB:WBGene00003970|MGI:MGI:103075|UniProtKB:Q9NZJ5 P protein kinase, membrane associated tyrosine/threonine 1 Myt1 protein taxon:10090 20200807 GOC

hdrabkin commented 3 years ago

Is this an inference annotation? This might be treated as a duplicate when we try to get them into MGI (it's a PAINT annotation). I think we have a hard time loading such a long 'inferred from' field.

hdrabkin commented 3 years ago

That is, the field gets truncated on loading, and it might then look like another annotation with fewer items in the field?

ukemi commented 3 years ago

It IS in the MGI source file, it IS NOT in the GOC output file. It IS in the prediction (inference) file. I suspect that part of the processing on the GOC side is to prevent tail-eating by stripping all PAINT annotations from the MGI file and then injecting them back as part of the GOC pipeline. The predictions that are based on PAINT are being stripped, but are not reinjected. Is the pipeline stripping based on PMID?

hdrabkin commented 3 years ago

Give me a few minutes to check the wiki

ukemi commented 3 years ago

The problem is not on the MGI side.

hdrabkin commented 3 years ago

By 'MGI source file', do you mean the one MGI supplies? If there is a PAINT annotation in it, I thought it's stripped. Does it use the PMID (Gaudet paper) or GO_Central, I wonder?

ukemi commented 3 years ago

I also notice that PAINT annotations in the MGI file have the MGI reference for the PAINT paper, MGI:MGI:6201960, but this is not injected as part of the GO pipeline, so it is missing in the file provided by the GOC. I guess this is ok, but should be noted here as technically a discrepancy.

ukemi commented 3 years ago

@hdrabkin, you are correct. If the pipeline used both the PMID and the provider to distinguish PAINT annotations then it could distinguish those directly from PAINT versus those from predictions based on the PAINT annotations. PAINT gets GO_Central and the predictions get GOC in the provider field. See my spreadsheet here: https://docs.google.com/spreadsheets/d/1kf9mvxMmY-zapsHQK9qfRt7O9OV0dc6Daes5TpyrEf0/edit#gid=712573202

lines 86 and 87 versus lines 110 and 111.
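The distinction described above could be sketched as follows. This assumes a row split into the standard GAF columns, with the reference in column 6 and the provider (Assigned By) in column 15; the function name and the returned labels are made up for illustration.

```python
PAINT_PMID = "PMID:21873635"  # the PAINT reference mentioned in this thread


def classify_paint_row(row):
    """Label a GAF row as 'paint', 'prediction', or 'other'.

    Uses both the reference column and the provider column, since the
    PMID alone cannot separate the two cases.
    """
    if len(row) < 15:
        return "other"
    references = row[5].split("|")
    if PAINT_PMID not in references:
        return "other"
    provider = row[14]
    if provider == "GO_Central":
        return "paint"       # directly from PAINT
    if provider == "GOC":
        return "prediction"  # predicted from a PAINT annotation
    return "other"
```

A stripping step keyed on this label instead of the PMID alone would leave the prediction rows in place.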

hdrabkin commented 3 years ago

Yep, here is how MGI pulls them:

  1. Source the configuration file to establish the environment. (this is the mgi.gaf.gz in snapshot)
  2. Create annotation load (sw:annotload) input file from the GO/PAINT mirror_ftp file.
       rows where field 6 (DB:Reference(s)) == PMID:21873635 are processed (the Gaudet paper)
       rows where MGI:xxx is of type "gene"
       rows where field 8 (With (or) From) contain Panther IDs are processed
  3. Call the annotation loader.
  4. Call the inferred-from cache update
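The row selection in step 2 can be sketched roughly as below. This is not the actual MGI loader; the function name is hypothetical, and the "type gene" condition is left out because it depends on MGI's own database and cannot be decided from the GAF row alone.

```python
PAINT_PMID = "PMID:21873635"  # field 6 value named in step 2


def select_paint_rows(rows):
    """Yield rows that pass the two file-level conditions from step 2.

    Each row is a list of GAF columns: field 6 is index 5 (reference),
    field 8 is index 7 (With/From).
    """
    for row in rows:
        if len(row) < 8:
            continue  # malformed or truncated line
        if row[5] != PAINT_PMID:
            continue
        with_from = row[7].split("|")
        if any(item.startswith("PANTHER:") for item in with_from):
            yield row
```

Note that nothing here looks at the provider column, which matches the observation in the next sentence.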

So I don't see that MGI looks for GO_Central vs GOC when MGI loads them.

ukemi commented 3 years ago

Yes, but this is the other direction, this is for our load. We will also need to consider the provider to distinguish the annotations that are directly from PAINT versus those from prediction. Unless of course the GO pipeline injected the prediction annotations into the main file and we took both PAINT and the predictions from it and no longer loaded the file from the products directory (hint hint).

ukemi commented 3 years ago

So at MGI what we call the 'GO/CFP (Component, Function and Process)' load would be rolled into the 'GO/PAINT' load and we would get both the PAINT and prediction annotations from http://snapshot.geneontology.org/annotations/mgi.gaf.gz.

hdrabkin commented 3 years ago

No, we have a separate load for GO/CFP (user name = "GOC"; uses reference/J:) from the GOC input file http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf, which is where we pull from.

ukemi commented 3 years ago

Right! What I'm saying is that IMO the best solution would be to roll all this into one load from the full file of mouse annotations (Noctua too). In other words, rather than do three separate loads from the GOC (PAINT, Predictions, Noctua), we have a one-stop shop. "We" being MGI.

dougli1sqrd commented 3 years ago

For gorule-0000061 implementation: https://github.com/biolink/ontobio/pull/533

vanaukenk commented 3 years ago

@dougli1sqrd is gorule-0000061 implemented for the Thu Mar 4 00:01:38 PST 2021 snapshot build? QCing the WB files would suggest it's not, so I just wanted to make sure before I do any more testing. Thx.

vanaukenk commented 3 years ago

@dougli1sqrd @kltm

The headers in the GAF2.2 files produced by the pipeline don't conform to our specs :-)

Here is what's in our spec (and most groups have been very good about this formatting in the src files):

generated-by: database listed in dbxrefs.yaml
date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM

But here is what's in the annotation file produced:

!Generated by GO Central
!
!Date Generated by GOC: 2021-03-05
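A small check for this could look like the sketch below. It assumes the two spec headers appear in the file as comment lines with a leading "!", which is how header lines appear in the samples earlier in this thread; the function name and patterns are illustrative only.

```python
import re

# Patterns for the two header lines named in the spec excerpt above.
REQUIRED_HEADERS = {
    "generated-by": re.compile(r"^!\s*generated-by:\s*\S+"),
    "date-generated": re.compile(
        r"^!\s*date-generated:\s*\d{4}-\d{2}-\d{2}(T\d{2}:\d{2})?\s*$"
    ),
}


def missing_headers(lines):
    """Return the names of required headers that no header line satisfies."""
    header = [line for line in lines if line.startswith("!")]
    return [
        name
        for name, pattern in REQUIRED_HEADERS.items()
        if not any(pattern.match(line) for line in header)
    ]
```

Run against the three header lines quoted above, this reports both headers as missing.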

ukemi commented 3 years ago

Also note that the date format in the header is not the same as the date format in the annotation data (presence and absence of hyphens). We recently 'fixed' this in our file.
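For downstream parsers that need to tolerate both forms, a minimal normalizer might look like this. The function names are hypothetical; the two accepted forms are the hyphenated and unhyphenated dates seen in this thread.

```python
from datetime import datetime


def to_compact_date(value):
    """Accept YYYY-MM-DD or YYYYMMDD and return YYYYMMDD."""
    digits = value.strip().replace("-", "")
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"unrecognized date: {value!r}")
    datetime.strptime(digits, "%Y%m%d")  # raises ValueError on impossible dates
    return digits


def to_hyphenated_date(value):
    """Accept YYYY-MM-DD or YYYYMMDD and return YYYY-MM-DD."""
    digits = to_compact_date(value)
    return f"{digits[:4]}-{digits[4:6]}-{digits[6:]}"
```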

ukemi commented 3 years ago

I noticed this morning that most of our CC annotations from Noctua will be filtered or flagged because they use the part_of relation for all CC annotations. Recently this has changed to use located_in for cellular anatomical structures and part_of for protein complexes. We will need to update all the models that were made using the former standards so that the annotations conform to the new annotation practice.

hdrabkin commented 3 years ago

Any way this can be computationally automated?

ukemi commented 3 years ago

I see this not only with MGI models, but in SynGO annotations as well.

ukemi commented 3 years ago

'Any way this can be computationally automated?' I hope so. It would be a lot of stuff to do by hand.
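At the level of exported GAF rows, one possible automation is sketched below. It assumes the caller supplies the set of GO IDs that count as protein complexes (for example, GO:0032991 protein-containing complex and its descendants, computed from the ontology elsewhere); the function name is hypothetical, and fixing the Noctua models themselves would need a different approach.

```python
def relabel_cc_part_of(row, complex_terms):
    """Switch part_of to located_in for CC annotations to non-complex terms.

    `row` is a list of GAF columns (qualifier at index 3, GO ID at index 4,
    aspect at index 8). `complex_terms` is a caller-supplied set of GO IDs
    that should keep part_of.
    """
    if len(row) < 9 or row[8] != "C":
        return row  # only cellular component annotations are affected
    qualifiers = row[3].split("|")
    if "part_of" not in qualifiers or row[4] in complex_terms:
        return row
    updated = list(row)
    updated[3] = "|".join(
        "located_in" if q == "part_of" else q for q in qualifiers
    )
    return updated
```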