Closed dougli1sqrd closed 1 month ago
Tagging @dustine32 @vanaukenk
@dustine32 Once the feb release goes out, we should also switch our PAINT upstreams to point to your 2.2 files.
@dougli1sqrd Alright if I edit a rolling checklist into your top comment?
Sure thing
So from a quick look at the ontobio validate script (the main part of the pipeline parsing and "megastep", or as Seth calls it, the "kernel"), I think all we would need to do is tell the GafWriter to be version 2.2, and then we'd be outputting all GAF 2.2 from the pipeline.
And at this point the pipeline GAF parsing logic is agnostic to GAF version. As the file is read, it looks for the gaf-version string and figures out what type of line to expect. If no version is declared, the parser currently falls back to a default version (presently 2.1) and attempts to proceed. This behavior can be changed, as can the choice of default version.
But I believe at first glance that if we set (or we could parameterize with a command line arg) the writer version to gaf 2.2, we'll be done, for some definition of done.
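To make the idea concrete, here is a minimal standalone sketch of the version sniffing and of a command-line switch for the writer version. It is not ontobio's actual code: the function names and the `--gaf-version` flag are hypothetical, and the 2.1 fallback simply mirrors the current behavior described above.

```python
import argparse

DEFAULT_GAF_VERSION = "2.1"  # assumption: mirrors the current fallback described above


def detect_gaf_version(lines, default=DEFAULT_GAF_VERSION):
    """Return the version declared in a '!gaf-version:' header, else the default."""
    for line in lines:
        if not line.startswith("!"):
            break  # header comments end at the first annotation line
        if line.lower().startswith("!gaf-version:"):
            return line.split(":", 1)[1].strip()
    return default


def build_arg_parser():
    """Hypothetical CLI that lets the caller choose the writer's output version."""
    parser = argparse.ArgumentParser(description="GAF version sketch")
    parser.add_argument("--gaf-version", choices=["2.1", "2.2"], default="2.2")
    return parser
```

Making 2.2 the flag's default while keeping 2.1 selectable would leave a way back if a redo of an older release were ever needed.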
@kltm I am so excited to start pointing to those PAINT GAF 2.2 files!
Excitement mounts. I've created a basic checklist at the top. Please add items there as you think of them.
@dougli1sqrd @dustine32 (@vanaukenk) While I have not "finalized" the release (in process), it is now done and frozen, so there is almost no possibility that we'll need to use the current code base for a redo. To give our QC/QA and downstreams as much time as possible to see what we're doing this month, please go ahead and update things to a GAF 2.2 stance.
On the way!
@kltm I just switched the symlink on the PAINT server to point to the GAF 2.2 files. No changes to the datasets/paint.yaml file should be needed.
The changes to output GAF 2.2 by default are present in the newly released ontobio 2.3.0.
I'm now updating the checklist at the top with current blocking tickets.
@dougli1sqrd Clarifying question from @ukemi :
Will http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf.gz and http://snapshot.geneontology.org/products/annotations/paint_mgi.gaf also be GAF 2.2?
I would assume "yes" for the paint one, as that is coming in from internal processes, not sure about the prediction file. Although, technically speaking, neither of these are "products"...
Ooh, the predictions are still generated by owltools. Owltools doesn't speak gaf 2.2 does it? Since it's a change in requirements on the qualifier, maybe owltools won't notice?
Hi @dougli1sqrd. These are two of the three files we pick up from the GOC in our weekly loads. If they are moving to gaf2.2, we need to change how we parse them by the end of this week. ping @loricorbani @hdrabkin
Just double checked this. We actually get the PAINT annotations from http://snapshot.geneontology.org/annotations/mgi.gaf.gz
That file will be changing, correct?
Yeah that file will be changing.
Here's the GAF 2.2 spec: http://geneontology.org/docs/go-annotation-file-gaf-format-2.2/
It's pretty simple, really. It's just the qualifier field that is changing.
Thanks @dougli1sqrd. It is a minor change, but it's important for us to allow for it and be able to parse the new qualifiers correctly. So we have to know if it will actually change. What about the prediction file? http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf.gz Do you know when the 2.2 files will show up in snapshot?
They'll show up today, hopefully.
The standard predictions look okay I think? Here's a sample:
GO:0018215 protein phosphopantetheinylation AnnotationPropagation P GO:0070737 protein-glycine ligase activity, elongating IDA
GO:0018215 protein phosphopantetheinylation AnnotationPropagation P GO:0032452 histone demethylase activity IBA
GO:0016577 histone demethylation AnnotationPropagation P GO:0032452 histone demethylase activity IBA
GO:0007186 G protein-coupled receptor signaling pathway AnnotationPropagation P GO:0004930 G protein-coupled receptor activity IBA
GO:0050911 detection of chemical stimulus involved in sensory perception of smell AnnotationPropagation P GO:0004984 olfactory receptor activity IBA
GO:0006357 regulation of transcription by RNA polymerase II AnnotationPropagation P GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific IBA
GO:0050911 detection of chemical stimulus involved in sensory perception of smell AnnotationPropagation P GO:0004984 olfactory receptor activity IBA
GO:0018215 protein phosphopantetheinylation AnnotationPropagation P GO:0004222 metalloendopeptidase activity IBA
GO:0006508 proteolysis AnnotationPropagation P GO:0004222 metalloendopeptidase activity IBA
This doesn't look like a gaf. It looks like the prediction mapping file. Here is the top of the current file. It's not zipped. My bad.
!gaf-version: 2.0
!
! Date: 2021/02/02
!
! Used ontologies and versions (optional)
! go/extensions/go-gaf go/releases/2021-01-30/extensions/go-gaf.owl
!
! Generated predictions
!
MGI MGI:101761 Hmga2 GO:0006355 PMID:21873635 IBA PANTHER:PTN001155469|UniProtKB:P17096|MGI:MGI:96160|UniProtKB:P52926 P high mobility group AT-hook 2 9430083A20Rik|Hmgic|HMGI-C protein taxon:10090 20201025 GOC
MGI MGI:101833 Elk1 GO:0006357 PMID:21873635 IBA PANTHER:PTN000218930|UniProtKB:Q06546|UniProtKB:P50548|MGI:MGI:99253|UniProtKB:Q99607|MGI:MGI:1101781|FB:FBgn0003118|UniProtKB:P41161|MGI:MGI:109336|UniProtKB:P15036|FB:FBgn0000567|MGI:MGI:95554|UniProtKB:P11308|UniProtKB:P19419|UniProtKB:Q15723|UniProtKB:P32519|UniProtKB:Q9NZC4|UniProtKB:P41212|RGD:628860|UniProtKB:P50549|MGI:MGI:1341168|UniProtKB:P78545|MGI:MGI:1335079|MGI:MGI:107180|MGI:MGI:98282|UniProtKB:P28324|UniProtKB:P43268|UniProtKB:Q9Y603|FB:FBgn0000097|UniProtKB:P41970|UniProtKB:P14921|UniProtKB:Q01892|MGI:MGI:1350926 P ELK1, member of ETS oncogene family Elk-1 protein taxon:10090 20170228 GOC
MGI MGI:101877 Tcf12 GO:0006357 PMID:21873635 IBA PANTHER:PTN000927455|UniProtKB:P15923|FB:FBgn0267821|UniProtKB:Q99081|WB:WBGene00001949|MGI:MGI:98510|MGI:MGI:98506 P transcription factor 12 ALF1|bHLHb20|HEB|HEBAlt|HTF4|HTF-4|ME1|REB protein taxon:10090 20200911 GOC
Ah I was looking in the wrong place. Here's what the currently running snapshot made (the top):
!gaf-version: 2.0
!
! Date: 2021/02/09
!
! Used ontologies and versions (optional)
! go/extensions/go-gaf go/releases/2021-02-02/extensions/go-gaf.owl
!
! Generated predictions
!
MGI MGI:101762 Elk3 GO:0006357 PMID:21873635 IBA PANTHER:PTN000218930|UniProtKB:P41161|UniProtKB:P41970|UniProtKB:P19419|MGI:MGI:1350926|UniProtKB:Q15723|MGI:MGI:107180|UniProtKB:P15036|MGI:MGI:95554|MGI:MGI:99253|UniProtKB:Q06546|UniProtKB:Q9NZC4|UniProtKB:P28324|UniProtKB:P11308|UniProtKB:P43268|MGI:MGI:1341168|UniProtKB:P41212|MGI:MGI:1101781|FB:FBgn0000567|FB:FBgn0003118|UniProtKB:P50548|MGI:MGI:98282|UniProtKB:Q9Y603|MGI:MGI:109336|MGI:MGI:1335079|FB:FBgn0000097|UniProtKB:P50549|RGD:628860|UniProtKB:Q01892|UniProtKB:P32519|UniProtKB:P78545|UniProtKB:P14921|UniProtKB:Q99607 P ELK3, member of ETS oncogene family D430049E23Rik|Erp|Net|Sap-2 protein taxon:10090 20170228 GOC
MGI MGI:101765 Cdk5 GO:0006468 PMID:21873635 IBA PANTHER:PTN000623091|dictyBase:DDB_G0272813|FB:FBgn0013762|RGD:70486|UniProtKB:P06493|PomBase:SPAC2F3.15|PomBase:SPAC23H4.17c|TAIR:locus:2011761|UniProtKB:O94921|SGD:S000006365|RGD:2319|UniProtKB:Q00534|RGD:621124|MGI:MGI:88351|WB:WBGene00000405|ZFIN:ZDB-GENE-081022-110|MGI:MGI:88357|ZFIN:ZDB-GENE-010131-2|PomBase:SPCC16C4.11|PomBase:SPBC11B10.09|UniProtKB:P11802|dictyBase:DDB_G0288677|SGD:S000001622|PomBase:SPBC19F8.07|MGI:MGI:104772|UniProtKB:P24941|UniProtKB:P50750|SGD:S000005963|UniProtKB:P61075|UniProtKB:Q8IJQ1|FB:FBgn0019949|CGD:CAL0000191263|PomBase:SPBC32H8.10|SGD:S000005952|FB:FBgn0005640|RGD:621120|UniProtKB:A0A1D8PDA6|UniProtKB:Q00646|FB:FBgn0263237|SGD:S000000364|UniProtKB:C9K505|RGD:70514|FB:FBgn0004106 P cyclin-dependent kinase 5 Crk6 protein taxon:10090 20201206 GOC
MGI MGI:101765 Cdk5 GO:0018215 PMID:21873635 IBA PANTHER:PTN000623091|dictyBase:DDB_G0272813|FB:FBgn0013762|RGD:70486|UniProtKB:P06493|PomBase:SPAC2F3.15|PomBase:SPAC23H4.17c|TAIR:locus:2011761|UniProtKB:O94921|SGD:S000006365|RGD:2319|UniProtKB:Q00534|RGD:621124|MGI:MGI:88351|WB:WBGene00000405|ZFIN:ZDB-GENE-081022-110|MGI:MGI:88357|ZFIN:ZDB-GENE-010131-2|PomBase:SPCC16C4.11|PomBase:SPBC11B10.09|UniProtKB:P11802|dictyBase:DDB_G0288677|SGD:S000001622|PomBase:SPBC19F8.07|MGI:MGI:104772|UniProtKB:P24941|UniProtKB:P50750|SGD:S000005963|UniProtKB:P61075|UniProtKB:Q8IJQ1|FB:FBgn0019949|CGD:CAL0000191263|PomBase:SPBC32H8.10|SGD:S000005952|FB:FBgn0005640|RGD:621120|UniProtKB:A0A1D8PDA6|UniProtKB:Q00646|FB:FBgn0263237|SGD:S000000364|UniProtKB:C9K505|RGD:70514|FB:FBgn0004106 P cyclin-dependent kinase 5 Crk6 protein taxon:10090 20201206 GOC
MGI MGI:101765 Cdk5 GO:0051726 PMID:21873635 IBA
Thanks Eric! So this one is still in gaf2.0. We will not change our load.
Checking the WB files (input GAF2.2, output GAF2.2, output GPAD1.2), annotations look okay except for the ones with annotation extensions which are missing in the output files.
I assume that issue is being fixed with this PR: https://github.com/geneontology/go-site/pull/1618
so will continue to check other annotations until that fix percolates through.
@vanaukenk It looks like a snapshot has passed through the pipeline.
Thanks @kltm I'll do some more QC checks later today.
@kltm @dougli1sqrd @dustine32
I've come across two other issues, one of which may be outside the scope of GAF2.2, but I'll put them both here, just in case.
1) If groups submit annotations to root node that don't use the default root relations as defined in the spec, i.e. 'involved_in' for BP; 'enables' for MF; and 'is_active_in' for CC, it doesn't look like we're repairing those annotations. Can we do that?
2) It looks like some information originating from the PAINT GAF2.2 source file is not carried forward to the production GAF2.2 or is transformed in a way that I'm not sure makes sense. See columns L, M, and N in lines 21 and 22; 42 and 43; and 46 and 47, of my test spreadsheet. This might be outside of the GAF2.2 testing, but I wasn't sure why the information in column L doesn't go into the production file, why a PTN is used as a synonym in column M, and why protein gets transformed to gene_product in column N. I can put this into a separate ticket, if need be.
Update: after talking with @ukemi , I'd like to confirm exactly what the PAINT source file is that is used to go into production. Maybe the changes I noted above are because I'm looking at the wrong source file.
Thx; I'll continue testing.....
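For the root-node repair in item 1, a rough sketch of what I have in mind is below. The default relations are the ones from the spec as quoted above; the function name is hypothetical, and keeping a leading NOT is an assumption about how negation should be handled.

```python
# Root terms of the three GO aspects, mapped to their default relations.
ROOT_TERMS = {
    "GO:0008150": "involved_in",   # biological_process
    "GO:0003674": "enables",       # molecular_function
    "GO:0005575": "is_active_in",  # cellular_component
}


def repair_root_relation(qualifier, go_id):
    """Return the qualifier with the default root relation, keeping any NOT."""
    default = ROOT_TERMS.get(go_id)
    if default is None:
        return qualifier  # not a root annotation: leave untouched
    parts = [p for p in qualifier.split("|") if p]
    negated = "NOT" in parts
    return "NOT|" + default if negated else default
```

For example, `repair_root_relation("part_of", "GO:0005575")` gives `is_active_in`, while annotations to non-root terms pass through unchanged.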
Almost finished testing the WB files. Right now, I think the only other thing we'll need to discuss is whether we also want to repair any relations for IEA annotations. I'd like to discuss this with @pgaudet as GOA is a major source of IEAs for many groups and we need to make sure they're okay with whatever we decide to do.
Testing MGI files I find this annotation in the src file: MGI MGI:2137630 Pkmyt1 acts_upstream_of_or_within GO:0018215 MGI:MGI:6201960|PMID:21873635 IBA PANTHER:PTN000113601|UniProtKB:C6KTB8|UniProtKB:Q9P2K8|FB:FBgn0040298|MGI:MGI:1353448|ZFIN:ZDB-GENE-050301-2|SGD:S000003723|MGI:MGI:103075|UniProtKB:Q9BQI3|UniProtKB:Q9LX30|PomBase:SPAC222.07c|ZFIN:ZDB-GENE-080422-1|MGI:MGI:1353449|PomBase:SPBC36B7.09|FB:FBgn0037327|TAIR:locus:2024780|PomBase:SPCC18B5.03|RGD:70883|FB:FBgn0011737|WB:WBGene00003970|RGD:70884|UniProtKB:Q8IL26|PomBase:SPAC20G4.03c|PomBase:SPBC660.14|UniProtKB:Q9NZJ5|dictyBase:DDB_G0272837|MGI:MGI:1353427|MGI:MGI:1341830|WB:WBGene00006988|UniProtKB:A0A0B4KHX7|SGD:S000002691|UniProtKB:P19525|UniProtKB:A0A1D8PQT9 P protein kinase, membrane associated tyrosine/threonine 1 Myt1 protein taxon:10090 2020-08-07 GOC
But I cannot find this in the mgi gaf in the annotations file.
Note that this annotation originates in the mgi_predictions file:
MGI MGI:2137630 Pkmyt1 GO:0018215 PMID:21873635 IBA PANTHER:PTN000113601|ZFIN:ZDB-GENE-050301-2|PomBase:SPBC36B7.09|MGI:MGI:1341830|TAIR:locus:2024780|dictyBase:DDB_G0272837|UniProtKB:Q9LX30|MGI:MGI:1353449|PomBase:SPAC20G4.03c|PomBase:SPAC222.07c|PomBase:SPCC18B5.03|RGD:70883|UniProtKB:C6KTB8|WB:WBGene00006988|FB:FBgn0037327|MGI:MGI:1353448|ZFIN:ZDB-GENE-080422-1|FB:FBgn0040298|RGD:70884|UniProtKB:A0A1D8PQT9|PomBase:SPBC660.14|FB:FBgn0011737|UniProtKB:A0A0B4KHX7|UniProtKB:Q9BQI3|SGD:S000003723|SGD:S000002691|MGI:MGI:1353427|UniProtKB:Q9P2K8|UniProtKB:Q8IL26|UniProtKB:P19525|WB:WBGene00003970|MGI:MGI:103075|UniProtKB:Q9NZJ5 P protein kinase, membrane associated tyrosine/threonine 1 Myt1 protein taxon:10090 20200807 GOC
Is this an inference annotation? It might be treated as a duplicate when we try to get them into MGI (it's a PAINT annotation). I think we have a hard time loading such a long 'inferred from' field.
That is, the field gets truncated on loading, which might make it look like another annotation with fewer items in the field?
It IS in the MGI source file, it IS NOT in the GOC output file. It IS in the prediction (inference) file. I suspect that part of the processing on the GOC side is to prevent tail-eating by stripping all PAINT annotations from the MGI file and then injecting them back as part of the GOC pipeline. The predictions that are based on PAINT are being stripped, but are not reinjected. Is the pipeline stripping based on PMID?
Give me a few minutes to check the wiki
The problem is not on the MGI side.
By 'MGI source file' do you mean the one MGI supplies? If there is a PAINT annotation in it, I thought it gets stripped. Does the stripping use the PMID (Gaudet paper) or GO_Central, I wonder?
I also notice that PAINT annotations in the MGI file have the MGI reference for the PAINT paper, MGI:MGI:6201960, but this is not injected as part of the GO pipeline, so it is missing in the file provided by the GOC. I guess this is ok, but should be noted here as technically a discrepancy.
@hdrabkin, you are correct. If the pipeline used both the PMID and the provider to distinguish PAINT annotations then it could distinguish those directly from PAINT versus those from predictions based on the PAINT annotations. PAINT gets GO_Central and the predictions get GOC in the provider field. See my spreadsheet here: https://docs.google.com/spreadsheets/d/1kf9mvxMmY-zapsHQK9qfRt7O9OV0dc6Daes5TpyrEf0/edit#gid=712573202
lines 86 and 87 versus lines 110 and 111.
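As a rough illustration of that distinction, a filter could look at both the reference column and the provider column. This is only a sketch assuming standard GAF column positions (column 6 is DB:Reference, column 15 is Assigned_By); the function name is hypothetical and this is not what the pipeline currently does.

```python
PAINT_PMID = "PMID:21873635"  # the PAINT reference seen in the rows above


def classify_paint_row(line):
    """Classify one tab-separated GAF row as 'paint', 'prediction', or 'other'."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        return "other"  # header line or malformed row
    references = cols[5].split("|")  # column 6: DB:Reference
    assigned_by = cols[14]           # column 15: Assigned_By (provider)
    if PAINT_PMID not in references:
        return "other"
    if assigned_by == "GO_Central":
        return "paint"
    if assigned_by == "GOC":
        return "prediction"
    return "other"
```

Stripping only the rows classified as `paint` would leave the prediction rows in place, which is the behavior I would expect for the Pkmyt1 example.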
yep here is how MGI pulls them
So I don't see that MGI looks for GO_Central vs GOC when MGI loads them.
Yes, but this is the other direction, this is for our load. We will also need to consider the provider to distinguish the annotations that are directly from PAINT versus those from prediction. Unless of course the GO pipeline injected the prediction annotations into the main file and we took both PAINT and the predictions from it and no longer loaded the file from the products directory (hint hint).
So at MGI what we call the 'GO/CFP (Component, Function and Process)' load would be rolled into the 'GO/PAINT' load and we would get both the PAINT and prediction annotations from http://snapshot.geneontology.org/annotations/mgi.gaf.gz.
No, we have a separate load for GO/CFP: user name = "GOC", and it uses the reference/J: from the GOC input file http://snapshot.geneontology.org/products/annotations/mgi-prediction.gaf <<< we pull from here.
Right! What I'm saying is that IMO the best solution would be to roll all this into one load from the full file of mouse annotations (Noctua too). So in other words, rather than do three separate loads from the GOC (PAINT, Predictions, Noctua), we have a one-stop shop. We being MGI.
For gorule-0000061 implementation: https://github.com/biolink/ontobio/pull/533
@dougli1sqrd is gorule-0000061 implemented for the Thu Mar 4 00:01:38 PST 2021 snapshot build? QCing the WB files would suggest it's not, so I just wanted to make sure before I do any more testing. Thx.
@dougli1sqrd @kltm
The headers in the GAF2.2 files produced by the pipeline don't conform to our specs :-)
Here is what's in our spec (and most groups have been very good about this formatting in the src files):
generated-by: database listed in dbxrefs.yaml
date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM
But here is what's in the annotation file produced:
!Generated by GO Central
!
!Date Generated by GOC: 2021-03-05
Also note that the date format in the header is not the same as the date format in the annotation data (presence and absence of hyphens). We recently 'fixed' this in our file.
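A small check along these lines could catch both problems. It is a sketch only: it assumes the two header lines carry a leading "!" like the other GAF comment lines and that the generated-by value is a single token, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: header lines start with "!" like other GAF comment lines.
GENERATED_BY_HEADER = re.compile(r"^!generated-by: \S+$")
DATE_HEADER = re.compile(r"^!date-generated: \d{4}-\d{2}-\d{2}(T\d{2}:\d{2})?$")


def header_problems(header_lines):
    """Return the names of the spec headers that are missing or malformed."""
    problems = []
    if not any(GENERATED_BY_HEADER.match(h) for h in header_lines):
        problems.append("generated-by")
    if not any(DATE_HEADER.match(h) for h in header_lines):
        problems.append("date-generated")
    return problems


def annotation_date_to_header_date(yyyymmdd):
    """Convert an annotation-style date (YYYYMMDD) to header style (YYYY-MM-DD)."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").strftime("%Y-%m-%d")
```

Run against the three header lines currently produced, `header_problems` reports both `generated-by` and `date-generated` as missing.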
I noticed this morning that most of our CC annotations from Noctua will be filtered or flagged because they use the part_of relation for all CC annotations. Recently this has changed to use located_in for cellular anatomical structures and part_of for protein complexes. We will need to update all the models that were made using the former standards so that the annotations conform to the new annotation practice.
Any way this can be computationally automated?
I see this not only with MGI models, but in SynGO annotations as well.
'Any way this can be computationally automated?' I hope so. It would be a lot of stuff to do by hand.
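If it helps the discussion, the remapping itself seems simple to express. The sketch below assumes we can precompute the set of protein complex terms from the ontology (for example, GO:0032991 and its descendants); that set and the function name are hypothetical, and whether a blanket rewrite of existing models is acceptable is a curation decision, not something this code settles.

```python
def remap_cc_relation(relation, go_id, protein_complex_terms):
    """Suggest the relation for a CC annotation under the newer practice.

    protein_complex_terms is a hypothetical precomputed set of GO IDs for
    protein complexes (for example, GO:0032991 and its descendants).
    """
    if relation != "part_of":
        return relation  # only the legacy blanket 'part_of' is revisited
    if go_id in protein_complex_terms:
        return "part_of"  # protein complexes keep part_of
    return "located_in"   # cellular anatomical structures move to located_in
```

With `{"GO:0032991"}` as the complex set, a `part_of` annotation to GO:0005634 (nucleus) would be suggested as `located_in`.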
Once the Feb release goes out, switch the pipeline to consume and produce gaf 2.2 instead of 2.1.
This ticket will contain any updates and comments concerning tweaks, code updates, tests, etc. that verify the pipeline is working in GAF 2.2.
Checklist
Once we have:
Then:
master (@dougli1sqrd) (done by default in go-site and ontobio)
snapshot and release (@kltm) (done by default in go-site and ontobio)
Current outstanding blocking issues:
Test:
failed, fix in progress above