geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Update main pipeline output to produce usable GPAD/GPI 2.0 #2043

Closed kltm closed 6 months ago

kltm commented 1 year ago

This was created from a conversation with @ukemi and @sierra-moxon, making explicit an implicit task.

geneontology/gopreprocess#9

sierra-moxon commented 11 months ago

Happy to meet, I'm free this afternoon: my interpretation of @ukemi's comments is that I need to include snapshot versions of http://snapshot.geneontology.org/products/upstream_and_raw_data/paint_mgi.gpad.gz and http://snapshot.geneontology.org/products/upstream_and_raw_data/mgi-prediction.gaf in the pre-pipeline code so that it adds them to the GPAD 2.0 file it produces.

leemdi commented 11 months ago

Thanks, Sierra. Let me know when a new merged_gpad file is available. (I was out yesterday).

ukemi commented 11 months ago

Just for record keeping. I have added all of the new IEA refs to the MGI database. @pgaudet these should probably be added to the GO-REF markdown pages.

GO_REF:0000024 = MGI:7550374
GO_REF:0000036 = MGI:7550381
GO_REF:0000041 = MGI:7550382
GO_REF:0000044 = MGI:7550383
GO_REF:0000104 = MGI:7550384
GO_REF:0000107 = MGI:7550385
GO_REF:0000111 = MGI:7550387
GO_REF:0000114 = MGI:7550388
GO_REF:0000116 = MGI:7550389
GO_REF:0000117 = MGI:7550391

pgaudet commented 11 months ago

@ukemi

Are the Jxxx IDs deprecated and replaced by these MGI IDs?

leemdi commented 11 months ago

No, we have both J:xxx and MGI:xxx. I added the new J:xxx to David's list. These are new References in MGI that David added on 11/21.

GO_REF:0000024 = MGI:7550374 J:342587
GO_REF:0000036 = MGI:7550381 J:342596
GO_REF:0000041 = MGI:7550382 J:342601
GO_REF:0000044 = MGI:7550383 J:342604
GO_REF:0000104 = MGI:7550384 J:342612
GO_REF:0000107 = MGI:7550385 J:342605
GO_REF:0000111 = MGI:7550387 J:342606
GO_REF:0000114 = MGI:7550388 J:342607
GO_REF:0000116 = MGI:7550389 J:342608
GO_REF:0000117 = MGI:7550391 J:342609

pgaudet commented 11 months ago

So the pipeline will export all 3 references for an annotation using any of these GO_REFs?

leemdi commented 11 months ago

which "pipeline" are you referring to? GOC or MGI?

pgaudet commented 11 months ago

GOC (I thought GOC was going to be replacing the MGI pipeline, at some point)

leemdi commented 11 months ago

then that is a question for Sierra. I think just GO_REF will be in the GOC/pipeline that is going to generate the mgd.gpad. MGI will then use the new mgd.gpad and use that file to load the data into the MGI/database. Does that answer your question?

pgaudet commented 11 months ago

Yes! Thanks

leemdi commented 11 months ago

For documentation purposes:

GOC generates mgd.gpad -> MGI picks up this file -> MGI-pipeline runs

The MGI-pipeline will run some MGI-sanity checks:
• If the MGI-sanity check finds GO_REF or PubMed ids that are not in MGI, then David/Li will add them to MGI.
• If the MGI-sanity check finds UBERON ids that cannot be matched to MGI/EMAPA, then Terry will work on these.
• The MGI-sanity check reports GO ids that are obsolete in MGI.
• The MGI-sanity check reports duplicates that we do not want to load into MGI.
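The checks listed above could be sketched roughly as follows. This is only an illustration, not the MGI pipeline code: the function name `sanity_check`, the tuple layout, and the in-memory sets standing in for MGI database lookups are all assumptions. The UBERON/EMAPA matching step is left out because nothing in this thread describes how it works.

```python
from collections import Counter


def sanity_check(rows, known_refs, obsolete_go_ids):
    """Hypothetical sketch of the MGI-side checks; not the real MGI pipeline.

    rows: list of (object_id, go_id, reference) tuples taken from mgd.gpad
    known_refs: set of references (GO_REF or PMID) already in MGI
    obsolete_go_ids: set of GO ids that MGI considers obsolete
    """
    return {
        # references that would need to be added to MGI
        "missing_refs": sorted({ref for _, _, ref in rows if ref not in known_refs}),
        # GO ids that are obsolete in MGI
        "obsolete_go": sorted({go for _, go, _ in rows if go in obsolete_go_ids}),
        # rows that occur more than once and should not be loaded twice
        "duplicates": sorted(row for row, n in Counter(rows).items() if n > 1),
    }
```

Each key in the returned report corresponds to one of the checks, so a reviewer could route `missing_refs` to whoever adds references and `duplicates` to whoever owns the load.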

leemdi commented 11 months ago

@sierra-moxon not a big deal, but I found these typos in the mgd.gpad. This should all be: UniProtKB, I think?

UniprotkB UniprotKB UniProtKb UniPRotKB UniPROtKB UNiProtKB UnIProtKB
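For illustration, a case-insensitive rewrite would catch all of the variants above. This is a minimal sketch that assumes the variants appear as CURIE prefixes (followed by a colon); the function name `normalize_uniprot_prefix` is made up here, and where such a fix should live (upstream, import step, or GO Rules) is a separate question.

```python
import re

# Matches "UniProtKB:" in any capitalisation when used as a CURIE prefix.
_UNIPROT_PREFIX = re.compile(r"\buniprotkb:", re.IGNORECASE)


def normalize_uniprot_prefix(text):
    """Rewrite any case variant of the 'UniProtKB:' prefix to the canonical form."""
    return _UNIPROT_PREFIX.sub("UniProtKB:", text)
```

Because the pattern requires the colon directly after the name, a prefix such as `UniProtKB-KW:` is left untouched.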

kltm commented 11 months ago

@leemdi @sierra-moxon What is the source for these? While we would attempt to "fix these", that would likely fall under the GO Rules checks rather than the import step (unless this is being introduced at our end). Ideally, there would be feedback upstream to get them dealt with; otherwise, this falls to GORULE:0000027 (also see https://github.com/geneontology/go-site/issues/1218).

leemdi commented 11 months ago

@sierra-moxon will there be a new sierra-file before Thursday's meeting?

sierra-moxon commented 11 months ago

My next step is to produce the GPAD I gave you, directly from the pipeline test I have running (to address Seth's comments about passing the newly generated files through the GORules checks, etc. before they end up on your plate). I will do my best to have that before Thursday.

ukemi commented 11 months ago

@sierra-moxon and @leemdi I have spent a good part of the day looking at the load that @leemdi did with the file. Things are looking pretty good on this end. In particular, with the switch of one of the GO_REFS for the SPKW load, we seem to have picked up that set of annotations now. I have focused a bit on the genes that lost annotations between our production database and the test load. Some results are linked below, but the bottom line is I am already seeing trends. Some of these will disappear when the PAINT/IBA annotations are included in @sierra-moxon 's new file. For the 'real' differences, it looks like there are some annotations missing that would be derived from the isoform file. However, the pipeline is behaving exactly as we had designed it because those annotations are from identifiers that we would not necessarily consider to be isoforms and are not in our GPI file. If we want to pick those up, the isoforms should be curated into PRO and become official annotatable objects in MGI.

https://docs.google.com/spreadsheets/d/1emrtGj2IwYSrq2_PUEMicj95hC25SsHhWuRzp_jTxPE/edit#gid=0

ukemi commented 11 months ago

I'm still seeing issues with the providers when I look at @leemdi's load. It would be easiest to look at this together on Thursday, but wanted to make a note of it.

ukemi commented 11 months ago

Also note that because of a 1->2 mapping of GO-REF with MGI ref, all of the annotations from the ISO loads, both human and rat, are being attributed to the rat load. We need to fix this as we work with GO_REF 96. We should either split that ref or @leemdi should put a step in our load to distinguish rat from human. Let's discuss on Thursday.

leemdi commented 11 months ago

My two-cents : this should be done at GOC, not at MGI. The point was for MGI to pick up one file and do very simple sanity checks. I don't think MGI should have to make the human vs rat decision. There should be a 1-1 mapping of GO-REF with MGI references.
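Whichever side ends up owning the fix, a quick check for mappings that are not 1-1 might look like the sketch below. The function name and the `(go_ref, mgi_ref)` pair layout are assumptions for illustration; the real mapping lives in the MGI database.

```python
from collections import defaultdict


def non_one_to_one(pairs):
    """Return (GO_REFs with >1 MGI ref, MGI refs with >1 GO_REF).

    pairs: iterable of (go_ref, mgi_ref) tuples (hypothetical input layout).
    """
    by_go_ref = defaultdict(set)
    by_mgi_ref = defaultdict(set)
    for go_ref, mgi_ref in pairs:
        by_go_ref[go_ref].add(mgi_ref)
        by_mgi_ref[mgi_ref].add(go_ref)
    return (
        {k: sorted(v) for k, v in by_go_ref.items() if len(v) > 1},
        {k: sorted(v) for k, v in by_mgi_ref.items() if len(v) > 1},
    )
```

An empty result in both dictionaries would mean the mapping is 1-1 in both directions.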

leemdi commented 11 months ago

is this where I should go to grab the MGI gpad file, eventually...

http://current.geneontology.org/annotations/mgi.gpad.gz

thanks.

sierra-moxon commented 11 months ago

> Also note that because of a 1->2 mapping of GO-REF with MGI ref, all of the annotations from the ISO loads, both human and rat, are being attributed to the rat load. We need to fix this as we work with GO_REF 96. We should either split that ref or @leemdi should put a step in our load to distinguish rat from human. Let's discuss on Thursday.

I can definitely change the pub for each ISO load.

leemdi commented 11 months ago

While testing the MGI/public pipeline, we found this bad URL. If we find a new URL, we will let you know.

Per David: UniPathway is a project that is no longer being developed, but I didn’t realize the URL had gone stale. I will investigate.

https://github.com/geneontology/

ukemi commented 11 months ago

Ticket about UniPathway: https://github.com/geneontology/go-site/issues/2208

ukemi commented 11 months ago

Summary of today's call:

leemdi commented 10 months ago

side question:

which GO OBO file should we be using?

http://purl.obolibrary.org/obo/go.obo
http://purl.obolibrary.org/obo/go/snapshot/go-basic.obo

what is the difference? MGI is using the snapshot. But go.obo seems more current. why are there 2?

thanks.

ok, I found this, which I guess sort of answers my question.
https://geneontology.org/docs/download-ontology/

kltm commented 10 months ago

@leemdi Theoretically, there should be a high-frequency snapshot release (latest and greatest) and release (slow and qc'ed). Due to some somewhat recent issues we have with the pipeline (https://github.com/geneontology/pipeline/issues/316 https://github.com/geneontology/pipeline/issues/349), snapshot has fallen behind. We are currently working on fixes.

leemdi commented 10 months ago

This is LI's new meeting;

Join Zoom Meeting https://jacksonlab.zoom.us/j/89272071900?pwd=enI4dmZKYWF2VTZ5OC9OR21XQVVHZz09

leemdi commented 10 months ago

We were here:

This is LI's new meeting;

Join Zoom Meeting https://jacksonlab.zoom.us/j/89272071900?pwd=enI4dmZKYWF2VTZ5OC9OR21XQVVHZz09

I had sent this info here:

https://github.com/geneontology/go-site/issues/2043

We met and I have Sierra's new file and am trying to process it.

Lori

kltm commented 10 months ago

@leemdi Just to clarify, all meetings with GO people need to be added to the GO calendar--we generally should not coordinate stuff like this on the tracker as the information doesn't propagate. If you need edit access, please let me know.

leemdi commented 10 months ago

@LiNiMGI Li, do you know how to add a meeting to the GO calendar? Make sure you add next week's Wednesday meeting.

kltm commented 10 months ago

@leemdi @LiNiMGI I'm seeing that there could be permission problems here. Shall I add you both with edit permissions and, if so, I'd be using your gmail (unless MGI is Googleverse)?

LiNiMGI commented 10 months ago

Thanks @kltm , yes, please add me with edit permission.

leemdi commented 10 months ago

I have run Sierra's new mgi.gpad.gz file on my test area.

I have been using a MGI/GO QC report to compare: # of MGI Markers w/out GO Annotations

MGI Production: 268
MGI/Scrum (last run): 713
New Sierra file: 561

getting closer.
We have about 34 missing PubMed ids in MGI Production that Li will have to add to our Lit Triage pipeline. And we are missing GOREF:0000033 as well. Once we get these References added to MGI, I will run again and I expect the numbers to get better.

Many thanks.

LiNiMGI commented 10 months ago

Summary of today’s call:
• @sierra-moxon has a new file that passed the pipeline run.
• @sierra-moxon will fix the IDs in column 1, the providers, etc.
• @leemdi will try to process it; MGI is going to do some tests.
• We will meet again next Wednesday at 3:00pm.

Links to the new files:
https://build.geneontology.org/job/geneontology/job/pipeline/job/full-issue-325-gopreprocess/
http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/

kltm commented 10 months ago

@LiNiMGI I believe you should now have edit permissions to the GO Calendar at your jax address.

leemdi commented 10 months ago

Issue with GOREF:0000033 is that it should be GO_REF:0000033

GOREF vs GO_REF.

So, this is an issue in mgi.gpad or whatever feeds it.

this affects 68984 annotations.

So, I can work around this in my testing, but this GOREF needs to be fixed -> GO_REF
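As a stopgap while testing, the rewrite amounts to something like the sketch below. The function name `fix_go_ref` is made up for illustration; the real fix belongs wherever the bad prefix is introduced.

```python
import re

# "GOREF:" followed by digits; the correct "GO_REF:" form does not match.
_BAD_GO_REF = re.compile(r"\bGOREF:(\d+)")


def fix_go_ref(line):
    """Rewrite the malformed 'GOREF:' prefix to 'GO_REF:' in one line of text."""
    return _BAD_GO_REF.sub(r"GO_REF:\1", line)
```

Lines that already use `GO_REF:` pass through unchanged, so the function is safe to apply to a whole file.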

leemdi commented 10 months ago

Fixing GOREF -> GO_REF in MGI so I can continue testing

MGI Production: 268
MGI/Scrum (last run): 713
New Sierra file: 217

which means the new Sierra file brought in more Genes with GO Annotations than in Production, which I assume is a good thing.

leemdi commented 10 months ago

MGI:2672974 Defb39, defensin beta 39, Chr 8

on MGI/Production: GO:0005576 UniProtKB-KW:KW-0964 this comes from our uniprot load

on MGI/Scrum (Lori's test area), I am not seeing the UniProt annotations that I normally see.

this GO id is not in Sierra's file:

MGI:MGI:2672974 RO:0002331 GO:0002227 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002327 GO:0031731 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002331 GO:0050830 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:Q6QLQ9|UniProtKB:Q6IV18|UniProtKB:P60022 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002331 GO:0050829 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022|UniProtKB:Q6IV18|UniProtKB:Q6QLQ9 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002432 GO:0005615 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022|UniProtKB:Q6QLQ9 2020-06-25 GO_Central

leemdi commented 10 months ago

@sierra-moxon

These are the unique field 10/assigned by, so far:

GO_Central: 213,130
MGI: 312,399
SynGO: 10,310
UniProt: 419
WB: 5
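A tally like the one above can be reproduced with a few lines of Python. This is a sketch, assuming the file is tab-separated with header lines starting with `!`, and that field 10 is assigned-by as stated above; the function name `count_assigned_by` is made up for illustration.

```python
from collections import Counter


def count_assigned_by(lines, column=10):
    """Tally the values in a 1-based column of tab-separated GPAD lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column:
            # strip stray whitespace so "UniProt " and "UniProt" count together
            counts[fields[column - 1].strip()] += 1
    return counts
```

Passing an open file handle (for example from `gzip.open(path, "rt")`) as `lines` works, since the function only iterates over it.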

leemdi commented 10 months ago

@sierra-moxon @kltm @LiNiMGI @ukemi

Remaining issues:

  1. field 1 issues: check 12/21 noon/EST

PR:MGI:MGI: -> MGI:MGI:
PR:PR: -> PR:

  2. field 10/provider/assigned by: check 12/21 noon/EST
Rules are here: https://docs.google.com/spreadsheets/d/1Jyvd8Ct8ZvZpwDTTV01ZzyQezluziHS38JZgSexP5uU/edit#gid=0

  3. missing meta-data from noctua model; this is really important for MGI curators:

this can be done but Sierra needs to talk to Seth, etc.

for annotations from noctua models, field 12 is currently empty, but should contain: contributor noctua-model-id model-state (see the current Noctua GPAD1.2 files in the products directory)

  4. GOA-Mouse Isoform annotations are still missing: check 12/21 noon/EST

For the isoform file, we should be checking the identifiers against the MGI GPI file and taking the ones that are there. A few examples we are catching in our report:
UniProtKB:A2ASQ1-2 Agrn
UniProtKB:D3YX90 Adamts17
UniProtKB:E9PYV8 Adamts9
UniProtKB:F7AAP4 Atp2b4
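The field 1 fix in the first item can be illustrated with a small sketch. The function name `fix_object_id` is hypothetical, and this only covers the two doubled-prefix patterns listed above, not any other field 1 problem.

```python
def fix_object_id(object_id):
    """Collapse the doubled prefixes reported for field 1 (sketch only).

    PR:MGI:MGI:... -> MGI:MGI:...
    PR:PR:...      -> PR:...
    """
    for doubled in ("PR:MGI:MGI:", "PR:PR:"):
        if object_id.startswith(doubled):
            # both cases are repaired by dropping the extra leading "PR:"
            return object_id[len("PR:"):]
    return object_id
```

Identifiers that are already well formed, such as `MGI:MGI:87961` or `PR:A2ASQ1`, are returned unchanged.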

sierra-moxon commented 10 months ago

from GOA isoform file:

UniProtKB       A2ASQ1  Agrn    enables GO:0005201      PMID:22159717   RCA             F       Agrin   Agrn|Agrin      protein taxon:10090     20180725        BHF-UCL occurs_in(UBERON:0002048)       UniProtKB:A2ASQ1-2
UniProtKB       A2ASQ1  Agrn    located_in      GO:0062023      PMID:22159717   HDA             C       Agrin   Agrn|Agrin      protein taxon:10090     20180725        BHF-UCL part_of(UBERON:0002048) UniProtKB:A2ASQ1-2

matches rows in MGI gpi:

MGI:MGI:87961   Agrn    agrin   NMF380|Agrin|nmf380     SO:0001217      NCBITaxon:10090                         UniProtKB:A2ASQ1   

and

PR:A2ASQ1       mAGRN   agrin (mouse)   mAGRN   PR:000000001    NCBITaxon:10090 MGI:MGI:87961                   UniProtKB:A2ASQ1 

so that is why we have the non -2 version of this row in the output GAF files.

ukemi commented 10 months ago

Yes, this is due to the conflation of genes and proteins. The UniProt identifier can refer to both. Specifically for the isoform file, the annotations included should only represent proteins, so the mapping to the gene identifier (note the MGI id in column 1):

MGI:MGI:87961 Agrn agrin NMF380|Agrin|nmf380 SO:0001217 NCBITaxon:10090

should be ignored, and the mapping to the protein identifier:

PR:A2ASQ1 mAGRN agrin (mouse) mAGRN PR:000000001 NCBITaxon:10090 MGI:MGI:87961

should be used as the annotation object.

ukemi commented 10 months ago

PS. It would be really nice if we could cleanly disentangle proteins from genes in the UniProt annotations.

ukemi commented 10 months ago

PPS. The converse is true for the non-isoform mouse file. It should only be used to generate annotations to MGI genes (MGI:MGI:#####).
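The rule described in the last few comments (protein objects for the isoform file, gene objects for the non-isoform file) could be sketched like this. The function name, the pre-parsed `(db_object_id, xrefs)` input, and the prefix test are assumptions for illustration; this is not the gopreprocess code, and it does not address how isoform-suffixed identifiers such as `UniProtKB:A2ASQ1-2` are matched against the GPI.

```python
def pick_annotation_object(uniprot_id, gpi_entries, isoform_file):
    """Pick the GPI objects a UniProtKB id should be annotated to.

    gpi_entries: iterable of (db_object_id, xrefs) pairs, already parsed
                 from the MGI GPI file (parsing is out of scope here).
    isoform_file: True for the GOA isoform file (use PR: protein objects),
                  False for the non-isoform file (use MGI:MGI: gene objects).
    """
    wanted_prefix = "PR:" if isoform_file else "MGI:MGI:"
    return [
        obj_id
        for obj_id, xrefs in gpi_entries
        if uniprot_id in xrefs and obj_id.startswith(wanted_prefix)
    ]
```

With the two Agrn rows from the GPI shown earlier, the isoform file would resolve `UniProtKB:A2ASQ1` to `PR:A2ASQ1`, and the non-isoform file would resolve it to `MGI:MGI:87961`.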

ukemi commented 10 months ago

@sierra-moxon @LiNiMGI Yesterday you showed a report that contained a bunch of errors where MGI annotations were failing GO-rule 1. But when I go and look at the live report, I don't see them: http://snapshot.geneontology.org/reports/gorule-report.html

leemdi commented 10 months ago

FYI, this problem finally caught up with us at MGI/Production on Dec 17/18.

See earlier comments:

Issue with GOREF:0000033 is that it should be GO_REF:0000033

GOREF vs GO_REF.

So, this is an issue in mgi.gpad or whatever feeds it.

this affects 68984 annotations.

So, I can work around this in my testing, but this GOREF needs to be fixed -> GO_REF

ukemi commented 10 months ago

Ticket to fix this is here: https://github.com/geneontology/go-site/issues/2185#issuecomment-1860864834

sierra-moxon commented 10 months ago

I re-ran the pipeline on Dec 20, to show fixes to the issues above, and though it completed successfully, when I went searching for files this morning, I could not find them. I am re-running today.

leemdi commented 10 months ago

@sierra-moxon

http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/ The requested URL was not found on this server.

leemdi commented 10 months ago

@sierra-moxon as of 6:53AM/EST, still cannot access URL