Closed kltm closed 6 months ago
Happy to meet, I'm free this afternoon: my interpretation of @ukemi's comments is that I need to include snapshot
versions of http://snapshot.geneontology.org/products/upstream_and_raw_data/paint_mgi.gpad.gz and http://snapshot.geneontology.org/products/upstream_and_raw_data/mgi-prediction.gaf in the pre-pipeline code so that it adds them to the GPAD 2.0 file it produces.
Thanks, Sierra. Let me know when a new merged_gpad file is available. (I was out yesterday).
Just for record keeping. I have added all of the new IEA refs to the MGI database. @pgaudet these should probably be added to the GO-REF markdown pages.
GO_REF:0000024 = MGI:7550374
GO_REF:0000036 = MGI:7550381
GO_REF:0000041 = MGI:7550382
GO_REF:0000044 = MGI:7550383
GO_REF:0000104 = MGI:7550384
GO_REF:0000107 = MGI:7550385
GO_REF:0000111 = MGI:7550387
GO_REF:0000114 = MGI:7550388
GO_REF:0000116 = MGI:7550389
GO_REF:0000117 = MGI:7550391
@ukemi
Are the Jxxx IDs deprecated and replaced by these MGI IDs?
No, we have both J:xxx and MGI:xxx. I added the new J:xxx to David's list. These are new References in MGI that David added on 11/21.
GO_REF:0000024 = MGI:7550374 J:342587
GO_REF:0000036 = MGI:7550381 J:342596
GO_REF:0000041 = MGI:7550382 J:342601
GO_REF:0000044 = MGI:7550383 J:342604
GO_REF:0000104 = MGI:7550384 J:342612
GO_REF:0000107 = MGI:7550385 J:342605
GO_REF:0000111 = MGI:7550387 J:342606
GO_REF:0000114 = MGI:7550388 J:342607
GO_REF:0000116 = MGI:7550389 J:342608
GO_REF:0000117 = MGI:7550391 J:342609
So the pipeline will export all 3 references for an annotation using any of these GO_REFs?
which "pipeline" are you referring to? GOC or MGI?
GOC (I thought GOC was going to be replacing the MGI pipeline, at some point)
then that is a question for Sierra. I think just GO_REF will be in the GOC/pipeline that is going to generate the mgd.gpad. MGI will then use the new mgd.gpad and use that file to load the data into the MGI/database. Does that answer your question?
Yes! Thanks
For documentation purposes:
GOC generates mgd.gpad -> MGI picks up this file -> MGI-pipeline runs
The MGI-pipeline will run some MGI-sanity checks:
• If the MGI-sanity check finds GO_REF or PubMed ids that are not in MGI, then David/Li will add them to MGI.
• If the MGI-sanity check finds UBERON ids that cannot be matched to MGI/EMAPA, then Terry will work on these.
• The MGI-sanity check reports GO ids that are obsolete in MGI.
• The MGI-sanity check reports duplicates that we do not want to load into MGI.
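A minimal sketch of what these sanity checks could look like, not the actual MGI-pipeline code. The function name `sanity_check`, the simplified `(object_id, go_id, reference)` row shape, and the two lookup sets are all assumptions made for illustration; the UBERON/EMAPA check is left out because the matching rules are not described here.

```python
from collections import Counter

def sanity_check(rows, known_refs, obsolete_go_ids):
    """Report problems in simplified annotation rows.

    rows: iterable of (object_id, go_id, reference) tuples (hypothetical shape).
    known_refs: set of reference ids already present in the database.
    obsolete_go_ids: set of GO ids considered obsolete.
    """
    counts = Counter(rows)
    report = {"missing_refs": set(), "obsolete_go": set(), "duplicates": []}
    for object_id, go_id, reference in counts:
        # References (GO_REF or PubMed ids) that would need to be added first.
        if reference not in known_refs:
            report["missing_refs"].add(reference)
        # GO ids that are obsolete and should be reported.
        if go_id in obsolete_go_ids:
            report["obsolete_go"].add(go_id)
    # Rows that occur more than once and should not be loaded twice.
    report["duplicates"] = sorted(row for row, n in counts.items() if n > 1)
    return report
```

Each key in the returned report corresponds to one of the checks listed above, so the missing references could be handed to whoever adds them and the rest reported.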
@sierra-moxon not a big deal, but I found these typos in the mgd.gpad. This should all be: UniProtKB, I think?
UniprotkB UniprotKB UniProtKb UniPRotKB UniPROtKB UNiProtKB UnIProtKB
@leemdi @sierra-moxon What is the source for these? While we would attempt to "fix these", that would likely fall under the GO Rules checks, rather than the import step (unless this is being introduced at our end). Ideally, there would be feedback upstream to get them dealt with; otherwise they fall to GORULE:0000027 (also see https://github.com/geneontology/go-site/issues/1218).
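For illustration only, a small sketch of how mis-cased prefixes like the ones listed above could be detected and repaired. This is not how the GO Rules checks are implemented; `fix_prefix_case` and `CANONICAL_PREFIXES` are hypothetical names, and the sketch only compares the prefix case-insensitively against a list of known-good spellings.

```python
# Known-good prefix spellings; only UniProtKB is listed here (assumption).
CANONICAL_PREFIXES = ("UniProtKB",)

def fix_prefix_case(curie, canonical=CANONICAL_PREFIXES):
    """Return the CURIE with a mis-cased prefix replaced by its canonical form."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # not a prefixed identifier, leave untouched
    for good in canonical:
        if prefix.lower() == good.lower():
            return f"{good}:{local_id}"
    return curie  # unknown prefix, leave untouched

variants = ["UniprotkB", "UniprotKB", "UniProtKb", "UniPRotKB",
            "UniPROtKB", "UNiProtKB", "UnIProtKB"]
fixed = {fix_prefix_case(f"{v}:P60022") for v in variants}
```

Identifiers with other prefixes, such as MGI:MGI:87961 or UniProtKB-KW:KW-0964, pass through unchanged.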
@sierra-moxon will there be a new sierra-file before Thursday's meeting?
My next step is to produce the GPAD I gave you, directly from the pipeline test I have running (to address Seth's comments about passing the newly generated files through the GORules checks, etc. before they end up on your plate). I will do my best to have that before Thursday.
@sierra-moxon and @leemdi I have spent a good part of the day looking at the load that @leemdi did with the file. Things are looking pretty good on this end. In particular, with the switch of one of the GO_REFS for the SPKW load, we seem to have picked up that set of annotations now. I have focused a bit on the genes that lost annotations between our production database and the test load. Some results are linked below, but the bottom line is I am already seeing trends. Some of these will disappear when the PAINT/IBA annotations are included in @sierra-moxon 's new file. For the 'real' differences, it looks like there are some annotations missing that would be derived from the isoform file. However, the pipeline is behaving exactly as we had designed it because those annotations are from identifiers that we would not necessarily consider to be isoforms and are not in our GPI file. If we want to pick those up, the isoforms should be curated into PRO and become official annotatable objects in MGI.
https://docs.google.com/spreadsheets/d/1emrtGj2IwYSrq2_PUEMicj95hC25SsHhWuRzp_jTxPE/edit#gid=0
I'm still seeing issues with the providers when I look at @leemdi's load. It would be easiest to look at this together on Thursday, but wanted to make a note of it.
Also note that because of a 1->2 mapping of GO-REF with MGI ref, all of the annotations from the ISO loads, both human and rat, are being attributed to the rat load. We need to fix this as we work with GO_REF 96. We should either split that ref or @leemdi should put a step in our load to distinguish rat from human. Let's discuss on Thursday.
My two-cents : this should be done at GOC, not at MGI. The point was for MGI to pick up one file and do very simple sanity checks. I don't think MGI should have to make the human vs rat decision. There should be a 1-1 mapping of GO-REF with MGI references.
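A short sketch of how a 1-1 mapping could be verified, wherever the check ends up living. The function name and the input shape (pairs of GO_REF id and MGI reference id) are assumptions, and the MGI ids used for GO_REF 96 in the example are placeholders, not real references.

```python
from collections import defaultdict

def non_unique_mappings(pairs):
    """Return GO_REF ids that map to more than one MGI reference."""
    by_go_ref = defaultdict(set)
    for go_ref, mgi_ref in pairs:
        by_go_ref[go_ref].add(mgi_ref)
    return {g: sorted(m) for g, m in by_go_ref.items() if len(m) > 1}

pairs = [
    ("GO_REF:0000024", "MGI:7550374"),
    # Placeholder ids standing in for the rat and human ISO load references.
    ("GO_REF:0000096", "MGI:hypothetical-rat"),
    ("GO_REF:0000096", "MGI:hypothetical-human"),
]
problems = non_unique_mappings(pairs)
```

An empty result would mean every GO_REF maps to exactly one MGI reference.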
is this where I should go to grab the MGI gpad file, eventually...
http://current.geneontology.org/annotations/mgi.gpad.gz
thanks.
Also note that because of a 1->2 mapping of GO-REF with MGI ref, all of the annotations from the ISO loads, both human and rat, are being attributed to the rat load. We need to fix this as we work with GO_REF 96. We should either split that ref or @leemdi should put a step in our load to distinguish rat from human. Let's discuss on Thursday.
I can definitely change the pub for each ISO load.
While testing the MGI/public pipeline, we found this bad URL. If we find a new URL, we will let you know.
Per David: UniPathway is a project that is no longer being developed, but I didn't realize the URL had gone stale. I will investigate.
https://github.com/geneontology/
Ticket about UniPathway: https://github.com/geneontology/go-site/issues/2208
Summary of today's call:
side question:
which GO OBO file should we be using?
• http://purl.obolibrary.org/obo/go.obo
• http://purl.obolibrary.org/obo/go/snapshot/go-basic.obo
what is the difference? MGI is using the snapshot. But go.obo seems more current. why are there 2?
thanks.
ok, I found this, which I guess sort of answers my question.
https://geneontology.org/docs/download-ontology/
@leemdi Theoretically, there should be a high-frequency snapshot release (latest and greatest) and release (slow and qc'ed). Due to some somewhat recent issues we have with the pipeline (https://github.com/geneontology/pipeline/issues/316 https://github.com/geneontology/pipeline/issues/349), snapshot has fallen behind. We are currently working on fixes.
This is Li's new meeting:
Join Zoom Meeting https://jacksonlab.zoom.us/j/89272071900?pwd=enI4dmZKYWF2VTZ5OC9OR21XQVVHZz09
We were here:
This is Li's new meeting:
Join Zoom Meeting https://jacksonlab.zoom.us/j/89272071900?pwd=enI4dmZKYWF2VTZ5OC9OR21XQVVHZz09
I had sent this info here:
https://github.com/geneontology/go-site/issues/2043
We met and I have Sierra's new file and am trying to process it.
Lori
@leemdi Just to clarify, all meetings with GO people need to be added to the GO calendar--we generally should not coordinate stuff like this on the tracker as the information doesn't propagate. If you need edit access, please let me know.
@LiNiMGI Li, do you know how to add meeting to the GO calendar? make sure you add next week's Wednesday meeting.
@leemdi @LiNiMGI I'm seeing that there could be permission problems here. Shall I add you both with edit permissions and, if so, I'd be using your gmail (unless MGI is Googleverse)?
Thanks @kltm , yes, please add me with edit permission.
I have run Sierra's new mgi.gpad.gz file on my test area.
I have been using an MGI/GO QC report to compare the # of MGI Markers w/out GO Annotations:
MGI Production: 268
MGI/Scrum (last run): 713
New Sierra file: 561
getting closer.
We have about 34 missing PubMed ids in MGI Production that Li will have to add to our Lit Triage pipeline.
And we are missing GOREF:0000033 as well.
Once we get these References added to MGI, I will run again and I expect the numbers to get better.
Many thanks.
Summary of today's call
• @sierra-moxon has a new file that passed the pipeline run.
• @sierra-moxon will fix the IDs in column 1 and the providers etc.
• @leemdi will try to process it; MGI is going to do some testing.
• We will meet again next Wednesday at 3:00pm
Links to the new files:
https://build.geneontology.org/job/geneontology/job/pipeline/job/full-issue-325-gopreprocess/
http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/
@LiNiMGI I believe you should now have edit permissions to the GO Calendar at your jax address.
Issue with GOREF:0000033 is that it should be GO_REF:0000033
GOREF vs GO_REF.
So, this is an issue in mgi.gpad or whatever feeds it.
this affects 68984 annotations.
So, I can work around this in my testing, but this GOREF needs to be fixed -> GO_REF
Fixing GOREF -> GO_REF in MGI so I can continue testing
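As an illustration of the kind of workaround described here, a minimal sketch that rewrites the malformed prefix in a line of text. The name `normalize_go_ref` and the assumption that the identifier is always seven digits are mine; this is not the code used in the MGI load.

```python
import re

# Matches the malformed prefix followed by a seven-digit id (assumed width).
_GOREF = re.compile(r"\bGOREF:(\d{7})\b")

def normalize_go_ref(line):
    """Rewrite GOREF:nnnnnnn to GO_REF:nnnnnnn; other text is left alone."""
    return _GOREF.sub(r"GO_REF:\1", line)

fixed = normalize_go_ref("GO:0002227\tGOREF:0000033\tECO:0000318")
```

Already-correct GO_REF identifiers do not match the pattern, so running the function twice gives the same result as running it once.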
MGI Production: 268
MGI/Scrum (last run): 713
New Sierra file: 217
which means the new Sierra file brought in more Genes with GO Annotations than in Production, which I assume is a good thing.
MGI:2672974 Defb39, defensin beta 39, Chr 8
on MGI/Production: GO:0005576 UniProtKB-KW:KW-0964 this comes from our uniprot load
on MGI/Scrum (Lori's test area), I am not seeing the UniProt annotations that I normally see.
this GO id is not in Sierra's file:
MGI:MGI:2672974 RO:0002331 GO:0002227 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002327 GO:0031731 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002331 GO:0050830 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:Q6QLQ9|UniProtKB:Q6IV18|UniProtKB:P60022 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002331 GO:0050829 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022|UniProtKB:Q6IV18|UniProtKB:Q6QLQ9 2020-06-25 GO_Central
MGI:MGI:2672974 RO:0002432 GO:0005615 GOREF:0000033 ECO:0000318 PANTHER:PTN000483422|UniProtKB:P60022|UniProtKB:Q6QLQ9 2020-06-25 GO_Central
@sierra-moxon
These are the unique field 10/assigned by, so far:
GO_Central: 213,130
MGI: 312,399
SynGO: 10,310
UniProt: 419
WB: 5
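A tally like this can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes a tab-separated GPAD file in which the assigned-by value is field 10 and header lines start with "!".

```python
from collections import Counter

def count_assigned_by(lines, column=10):
    """Count the values found in the given 1-based column of GPAD lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column:
            counts[fields[column - 1].strip()] += 1
    return counts

# Tiny in-memory example; a real run would iterate over the opened file.
row = ["MGI:MGI:2672974", "", "RO:0002331", "GO:0002227", "GO_REF:0000033",
       "ECO:0000318", "PANTHER:PTN000483422", "", "2020-06-25", "GO_Central", "", ""]
example = ["!gpa-version: 2.0\n", "\t".join(row) + "\n"]
counts = count_assigned_by(example)
```

Stripping whitespace around the value keeps a provider written with stray spaces from being counted as a separate provider.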
@sierra-moxon @kltm @LiNiMGI @ukemi
Remaining issues:
PR:MGI:MGI: -> MGI:MGI:
PR:PR: -> PR:
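The two rewrites above amount to dropping a redundant leading "PR:". A minimal sketch, assuming the fix is applied to the column 1 identifier as a string; `fix_object_id` is a hypothetical name and the example ids are only there to exercise the function.

```python
def fix_object_id(object_id):
    """Drop the redundant leading 'PR:' from doubled prefixes."""
    for bad_prefix in ("PR:MGI:MGI:", "PR:PR:"):
        if object_id.startswith(bad_prefix):
            return object_id[len("PR:"):]
    return object_id  # already well-formed

examples = [fix_object_id(i) for i in
            ("PR:MGI:MGI:87961", "PR:PR:A2ASQ1", "PR:A2ASQ1", "MGI:MGI:87961")]
```

Identifiers that are already well-formed are returned unchanged, so the function is safe to run over every row.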
field 10/provider/assigned by: check 12/21 noon/EST. Rules are here: https://docs.google.com/spreadsheets/d/1Jyvd8Ct8ZvZpwDTTV01ZzyQezluziHS38JZgSexP5uU/edit#gid=0
missing meta-data from noctua model; this is really important for MGI curators :
this can be done but Sierra needs to talk to Seth, etc.
for annotations from noctua models, field 12 is currently empty, but should contain: contributor noctua-model-id model-state (see the current Noctua GPAD1.2 files in the products directory)
For the isoform file, we should be checking the identifiers against the MGI GPI file and taking the ones that are there. A few examples we are catching in our report:
UniProtKB:A2ASQ1-2 Agrn
UniProtKB:D3YX90 Adamts17
UniProtKB:E9PYV8 Adamts9
UniProtKB:F7AAP4 Atp2b4
from GOA isoform file:
UniProtKB A2ASQ1 Agrn enables GO:0005201 PMID:22159717 RCA F Agrin Agrn|Agrin protein taxon:10090 20180725 BHF-UCL occurs_in(UBERON:0002048) UniProtKB:A2ASQ1-2
UniProtKB A2ASQ1 Agrn located_in GO:0062023 PMID:22159717 HDA C Agrin Agrn|Agrin protein taxon:10090 20180725 BHF-UCL part_of(UBERON:0002048) UniProtKB:A2ASQ1-2
matches rows in MGI gpi:
MGI:MGI:87961 Agrn agrin NMF380|Agrin|nmf380 SO:0001217 NCBITaxon:10090 UniProtKB:A2ASQ1
and
PR:A2ASQ1 mAGRN agrin (mouse) mAGRN PR:000000001 NCBITaxon:10090 MGI:MGI:87961 UniProtKB:A2ASQ1
so that is why we have the non -2 version of this row in the output GAF files.
Yes, this is due to the conflation of genes and proteins. The UniProt identifier can refer to both. Specifically for the isoform file, the annotations included should only represent proteins, so the mapping to the gene identifier:
MGI:MGI:87961 Agrn agrin NMF380|Agrin|nmf380 SO:0001217 NCBITaxon:10090
(note the MGI id in column 1) should be ignored and the mapping to the protein identifier:
PR:A2ASQ1 mAGRN agrin (mouse) mAGRN PR:000000001 NCBITaxon:10090 MGI:MGI:87961
should be used as the annotation object.
PS. It would be really nice if we could cleanly disentangle proteins from genes in the UniProt annotations.
PPS. The converse is true for the non-isoform mouse file. It should only be used to generate annotations to MGI genes (MGI:MGI:#####).
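To make the gene/protein distinction concrete, here is a rough sketch of choosing the PR row over the MGI gene row when both carry the same UniProtKB cross-reference. It is not the gopreprocess implementation: the function name and the simplified `(db_object_id, xrefs)` row shape are assumptions, and how an isoform suffix such as -2 should be matched is deliberately left open.

```python
def protein_ids_by_uniprot(gpi_rows):
    """Map UniProtKB xrefs to PR ids, ignoring gene (MGI:MGI:...) rows.

    gpi_rows: iterable of (db_object_id, xrefs) pairs, where xrefs is a
    list of CURIE strings (a simplified, hypothetical view of a GPI row).
    """
    mapping = {}
    for object_id, xrefs in gpi_rows:
        if not object_id.startswith("PR:"):
            continue  # for the isoform file, only protein rows are wanted
        for xref in xrefs:
            if xref.startswith("UniProtKB:"):
                mapping[xref] = object_id
    return mapping

gpi_rows = [
    ("MGI:MGI:87961", ["UniProtKB:A2ASQ1"]),  # gene row
    ("PR:A2ASQ1", ["UniProtKB:A2ASQ1"]),      # protein row
]
mapping = protein_ids_by_uniprot(gpi_rows)
```

For the non-isoform mouse file the filter would be the other way around, keeping only the MGI:MGI: rows.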
@sierra-moxon @LiNiMGI Yesterday you showed a report that contained a bunch of errors where MGI annotations were failing GO-rule 1. But when I go and look at the live report, I don't see them: http://snapshot.geneontology.org/reports/gorule-report.html
FYI, this problem finally caught up with us at MGI/Production on Dec 17/18.
See earlier comments:
Issue with GOREF:0000033 is that it should be GO_REF:0000033
GOREF vs GO_REF.
So, this is an issue in mgi.gpad or whatever feeds it.
this affects 68984 annotations.
So, I can work around this in my testing, but this GOREF needs to be fixed -> GO_REF
Ticket to fix this is here: https://github.com/geneontology/go-site/issues/2185#issuecomment-1860864834
I re-ran the pipeline on Dec 20, to show fixes to the issues above, and though it completed successfully, when I went searching for files this morning, I could not find them. I am re-running today.
@sierra-moxon
http://skyhook.berkeleybop.org/full-issue-325-gopreprocess/annotations/ The requested URL was not found on this server.
@sierra-moxon as of 6:53AM/EST, still cannot access URL
This was created from a conversation between @ukemi and @sierra-moxon, making explicit an implicit task.
geneontology/gopreprocess#9