Closed kltm closed 7 months ago
@ukemi Listed as: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_isoform.gaf.gz Any reason not to use: http://snapshot.geneontology.org/annotations/goa_human_isoform.gaf.gz ?
Tagging @sierra-moxon @ukemi
@kltm The above is an interesting question that I also pondered. In particular, why do we use the rat file from the GOC and the human file from UniProt? I couldn't find a documented reason, but I assume there was one. That said, to address the issue more directly, do you know whether these files differ and, if so, how? It could be that we wanted the human file to come from the original source of the annotations (the 'truth'), but then I don't know why we get the rat one from the GOC. Or it could be that we had decided that the file from UniProt was more up to date than the one at the GOC. If the file at the GOC directly reflects the file from UniProt, then I suspect we could just pick that up.
From Lori:
We use this Alliance file: fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz
This file is Orthology Filter: Stringent
@ukemi Well, happily in this case, I note that we are drawing it from https://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_isoform.gaf.gz, the same file. That means the changes defined are: http://snapshot.geneontology.org/reports/goa_human_isoform-report.html. With that, whichever source we use should be fine, as it will go through the GORULEs anyway.
tagging @leemgi
@ukemi @kltm @sierra-moxon
• We use this Alliance file: fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz
  o This file is Orthology Filter: Stringent
• We use the Rat markers from Entrezgene to load our Rat genes/RGD: into MGI
  o http://ftp.ncbi.nih.gov/gene/DATA
  o gene2accession.gz gene2pubmed.gz gene2refseq.gz gene_history.gz gene_info.gz
• We use columns 1 = MGI, column 5 = RGD, going in one direction only
• As long as both MGI: and RGD:* exist in our database, we load the MGI/Rat association.
Let me know if there is any other info you need about this.
Thanks. Lori
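The load rule Lori describes above could be sketched roughly as follows. This assumes a tab-separated orthology file with the mouse ID in column 1 and the rat ID in column 5; the function name and the `known_ids` set (a stand-in for "exists in our database") are hypothetical, not MGI's actual loader.

```python
import csv
import io

def mgi_rat_pairs(tsv_text, known_ids):
    """Yield (MGI, RGD) pairs taken from column 1 and column 5 of each row.

    known_ids is a hypothetical stand-in for the database lookup.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Skip blank lines, comment lines, and rows that are too short.
        if len(row) < 5 or row[0].startswith("#"):
            continue
        mgi_id, rgd_id = row[0], row[4]
        # One direction only: MGI in column 1, RGD in column 5.
        # This also skips the header row and the reverse (RGD -> MGI) rows.
        if not (mgi_id.startswith("MGI:") and rgd_id.startswith("RGD:")):
            continue
        # Load the association only when both ids are already known.
        if mgi_id in known_ids and rgd_id in known_ids:
            yield mgi_id, rgd_id
```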
documenting: I'm going to use the snapshot file that @kltm posted. :) we can always change this if necessary.
today is my last work day before vacation. i will be back to work on July 10. talk to you then.
thanks for the heads up @leemdi - I started with the rat annotations and have a GAF file based on your requirements. (you should get an email about it). the status here is that I am currently comparing my output to the file I can download from you (MGI - the nonoctua file). I'll post this over in the rat ticket too, but I would be happy for feedback when you get back. :)
@sierra-moxon @ukemi thanks! I know David wanted to take a look. So he can do his part while I am away. Will touch base with you when I get back on July 10.
@ukemi @kltm - what file do you want me to use to map UniProtKB identifiers (in the goa_human_isoform.gaf.gz file) to HGNC identifiers, so that I can use Alliance orthology file to get MGI identifiers from HGNC identifiers?
@sierra-moxon Doesn't the alliance have this info already? It might be worthwhile to extract it from them or request a file that has mappings that works for us.
Not in the official downloads page: https://www.alliancegenome.org/downloads
My insider information let me find this: https://fms.alliancegenome.org/api/snapshot/release/5.0.0 (release 5.0.0 version), I see a cross reference file in the payload: https://download.alliancegenome.org/4.2.0/GENECROSSREFERENCE/COMBINED/GENECROSSREFERENCE_COMBINED_0.tsv.gz. (the 4.2.0 confuses me).
I'll use this for now: https://fms.alliancegenome.org/download/GENECROSSREFERENCE_COMBINED.tsv.gz
Good question! I knew we would need @leemdi at some point. @cindyJax do you know which Alliance file MGI uses for human orthology mapping? When I try to look at the load file from our software page, it hangs up. PS. So far the rat file looks awesome.
"Alliance combined orthology data"
Thanks, @ukemi @kltm @cindyJax - I definitely have the ortho file. Since the human GAF file uses UniProtKB identifiers, and the Alliance uses HGNC identifiers, I need a file to help me translate between those namespaces. To get that piece, I'll use this for now: https://fms.alliancegenome.org/download/GENECROSSREFERENCE_COMBINED.tsv.gz. Once translated to HGNC, I can then use the orthology file from Alliance to translate HGNC ids to MGI ids.
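A minimal sketch of the two-step translation described above (UniProtKB to HGNC, then HGNC to MGI), assuming both files have already been parsed into (key, value) id pairs. The function names are hypothetical, and the column layout of the cross-reference file is deliberately left out because it is not given in this thread.

```python
from collections import defaultdict

def build_index(pairs):
    """Group (key, value) pairs into a dict of key -> set of values."""
    index = defaultdict(set)
    for key, value in pairs:
        index[key].add(value)
    return index

def uniprot_to_mgi(uniprot_id, uniprot_to_hgnc, hgnc_to_mgi):
    """Return every MGI id reachable from a UniProtKB id by way of HGNC."""
    mgi_ids = set()
    for hgnc_id in uniprot_to_hgnc.get(uniprot_id, ()):
        mgi_ids |= hgnc_to_mgi.get(hgnc_id, set())
    return mgi_ids
```

Returning a set rather than a single id keeps the one-to-many cases visible to whatever filtering comes next.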
It's a bit unfortunate that they are not xref'd in the human GPI file.
I wonder if this report might be useful: http://www.informatics.jax.org/downloads/reports/HOM_MouseHumanSequence.rpt
from the meeting today:
Thanks @sierra-moxon! I was going to update this first thing this morning, but you beat me to it.
the Alliance XREF file has UniProt/HGNC mapping, but it's only the reference proteome. Sierra will try the file @ukemi suggested above, from MGI directly. If more folks want to use this pipeline, this is one area where they will need to provide a file to use.
A solution to this would be to have the resources that are responsible for individual species include xrefs in their GPI files. Specifically, in this case, UniProt could create a human GPI that contains xrefs to HGNC for every object they define as 'annotatable'. This could even be done transitively, where each proteoform refers to a gene-centric identifier that xrefs an HGNC identifier.
@sierra-moxon I just went back and double-checked the gaf and gpad file specs. I had remembered correctly that the gaf doesn't allow comments or anything similar. The gpad does have a field for annotation_properties. In the gpad we could either add the original PMIDs as comments or create a specific property for them. As long as we are consistent, I think we could parse them to populate the fields in MGI that I was showing you yesterday.
swapped to the MGI UniProt file, still seeing far fewer human annotations than I should.
two restrictions from the rat->mouse specification (I am using the same rules for the human->mouse conversion) are causing this:
the source human GAF file primarily uses GO_REF: ids instead of PMIDs. I probably should allow GO_REF to stand in for PMIDs, right?
~100K annotations from the human GAF file are skipped in my pipeline because they use evidence codes (e.g. IEA, ISO, ND) that are not in the restricted evidence code list. Should this change based on species?
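The two restrictions above can be sketched as a single row filter. Two things here are assumptions: the allow-list below is the set of GO experimental evidence codes, since the actual restricted list is not spelled out in this thread, and `allow_go_ref` is a hypothetical switch for the open question of whether GO_REF ids may stand in for PMIDs.

```python
# Assumed allow-list: the GO experimental evidence codes. The real
# restricted list used by the pipeline is not given in this thread.
ALLOWED_EVIDENCE = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

def keep_annotation(gaf_line, allow_go_ref=False):
    """Return True if a GAF row passes the evidence and reference restrictions."""
    # GAF header and comment lines start with "!".
    if gaf_line.startswith("!"):
        return False
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) < 15:
        return False
    references = cols[5].split("|")   # column 6: DB:Reference
    evidence = cols[6]                # column 7: evidence code
    if evidence not in ALLOWED_EVIDENCE:
        return False
    prefixes = ("PMID:", "GO_REF:") if allow_go_ref else ("PMID:",)
    return any(ref.startswith(prefixes) for ref in references)
```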
Here's another difference: in the first line of Lori's file here I see "UniProtKB:P23528" (https://docs.google.com/spreadsheets/d/1IpyR8K9JRGu-ehoatSu0yjOm8pesjdalOJkjoNXc5yQ/edit#gid=0).
MGI | MGI:101757 | Cfl1 | located_in | GO:0005737 | MGI:MGI:4834177\|GO_REF:0000096 | ISO | UniProtKB:P23528 | C | cofilin 1, non-muscle | Cof\|cofilin\|n-cofilin | protein_coding_gene | taxon:10090 | 20151117 | MGI
I don't see P23528 anywhere in either of these two files:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_isoform.gaf.gz
http://snapshot.geneontology.org/annotations/goa_human_isoform.gaf.gz
In this file: http://snapshot.geneontology.org/annotations/goa_human.gaf.gz I only see it as an IBA annotation, e.g.:
UniProtKB P23528 CFL1 enables GO:0051015 PMID:21873635 IBA PANTHER:PTN000227258|PomBase:SPAC20G4.06c|TAIR:locus:2059861|UniProtKB:P23528|SGD:S000003973|WB:WBGene00006794|UniProtKB:P60981|MGI:MGI:101763|UniProtKB:P10668|MGI:MGI:1929270|MGI:MGI:101757|RGD:69285 F Cofilin-1 CFL1|CFL protein taxon:9606 20220922 GO_Central
UniProtKB P23528 CFL1 is_active_in GO:0005737 PMID:21873635 IBA PANTHER:PTN000227258|PomBase:SPAC20G4.06c|SGD:S000003973|WB:WBGene00006794|UniProtKB:P23528|UniProtKB:Q8I467|UniProtKB:C4LVG4|UniProtKB:Q9Y281|UniProtKB:Q580V7|dictyBase:DDB_G0277833|RGD:69285|MGI:MGI:101757|MGI:MGI:1929270 C Cofilin-1 CFL1|CFL protein taxon:9606 20230602 GO_Central
I don't think I should be generating this annotation - Is it possible that I don't have the correct file (either I am comparing to the wrong "Lori's file" or maybe I am using the wrong human GAF file)?
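One way to check which file actually contains an accession is to scan the GAF rows directly. This is only a sketch: the function names are hypothetical, and `find_in_gzip` assumes the `.gaf.gz` file has already been downloaded to a local path.

```python
import gzip

def find_accession(lines, accession):
    """Return (GO id, reference, evidence) for rows whose column 2 equals accession."""
    hits = []
    for line in lines:
        # GAF header and comment lines start with "!".
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 6 and cols[1] == accession:
            hits.append((cols[4], cols[5], cols[6]))
    return hits

def find_in_gzip(path, accession):
    """Run the same check against a local .gaf.gz download."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return find_accession(handle, accession)
```

Running `find_in_gzip` on each candidate download with `"P23528"` would show at a glance which evidence codes and references each file carries for that protein.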
@sierra-moxon We should definitely be using the same filters and not loading IBAs. If I look at the annotation at MGI, the source PMID is PMID:25556234 (one of the reasons it is nice to have this loaded). I see that annotation at UniProt (https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=P23528):
UniProtKB:P23528 | CFL1 | located_in | GO:0005737 cytoplasm | ECO:0000314 IDA | PMID:25556234 | | 9606 Homo sapiens | AgBase
If I download the human annotation gaf from the GO site (http://geneontology.org/gene-associations/goa_human.gaf.gz), I see it:
UniProtKB P23528 CFL1 located_in GO:0005737 PMID:25556234 IDA C Cofilin-1 CFL1|CFL protein taxon:9606 20151117 AgBase
I also see it here (http://snapshot.geneontology.org/annotations/goa_human.gaf.gz).
UniProtKB P23528 CFL1 located_in GO:0005737 PMID:25556234 IDA C Cofilin-1 CFL1|CFL protein taxon:9606 20151117 AgBase
So the question is why are you not seeing it when you download the file??????
Note to @leemdi and @sierra-moxon on annotations that are in the generated file versus the MGI file:
I'm starting to go through the differences between @sierra-moxon's file and @leemdi's file for the UniProt human ISO annotations. @sierra-moxon has an annotation that is not in the @leemdi file. If I go to the load logs, I see that the annotation has been kicked out of the MGI load and is in the error reports:
89569 NON_1TO1_P 44596110 UniProtKB P01911 HLA-DRB1 involved_in GO:0032831 PMID:28467828 IDA P HLA class II histocompatibility antigen, DRB1 beta chain HLA-DRB1 protein taxon:9606 20201030 UniProt
I believe this is because there is not a 1:1 correspondence of the UniProt identifier to an MGI marker and this is a process annotation, correct? I had forgotten that we had this restriction in the load. Sierra, the UniProt identifier is associated with 2 markers in MGI, H2-Eb1 (MGI:95901) and H2-Eb2 (MGI:95902). Since paralogs in an organism are often involved in different processes, we conservatively don't make the association for any gene that maps to more than one mouse gene for biological process.
I hope this makes sense.
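As a rough illustration of the restriction described above: process annotations are dropped when the human identifier maps to more than one mouse marker. The function name is hypothetical, and the sketch assumes that function and component annotations are still passed to every mapped marker, which the load rule as described here implies but does not state outright.

```python
def mouse_targets(aspect, mouse_ids):
    """Return the mouse markers that may receive an orthology-based annotation.

    aspect is the GAF aspect column: "P" (process), "F" (function), "C" (component).
    """
    if aspect == "P" and len(mouse_ids) != 1:
        # Paralogs may be involved in different processes, so non-1:1
        # mappings are conservatively skipped for biological process.
        return set()
    return set(mouse_ids)
```

For the P01911 example, `mouse_targets("P", {"MGI:95901", "MGI:95902"})` returns an empty set, which matches the NON_1TO1_P entry in the error report.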
Notes from Wednesday stand-up meeting.
https://docs.google.com/document/d/186JSYVJpWDf_vgJSkwiSNq3H9wW2tw5Db_4kQsKJW8Y/edit https://docs.google.com/spreadsheets/d/1bG-AeilN8H1Ul8nHDRtiVnAht1YjdN8jZZ4EwiX18q8/edit#gid=0
@sierra-moxon has created a new set of files in the 9/12 folder of the project. @sierra-moxon see the comment that is two posts above for the rule we were discussing.
Hi @sierra-moxon and @leemdi
Today I spent the day analyzing the annotations that are in @leemdi's file and not @sierra-moxon's.
The results are here: https://docs.google.com/document/d/1Cg8tYvzr8BY1fG-cyC2uqhxzQ5xAl25udk5NTRcJfwQ/edit
What I found fell into two categories:
Hi @sierra-moxon and @leemdi
Today I spent the day analyzing the annotations that are in @sierra-moxon's file and not @leemdi's.
The results are here: https://docs.google.com/document/d/1os6eP5s1tWJFQUG0m01l0wD9ViO_YAQjLkJBbKyFttU/edit
Here are the categories:
Hi @sierra-moxon and @leemdi
Today @kltm and @leemdi and I met and went over the annotations that were in @leemdi's file and weren't in @sierra-moxon's file. @leemdi is going to look over the logic for determining whether there is a 1:1 orthology relationship between an MGI gene and a UniProt protein/gene. @leemdi will report back to @ukemi why the annotations in her file were missed. @ukemi suspects this is an error.
We decided on today's call that rather than modify @leemdi's logic, we would just fix the problem on the GOC side. It looks like @sierra-moxon has already filtered out those annotations correctly. However, note that this logic is not completely working on @sierra-moxon's side either since there are some annotations in her file that @leemdi has filtered out for the non-1:1 reason.
@leemdi has looked into the logic that filters the process annotations when there isn't a 1:1 correspondence between MGI markers and human genes. For this logic we use the Alliance orthology files. The file is at https://www.alliancegenome.org/downloads#orthology, and I think @sierra-moxon should use it to determine the 1:1 mappings in the same way as MGI. Based on my analysis of @leemdi's file, there may be issues with how this is done at the moment on our end, but it would be best to correct this on the GOC's end moving forward. From my selfish point of view, I will need a way to filter these annotations from the @leemdi file when I do my comparison. Let's discuss this at our next call.
To do this, it looks like we will need a way to map the UniProt ids in the annotation file to the HGNC ids in the Alliance orthology file.
Or maybe not: At the level of an MGI marker, we would have filtered process annotations to these three. Below are lines from the Alliance file:
MGI:107729 Igtp NCBITaxon:10090 Mus musculus HGNC:29597 IRGM NCBITaxon:9606 Homo sapiens OrthoInspector|OrthoFinder|PANTHER|SonicParanoid|PhylomeDB 5 10 Yes No
MGI:107567 Irgm1 NCBITaxon:10090 Mus musculus HGNC:29597 IRGM NCBITaxon:9606 Homo sapiens OrthoInspector|OrthoFinder|Ensembl Compara|PANTHER|SonicParanoid|PhylomeDB 6 10 Yes Yes
MGI:1926262 Irgm2 NCBITaxon:10090 Mus musculus HGNC:29597 IRGM NCBITaxon:9606 Homo sapiens OrthoInspector|OrthoFinder|HGNC|PANTHER|SonicParanoid|PhylomeDB 6 10 Yes Yes
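Detecting the non-1:1 cases from rows like these could look roughly like the sketch below. It assumes each record has been split on tabs, with the mouse gene id in column 1 and the human gene id in column 5, and it treats "1:1" as one-to-one in both directions, which is an interpretation rather than something settled in this thread. The function name is hypothetical.

```python
from collections import defaultdict

def one_to_one_pairs(rows):
    """Return the (mouse_id, human_id) pairs that are 1:1 in both directions.

    Each row is a tab-split Alliance orthology record with the mouse gene id
    in column 1 and the human gene id in column 5.
    """
    human_by_mouse = defaultdict(set)
    mouse_by_human = defaultdict(set)
    for row in rows:
        mouse_id, human_id = row[0], row[4]
        human_by_mouse[mouse_id].add(human_id)
        mouse_by_human[human_id].add(mouse_id)
    return {
        (mouse_id, human_id)
        for mouse_id, humans in human_by_mouse.items()
        for human_id in humans
        if len(humans) == 1 and len(mouse_by_human[human_id]) == 1
    }
```

With only the three IRGM rows as input, no pair qualifies, because HGNC:29597 maps to three mouse markers.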
@ukemi If more files or input need to come in, let me know. I'd like to make sure we have a handle on all the dependencies (as we originally set ourselves up for that to be a very low number).
sorry, not enough time to respond properly until later this week, but I already use the Alliance ortho file, and I use an HGNC <-> UniProt mapping file from MGI. I think during our RGD file review we wanted 1->many gene mappings (and vice versa) to show up ... I need to dig into your examples, though, and modify the human conversion appropriately (e.g. if we only want 1:1 for human, or if I interpreted the 1:many requirement incorrectly for RGD and need to fix that, I certainly can).
Hi @sierra-moxon. Yay, that you use the Alliance file. I think we should have the same rules in place for human and rat. We want to make all of this as universal as possible for those who will follow in the future.
@kltm. So far as I see it the only problematic file wrt making this a universal (ie non-MOD-specific) process is our use of an MGI-specific file for the human UniProt_ID->MGI ID mappings. Ideally we would like this to come from the Alliance (future). Big win that @sierra-moxon is already using the Alliance file to detect the non-1:1s. The Alliance file is global, so let's say we wanted to move this on to another MOD, we would still use that file and just tweak it a little for the different species (he says not actually writing the code).
At the risk of being a pest, keep in mind that this 1:1 rule only applies to filtering out BP annotations.
I have finished going over the 'unique_to_lori' file and added my analysis results to @sierra-moxon's report: https://docs.google.com/spreadsheets/d/1VAHU-exP-j__gm7X84Szj9KWUDtEM5nJCyisYgF8SFQ/edit#gid=0
After examining about 1/3 of the annotations they seem to fall into 2 categories
I have finished going over the 'unique_to_sierra' file and added my analysis results to @sierra-moxon's report: https://docs.google.com/spreadsheets/d/1ziaXNYZZXI_PrLKW0_bs1N6zwZ97pPzEgic1MzI_Cgs/edit#gid=0
I wasn't able to review as many annotations as with the other file, but the results seemed to fall into several categories.
Conclusion from today's call is that @sierra-moxon will try to deal with detecting the NOT annotations that already exist and suppress the ISOs, but we will leave the duplicates as a bigger GO issue to be dealt with at a later date.
currently starting to work on detecting NOT annotations that already exist and suppressing ISOs that assert the non-negated form of the annotation.
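A minimal sketch of that suppression step, assuming annotations are represented as (gene_id, qualifier, go_id) tuples and that the candidate ISOs are all non-negated. The function name and the tuple layout are hypothetical, and this is not the pipeline's actual code.

```python
def suppress_negated(candidate_isos, existing_annotations):
    """Drop candidate ISO annotations contradicted by an existing NOT annotation.

    Both arguments are iterables of (gene_id, qualifier, go_id) tuples. The
    qualifier is the GAF qualifier column, e.g. "NOT|involved_in".
    """
    negated = {
        (gene_id, go_id)
        for gene_id, qualifier, go_id in existing_annotations
        if "NOT" in qualifier.split("|")
    }
    # Keep a candidate only if no NOT annotation exists for the same gene and term.
    return [iso for iso in candidate_isos if (iso[0], iso[2]) not in negated]
```

This matches on the exact GO term only; it does not consider ancestor or descendant terms.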
@LiNiMGI Can you check whether this NOT issue has been resolved?
These should be loaded with correct date and assigned_by GOC
Can this ticket be closed?
@LiNiMGI to check whether the NOT issue has been resolved.
But in general, where should this occur in the pipeline, should it be a GO rule?
@pgaudet the first "not" rule (see above) was implemented in the pipeline. The second "not" rule is very hard to do without a database, so we will have to hold off on it.
We can discuss further with other curators how to handle inferred "not" annotations in general. For example, GO_REF:0000024 specifically says: Only annotations with an experimental evidence code and which do not have the 'NOT' qualifier are transferred. This agrees with the first "not" rule above.
we can close this ticket for now and bring the "not" discussion somewhere else. Thanks, Li
Replace the current MGI upstream pipeline with a local GO pipeline for the human orthology file.
Would be using the GO human file, the MGI GPI file, and a TBD (AGR) orthology file.
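For orientation, the core conversion step might look something like the sketch below, modeled on the Cfl1 example line earlier in this thread. The `mouse_gene` dict keys are hypothetical (the fields would come from the MGI GPI file), the reference string is copied from that example line, and carrying over the source annotation's date is an assumption based on the same example. `assigned_by` is left as a parameter because the thread mentions both MGI and GOC.

```python
# Reference pair taken from the MGI example line in this thread.
ISO_REFERENCE = "MGI:MGI:4834177|GO_REF:0000096"

def to_mouse_iso(human_cols, mouse_gene, assigned_by="MGI"):
    """Build a 17-column mouse GAF row from a tab-split human GAF row.

    mouse_gene is a dict with hypothetical keys: id, symbol, name, synonyms, type.
    """
    return [
        "MGI",
        mouse_gene["id"],
        mouse_gene["symbol"],
        human_cols[3],                        # qualifier, e.g. located_in
        human_cols[4],                        # GO id
        ISO_REFERENCE,
        "ISO",
        f"{human_cols[0]}:{human_cols[1]}",   # with/from: the human source object
        human_cols[8],                        # aspect
        mouse_gene["name"],
        "|".join(mouse_gene["synonyms"]),
        mouse_gene["type"],
        "taxon:10090",
        human_cols[13],                       # date carried over from the source row
        assigned_by,
        "",                                   # annotation extension (left empty)
        "",                                   # gene product form id (left empty)
    ]
```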