Add human orthology file to MGI datastream

kltm commented 1 year ago

Replace the current MGI upstream pipeline with a local GO pipeline for the human orthology file.

Would be using the GO human file, the MGI GPI file, and a TBD (AGR) orthology file.

kltm commented 1 year ago

@ukemi Listed as: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_isoform.gaf.gz Any reason not to use: http://snapshot.geneontology.org/annotations/goa_human_isoform.gaf.gz ?

kltm commented 1 year ago

Tagging @sierra-moxon @ukemi

ukemi commented 1 year ago

@kltm The above is an interesting question that I also pondered. In particular, why do we use the rat file from the GOC and the human file from UniProt? I couldn't find a documented reason, but I assume there was one. That said, to address the issue more directly, do you know if these files differ and if so how? It could be that we wanted the human file to be from the original source of the annotations (the 'truth'), but then I don't know why we get the rat one from the GOC. Or it could be that we had decided that the file from UniProt was more up to date than the one at the GOC. If the file at the GOC directly reflects the file from UniProt, then I suspect we could just pick that up.

kltm commented 1 year ago

From Lori:

We use this Alliance file: fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz

This file is Orthology Filter: Stringent

kltm commented 1 year ago

@ukemi Well, happily in this case, I note that we are drawing it from https://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_isoform.gaf.gz, which is the same file. That means the changes made to it are the ones reported at: http://snapshot.geneontology.org/reports/goa_human_isoform-report.html. With that, whatever source we use should be fine as it will go through the GORULEs anyways.

ukemi commented 1 year ago

tagging @leemgi

leemdi commented 1 year ago

@ukemi @kltm @sierra-moxon

• We use this Alliance file: fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz
  o This file is Orthology Filter: Stringent
• We use the Rat markers from Entrezgene to load our Rat genes/RGD: into MGI
  o http://ftp.ncbi.nih.gov/gene/DATA
  o gene2accession.gz gene2pubmed.gz gene2refseq.gz gene_history.gz gene_info.gz
• We use columns 1 = MGI, column 5 = RGD, going in one direction only
• As long as both MGI: and RGD:* exist in our database, we load the MGI/Rat association.
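
A minimal sketch of the load rule described in this comment (column 1 = MGI, column 5 = RGD, one direction only, both IDs must already exist locally). The function name `load_mgi_rat_pairs` and the two "known ID" sets are hypothetical stand-ins for the MGI database lookup; this is not the actual MGI load code.

```python
import csv
import io


def load_mgi_rat_pairs(tsv_text, known_mgi_ids, known_rgd_ids):
    """Return (MGI, RGD) pairs from orthology rows, in one direction only."""
    pairs = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Skip blank lines, comment lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < 5:
            continue
        gene1, gene2 = row[0], row[4]  # column 1 and column 5 (1-based)
        # One direction only: MGI in column 1, RGD in column 5.
        if not (gene1.startswith("MGI:") and gene2.startswith("RGD:")):
            continue
        # Load the association only when both IDs exist locally (assumed lookup).
        if gene1 in known_mgi_ids and gene2 in known_rgd_ids:
            pairs.append((gene1, gene2))
    return pairs


# Illustrative rows only; columns 2-4 are filler, not real Alliance values.
sample = "\n".join([
    "\t".join(["MGI:101757", "Cfl1", "-", "-", "RGD:69285"]),
    "\t".join(["RGD:69285", "Cfl1", "-", "-", "MGI:101757"]),  # wrong direction
])
print(load_mgi_rat_pairs(sample, {"MGI:101757"}, {"RGD:69285"}))
# prints [('MGI:101757', 'RGD:69285')]
```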

Let me know if there is any other info you need about this.

Thanks. Lori

sierra-moxon commented 1 year ago

documenting: I'm going to use snapshot file that @kltm posted. :) we can always change this if necessary.

leemdi commented 1 year ago

today is my last work day before vacation. i will be back to work on July 10. talk to you then.

sierra-moxon commented 1 year ago

thanks for the heads up @leemdi - I started with the rat annotations and have a GAF file based on your requirements. (you should get an email about it). the status here is that I am currently comparing my output to the file I can download from you (MGI - the nonoctua file). I'll post this over in the rat ticket too, but I would be happy for feedback when you get back. :)

leemdi commented 1 year ago

@sierra-moxon @ukemi thanks! I know David wanted to take a look. So he can do his part while I am away. Will touch base with you when I get back on July 10.

sierra-moxon commented 1 year ago

@ukemi @kltm - what file do you want me to use to map UniProtKB identifiers (in the goa_human_isoform.gaf.gz file) to HGNC identifiers, so that I can use Alliance orthology file to get MGI identifiers from HGNC identifiers?

kltm commented 1 year ago

@sierra-moxon Doesn't the Alliance have this info already? It might be worthwhile to extract it from them or request a file that has mappings that work for us.

sierra-moxon commented 1 year ago

Not in the official downloads page: https://www.alliancegenome.org/downloads

My insider information let me find this: https://fms.alliancegenome.org/api/snapshot/release/5.0.0 (release 5.0.0 version), I see a cross reference file in the payload: https://download.alliancegenome.org/4.2.0/GENECROSSREFERENCE/COMBINED/GENECROSSREFERENCE_COMBINED_0.tsv.gz. (the 4.2.0 confuses me).

I'll use this for now: https://fms.alliancegenome.org/download/GENECROSSREFERENCE_COMBINED.tsv.gz

ukemi commented 1 year ago

Good question! I knew we would need @leemdi at some point. @cindyJax do you know which Alliance file MGI uses for human orthology mapping? When I try to look at the load file from our software page, it hangs up. PS. So far the rat file looks awesome.

cindyJax commented 1 year ago

"Alliance combined orthology data"

https://www.alliancegenome.org/downloads#orthology

sierra-moxon commented 1 year ago

Thanks, @ukemi @kltm @cindyJax - I definitely have the ortho file. Since the human GAF file uses UniProtKB identifiers, and the Alliance uses HGNC identifiers, I need a file to help me translate between those namespaces. To get that piece, I'll use this for now: https://fms.alliancegenome.org/download/GENECROSSREFERENCE_COMBINED.tsv.gz. Once translated to HGNC, I can then use the orthology file from Alliance to translate HGNC ids to MGI ids.
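
The two-step translation described here (UniProtKB to HGNC via a cross-reference file, then HGNC to MGI via the orthology file) could be sketched as below. The dictionaries stand in for the parsed files, the function name is hypothetical, and the HGNC identifier is a placeholder rather than a real ID.

```python
def uniprot_to_mgi(uniprot_id, uniprot_to_hgnc, hgnc_to_mgi):
    """Translate UniProtKB -> HGNC -> MGI; returns a sorted list of MGI IDs."""
    mgi_ids = set()
    # A UniProtKB ID may map to zero, one, or several HGNC IDs.
    for hgnc_id in uniprot_to_hgnc.get(uniprot_id, ()):
        # Each HGNC ID may in turn have zero, one, or several mouse orthologs.
        mgi_ids.update(hgnc_to_mgi.get(hgnc_id, ()))
    return sorted(mgi_ids)


# Placeholder HGNC ID; the MGI IDs are the two markers named later in this thread.
uniprot_to_hgnc = {"UniProtKB:P01911": {"HGNC:PLACEHOLDER"}}
hgnc_to_mgi = {"HGNC:PLACEHOLDER": {"MGI:95901", "MGI:95902"}}
print(uniprot_to_mgi("UniProtKB:P01911", uniprot_to_hgnc, hgnc_to_mgi))
# prints ['MGI:95901', 'MGI:95902']
```

An identifier missing from either mapping yields an empty list, which is one way the "human file is way too small" symptom can show up when the cross-reference file covers only part of the proteome.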

ukemi commented 1 year ago

It's a bit unfortunate that they are not xref'd in the human GPI file.

ukemi commented 1 year ago

I wonder if this report might be useful: http://www.informatics.jax.org/downloads/reports/HOM_MouseHumanSequence.rpt

sierra-moxon commented 1 year ago

from the meeting today:

- the Alliance XREF file has UniProt/HGNC mapping, but it's only the reference proteome. Sierra will try the file @ukemi suggested above, from MGI directly. If more folks want to use this pipeline, this is one area where they will need to provide a file to use.
- RGD file looks pretty good; it is missing ~360 annotations out of 35,000 when compared with the MGI file. This might be "good enough."
  - sierra will run the differ on these two files to see if we can pinpoint the discrepancy.
  - we don't have a place in the GAF file to store the original publication (right now that could go into a GPAD output file from this pipeline if we want it to). sierra will see about creating a GPAD file as an additional product here.
  - we don't want to convert RGD ids in the "with_support_from" column back to UniProt identifiers in the rat version of the conversion, but we do want to convert HGNC ids back to UniProt ids in the human version of the conversion.
- The human file is way too small (this is likely a result of the reference proteome ids in the Alliance XREF file). sierra will update to use the file from MGI instead and we can see if that increases the number of lines the new pipeline is generating.
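
For the file comparison mentioned in these notes, a simple set-based comparison is one way to see which annotations are unique to each file. This is only a sketch and not the differ tool referred to above; the choice of key fields (object ID, qualifier, GO ID, evidence code) is an assumption.

```python
def _keys(gaf_lines):
    """Build a set of comparison keys from tab-separated GAF lines."""
    keys = set()
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue
        # GAF columns (1-based): 2 = object ID, 4 = qualifier,
        # 5 = GO ID, 7 = evidence code.
        keys.add((cols[1], cols[3], cols[4], cols[6]))
    return keys


def diff_annotations(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sorted lists of keys."""
    keys_a, keys_b = _keys(lines_a), _keys(lines_b)
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)
```

Sorting the two result lists keeps the output stable between runs, which makes it easier to review the "unique to" lists by hand.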

ukemi commented 1 year ago

Thanks @sierra-moxon! I was going to update this first thing this morning, but you beat me to it.

"the Alliance XREF file has UniProt/HGNC mapping, but it's only the reference proteome. Sierra will try the file @ukemi suggested above, from MGI directly. If more folks want to use this pipeline, this is one area where they will need to provide a file to use"

A solution to this would be to have the resources that are responsible for individual species include xrefs in their GPI files. Specifically, in this case, UniProt could create a human GPI that contains xrefs to HGNC for every object they define as 'annotatable'. This could even be done transitively, where each proteoform refers to a gene-centric identifier which xrefs an HGNC identifier.

ukemi commented 1 year ago

@sierra-moxon I just went back and double-checked the gaf and gpad file specs. I had remembered correctly that the gaf doesn't allow comments or anything similar. The gpad does have a field for annotation_properties. In the gpad we could either add the original PMIDs as comments or create a specific property for them. As long as we are consistent, I think we could parse them to populate the fields in MGI that I was showing you yesterday.
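
One way the "specific property" option could look, assuming the annotation properties field is a pipe-separated list of `name=value` pairs. The property name `original-reference` is hypothetical and has not been agreed on in this thread.

```python
def add_annotation_property(properties_field, name, value):
    """Append name=value to a pipe-separated annotation-properties string."""
    existing = [p for p in properties_field.split("|") if p]
    existing.append(f"{name}={value}")
    return "|".join(existing)


def get_annotation_property(properties_field, name):
    """Return all values recorded for the given property name."""
    values = []
    for item in properties_field.split("|"):
        key, sep, value = item.partition("=")
        if sep and key == name:
            values.append(value)
    return values


# "original-reference" is an illustrative property name, not an agreed one.
field = add_annotation_property("", "original-reference", "PMID:25556234")
print(field)
# prints original-reference=PMID:25556234
print(get_annotation_property(field, "original-reference"))
# prints ['PMID:25556234']
```

Using one fixed property name throughout is what would let a downstream parser populate the MGI fields consistently.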

sierra-moxon commented 1 year ago

swapped to the MGI UniProt file, still seeing quite a few fewer human annotations than I should see.

two restrictions from the rat->mouse specification (I am using the same rules for the human->mouse conversion) are causing this:

- the source human GAF file uses primarily GO_REF: ids instead of PMIDs. I probably should allow GO_REF to stand in for PMIDs, right?
- ~100K annotations are skipped in my pipeline from the human GAF file because they use evidence codes (e.g. IEA, ISO, ND, etc.) that are not in the restricted evidence code list. Should this change based on species?
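
A sketch of the two restrictions in question, with a tally of why lines are skipped. The allowed evidence codes below are an assumption (a set of experimental codes), not the pipeline's confirmed restricted list, and accepting `GO_REF:` alongside `PMID:` reflects the open question above rather than a decision.

```python
from collections import Counter

# Assumption: experimental codes only; the real restricted list may differ.
ALLOWED_EVIDENCE = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def filter_by_evidence(gaf_lines, allowed=ALLOWED_EVIDENCE,
                       ref_prefixes=("PMID:", "GO_REF:")):
    """Return (kept_lines, skipped_counts) for tab-separated GAF lines."""
    kept, skipped = [], Counter()
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue
        references = cols[5].split("|")  # column 6: DB:Reference
        evidence = cols[6]               # column 7: evidence code
        if evidence not in allowed:
            skipped[evidence] += 1
        elif not any(ref.startswith(ref_prefixes) for ref in references):
            skipped["no accepted reference"] += 1
        else:
            kept.append(line)
    return kept, skipped
```

Printing `skipped` after a run shows how many annotations each evidence code removes, which is the number to look at when deciding whether the list should vary by species.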

sierra-moxon commented 1 year ago

Here's another difference: in the first line of Lori's file here I see "UniProtKB:P23528" (https://docs.google.com/spreadsheets/d/1IpyR8K9JRGu-ehoatSu0yjOm8pesjdalOJkjoNXc5yQ/edit#gid=0).

MGI | MGI:101757 | Cfl1 | located_in | GO:0005737 | MGI:MGI:4834177\|GO_REF:0000096 | ISO | UniProtKB:P23528 | C | cofilin 1, non-muscle | Cof\|cofilin\|n-cofilin | protein_coding_gene | taxon:10090 | 20151117 | MGI

I don't see P23528 anywhere in either of these two files: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_isoform.gaf.gz http://snapshot.geneontology.org/annotations/goa_human_isoform.gaf.gz

In this file: http://snapshot.geneontology.org/annotations/goa_human.gaf.gz I only see it as an IBA annotation, e.g.:

UniProtKB   P23528  CFL1    enables GO:0051015  PMID:21873635   IBA PANTHER:PTN000227258|PomBase:SPAC20G4.06c|TAIR:locus:2059861|UniProtKB:P23528|SGD:S000003973|WB:WBGene00006794|UniProtKB:P60981|MGI:MGI:101763|UniProtKB:P10668|MGI:MGI:1929270|MGI:MGI:101757|RGD:69285    F   Cofilin-1   CFL1|CFL    protein taxon:9606  20220922    GO_Central      
UniProtKB   P23528  CFL1    is_active_in    GO:0005737  PMID:21873635   IBA PANTHER:PTN000227258|PomBase:SPAC20G4.06c|SGD:S000003973|WB:WBGene00006794|UniProtKB:P23528|UniProtKB:Q8I467|UniProtKB:C4LVG4|UniProtKB:Q9Y281|UniProtKB:Q580V7|dictyBase:DDB_G0277833|RGD:69285|MGI:MGI:101757|MGI:MGI:1929270 C   Cofilin-1   CFL1|CFL    protein taxon:9606  20230602    GO_Central

I don't think I should be generating this annotation - Is it possible that I don't have the correct file (either I am comparing to the wrong "Lori's file" or maybe I am using the wrong human GAF file)?

ukemi commented 1 year ago

@sierra-moxon We should definitely be using the same filters and not loading IBAs. If I look at the annotation at MGI, the source PMID is PMID:25556234 (one of the reasons it is nice to have this loaded).

I see that annotation at UniProt (https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=P23528):

UniProtKB:P23528 | CFL1 | located_in | GO:0005737 cytoplasm | ECO:0000314 IDA | PMID:25556234 | | 9606 Homo sapiens | AgBase

If I download the human annotation gaf from the GO site (http://geneontology.org/gene-associations/goa_human.gaf.gz), I see it:

UniProtKB P23528 CFL1 located_in GO:0005737 PMID:25556234 IDA C Cofilin-1 CFL1|CFL protein taxon:9606 20151117 AgBase

I also see it here (http://snapshot.geneontology.org/annotations/goa_human.gaf.gz):

UniProtKB P23528 CFL1 located_in GO:0005737 PMID:25556234 IDA C Cofilin-1 CFL1|CFL protein taxon:9606 20151117 AgBase

So the question is: why are you not seeing it when you download the file?

ukemi commented 1 year ago

Note to @leemdi and @sierra-moxon on annotations that are in the generated file versus the MGI file:

I'm starting to go through the differences in @sierra-moxon's file and @leemdi's file for the UniProt human ISO annotations. @sierra-moxon has an annotation that is not in the @leemdi file. If I go to the load logs, I see that the annotation has been kicked out of the MGI load and is in the error reports:

89569 NON_1TO1_P 44596110 UniProtKB P01911 HLA-DRB1 involved_in GO:0032831 PMID:28467828 IDA P HLA class II histocompatibility antigen, DRB1 beta chain HLA-DRB1 protein taxon:9606 20201030 UniProt

I believe this is because there is not a 1:1 correspondence of the UniProt identifier to an MGI marker and this is a process annotation, correct? I had forgotten that we had this restriction in the load. Sierra, the UniProt identifier is associated with 2 markers in MGI, H2-Eb1 (MGI:95901) and H2-Eb2 (MGI:95902). Since paralogs in an organism often are involved in different processes, we conservatively don't make the association of any gene that maps to more than one mouse gene for biological process.
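
The restriction described here can be sketched as a small predicate: biological process (aspect `P`) annotations are dropped when the source identifier maps to more than one mouse marker, while other aspects are still transferred. The function name is hypothetical and this is not the MGI load code.

```python
def keep_for_mouse(aspect, mouse_markers):
    """Decide whether an annotation may be transferred to mouse.

    aspect: GAF aspect code ("P", "F", or "C").
    mouse_markers: MGI IDs the human identifier maps to.
    """
    if not mouse_markers:
        return False  # nothing to transfer the annotation to
    if aspect == "P" and len(set(mouse_markers)) > 1:
        return False  # conservative rule: no BP transfer for non-1:1 mappings
    return True


# UniProtKB:P01911 maps to two markers, so its process annotation is dropped.
print(keep_for_mouse("P", ["MGI:95901", "MGI:95902"]))
# prints False
print(keep_for_mouse("C", ["MGI:95901", "MGI:95902"]))
# prints True
```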

I hope this makes sense.

ukemi commented 1 year ago

Notes from Wednesday stand-up meeting.

@sierra-moxon has created files for the diffs.

https://docs.google.com/document/d/186JSYVJpWDf_vgJSkwiSNq3H9wW2tw5Db_4kQsKJW8Y/edit https://docs.google.com/spreadsheets/d/1bG-AeilN8H1Ul8nHDRtiVnAht1YjdN8jZZ4EwiX18q8/edit#gid=0

I think the reasons for the differences are pretty straightforward.
@sierra-moxon will create a new diff with the updated files and @ukemi and @sierra-moxon will track down the differences.
Once the human and rat files are in good shape, @sierra-moxon will switch to the mock pipeline work.

ukemi commented 1 year ago

@sierra-moxon has created a new set of files in the 9/12 folder of the project. @sierra-moxon see the comment that is two posts above for the rule we were discussing.

ukemi commented 1 year ago

Hi @sierra-moxon and @leemdi

Today I spent the day analyzing the annotations that are in @leemdi 's file and not @sierra-moxon 's.

The results are here: https://docs.google.com/document/d/1Cg8tYvzr8BY1fG-cyC2uqhxzQ5xAl25udk5NTRcJfwQ/edit

What I found fell into two categories:

- Annotations in the file that were to GO BP (P), but the UniProt identifiers were associated with more than one marker in MGI. This leads me to believe that the criteria above about filtering out those annotations aren't completely working.
- Annotations that were in the file that were derived from the human isoform file. That leads me to believe that perhaps that file isn't being processed?

ukemi commented 1 year ago

Hi @sierra-moxon and @leemdi

Today I spent the day analyzing the annotations that are in @sierra-moxon 's file and not @leemdi 's.

The results are here: https://docs.google.com/document/d/1os6eP5s1tWJFQUG0m01l0wD9ViO_YAQjLkJBbKyFttU/edit

Here are the categories:

- In a couple of cases, it looks like the evidence filtering might not have worked correctly and annotations got through.
- In almost all the cases, there was not a 1:1 mapping of mouse marker to UniProtKB marker. There were multiple human genes associated with a single mouse marker. Therefore the annotations were not loaded.

ukemi commented 1 year ago

Hi @sierra-moxon and @leemdi

Today @kltm and @leemdi and I met and went over the annotations that were in @leemdi 's file and weren't in @sierra-moxon 's file. @leemdi is going to look over the logic for determining whether there is a 1:1 orthology relationship between an MGI gene and a UniProt protein/gene. @leemdi will report back to @ukemi why the annotations in her file were missed. @ukemi suspects this is an error.

We decided on today's call that rather than modify @leemdi's logic, we would just fix the problem on the GOC side. It looks like @sierra-moxon has already filtered out those annotations correctly. However, note that this logic is not completely working on @sierra-moxon's side either since there are some annotations in her file that @leemdi has filtered out for the non-1:1 reason.

ukemi commented 1 year ago

@leemdi has looked into the logic that filters the process annotations when there isn't a 1:1 correspondence between MGI markers and human genes. For this logic we use the Alliance orthology files. The file is here: https://www.alliancegenome.org/downloads#orthology I think @sierra-moxon should use this file to determine the 1:1 mappings in the same way as MGI. Based on my analysis of @leemdi's file, there may be issues with how this is done at the moment on our end, but it would be best to correct this on the GOC's end moving forward. From my selfish point of view, I will need a way to filter these annotations from the @leemdi file when I do my comparison. Let's discuss this at our next call.
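
Determining 1:1 mappings from a list of orthology pairs could be sketched as follows, checking both directions (one human gene to many mouse markers, and many human genes to one mouse marker). The input is assumed to be (human ID, mouse ID) pairs already extracted from the orthology file; the identifiers in the example are placeholders.

```python
from collections import defaultdict


def one_to_one_pairs(pairs):
    """Return the set of (human_id, mouse_id) pairs that are 1:1 both ways."""
    human_to_mouse = defaultdict(set)
    mouse_to_human = defaultdict(set)
    for human_id, mouse_id in pairs:
        human_to_mouse[human_id].add(mouse_id)
        mouse_to_human[mouse_id].add(human_id)
    return {
        (human_id, mouse_id)
        for human_id, mice in human_to_mouse.items()
        for mouse_id in mice
        # Keep the pair only if neither side has another partner.
        if len(mice) == 1 and len(mouse_to_human[mouse_id]) == 1
    }


# Placeholder identifiers for illustration.
pairs = [
    ("HGNC:A", "MGI:1"),                       # 1:1, kept
    ("HGNC:B", "MGI:2"), ("HGNC:B", "MGI:3"),  # one human, two mouse
    ("HGNC:C", "MGI:4"), ("HGNC:D", "MGI:4"),  # two human, one mouse
]
print(one_to_one_pairs(pairs))
# prints {('HGNC:A', 'MGI:1')}
```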

ukemi commented 1 year ago

To do this, it looks like we will need a way to map the UniProt ids in the annotation file to the HGNC ids in the Alliance orthology file.

ukemi commented 1 year ago

Or maybe not: At the level of an MGI marker, we would have filtered process annotations to these three. Below are lines from the Alliance file:

kltm commented 1 year ago

@ukemi If more files or input need to come in, let me know. I'd like to make sure we have a handle on all the dependencies (as we originally set ourselves up for that to be a very low number).

sierra-moxon commented 1 year ago

sorry; not enough time to properly respond until later this week, but I do use Alliance ortho file already, I do use a HGNC <-> UniProt mapping file from MGI. I think during our RGD file review, we wanted 1->many gene mappings (and vice versa) to show up ... I need to dig into your examples though, and modify the human conversion appropriately (e.g. if we only want 1:1 for human, or I interpreted the 1:many requirement incorrectly for RGD and need to fix that I certainly can).

ukemi commented 1 year ago

Hi @sierra-moxon. Yay, that you use the Alliance file. I think we should have the same rules in place for human and rat. We want to make all of this as universal as possible for those who will follow in the future.

ukemi commented 1 year ago

@kltm. So far as I see it the only problematic file wrt making this a universal (ie non-MOD-specific) process is our use of an MGI-specific file for the human UniProt_ID->MGI ID mappings. Ideally we would like this to come from the Alliance (future). Big win that @sierra-moxon is already using the Alliance file to detect the non-1:1s. The Alliance file is global, so let's say we wanted to move this on to another MOD, we would still use that file and just tweak it a little for the different species (he says not actually writing the code).

ukemi commented 1 year ago

At the risk of being a pest, keep in mind that this 1:1 rule only applies to filtering out BP annotations.

ukemi commented 1 year ago

I have finished going over the 'unique_to_lori' file and added my analysis results to @sierra-moxon's report: https://docs.google.com/spreadsheets/d/1VAHU-exP-j__gm7X84Szj9KWUDtEM5nJCyisYgF8SFQ/edit#gid=0

After examining about 1/3 of the annotations, they seem to fall into 2 categories:

1. Annotations that were caught in @sierra-moxon's non-1:1 ortholog script but didn't seem to be caught by @leemdi. @sierra-moxon correctly filtered out the BP annotations from these cases. If you follow the UniProt identifiers noted on the spreadsheet in the MGI search you will see the non-1:1 relationships. So the bottom line is that I think this is working on the end of the new import pipeline. One result of this will be the loss of a lot of annotations to immunological processes for genes in the histocompatibility loci that are not made up by PAINT annotations. I don't see how we can avoid this.
2. Sets of annotations that were annotated to protein complexes using the colocalizes_with relation and converted to part_of using an MGI rule. I discussed this with @vanaukenk because when we did the cleanup of MGI annotations I reannotated all of the colocalizes_with annotations before import. We confirmed that the plan moving forward is to not use this qualifier. See: https://wiki.geneontology.org/Annotation_Relations#Standard_Annotation:_Gene_Product_to_Term_(gp2term)_Relations

[ ] I propose we filter out annotations that use the 'colocalizes_with' qualifier in the initial steps of the pipeline. This should also be the case for the rat ISO load.
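
The proposed early filter could be as small as the sketch below, which drops any GAF line whose qualifier column contains `colocalizes_with` and passes header lines through unchanged. The function name is hypothetical.

```python
def drop_colocalizes_with(gaf_lines):
    """Return GAF lines, minus annotations that use colocalizes_with."""
    kept = []
    for line in gaf_lines:
        # Header/comment lines are passed through untouched.
        if line.startswith("!"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 4 (1-based) holds the qualifier; it may be pipe-separated,
        # e.g. "NOT|colocalizes_with".
        qualifiers = cols[3].split("|") if len(cols) > 3 else []
        if "colocalizes_with" in qualifiers:
            continue
        kept.append(line)
    return kept
```

Running this before the orthology mapping means the same step would cover both the human and the rat ISO loads.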

ukemi commented 1 year ago

I have finished going over the 'unique_to_sierra' file and added my analysis results to @sierra-moxon's report: https://docs.google.com/spreadsheets/d/1ziaXNYZZXI_PrLKW0_bs1N6zwZ97pPzEgic1MzI_Cgs/edit#gid=0

I wasn't able to review as many annotations as with the other file, but the results seemed to fall into several categories.

1. It seems that MGI has a step in place such that if there is a 'NOT' annotation in the incoming file, we do not load any annotations between that gene and GO term. This extends to ortholog clusters. I have discussed this rule with both @vanaukenk and @LiNiMGI. Our consensus was that this rule is overzealous. Since 'NOT' annotations are often made for biology that occurs under specific conditions, positive assertions about the role of a gene should not be filtered out because of their existence. The pipeline as it stands should not be modified. It should ignore 'NOT' annotations, but should load ISOs for any positive assertion.

[ ] 2. A second 'NOT' rule that SHOULD be put into place for the load: if there is an experimentally supported 'NOT' annotation in MGI and the import contains an affirmative (not 'NOT') assertion, we do not load the annotation. This is because the creation of an experimentally supported 'NOT' annotation could show a functional difference between the mouse gene product and the human gene product where the mouse gene product has lost the function/process/component of the ortholog.
3. Many of the annotations in @sierra-moxon's file were filtered in the MGI load because they represent duplicate annotations. Examples are lines 20-23. There was already an ISO annotation in MGI that was made by MGI curators and therefore what would have been a duplicate annotation from the load was suppressed. I know we had discussed that this type of filtering is difficult, but at some point we need to tackle this issue.
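
The two 'NOT' rules could be sketched together as below. This assumes the existing mouse annotations are available in memory as simple records, and the set of experimental evidence codes is likewise an assumption; none of the names here come from the actual pipeline.

```python
# Assumption: which evidence codes count as "experimentally supported".
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def is_negated(qualifier):
    """True if a pipe-separated qualifier string includes NOT."""
    return "NOT" in qualifier.split("|")


def select_iso_candidates(incoming, existing_mouse):
    """Filter incoming ortholog annotations against the two 'NOT' rules.

    incoming: dicts with keys mouse_id, go_id, qualifier.
    existing_mouse: dicts with keys mouse_id, go_id, qualifier, evidence.
    """
    # Rule 2: gene/term pairs that already carry an experimental NOT in mouse.
    blocked = {
        (a["mouse_id"], a["go_id"])
        for a in existing_mouse
        if is_negated(a["qualifier"]) and a["evidence"] in EXPERIMENTAL
    }
    selected = []
    for ann in incoming:
        # Rule 1: incoming NOT annotations are ignored, not transferred,
        # and they do not block positive assertions for the same pair.
        if is_negated(ann["qualifier"]):
            continue
        if (ann["mouse_id"], ann["go_id"]) in blocked:
            continue
        selected.append(ann)
    return selected
```

The second rule is the part that depends on having the current mouse annotations to hand; here they are passed in as a plain list.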

ukemi commented 1 year ago

Conclusion from today's call is that @sierra-moxon will try to deal with detecting the NOT annotations that already exist and suppress the ISOs, but we will leave the duplicates as a bigger GO issue to be dealt with at a later date.

sierra-moxon commented 1 year ago

currently starting to work on detecting NOT annotations that already exist and suppressing ISOs that assert the non-negated form of the annotation.

pgaudet commented 8 months ago

@LiNiMGI Can you check whether this NOT issue has been resolved?

These should be loaded with correct date and assigned_by GOC

Can this ticket be closed?

pgaudet commented 8 months ago

@LiNiMGI to check whether the NOT issue has been resolved.

But in general, where should this occur in the pipeline, should it be a GO rule?

LiNiMGI commented 8 months ago

@pgaudet the first "not" rule (see above) was implemented in the pipeline. The second "not" rule is very hard to do without a database, so we will have to hold off on it.

We can discuss further with other curators about how to handle inferred "not" annotations in general. For example, GO_REF:0000024 specifically says: "Only annotations with an experimental evidence code and which do not have the 'NOT' qualifier are transferred." This is in agreement with the first "not" rule above.

we can close this ticket for now and bring the "not" discussion somewhere else. Thanks, Li

geneontology / pipeline

Add human orthology file to MGI datastream #328