geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

IEA annotations missing from the GOA mouse file?? #4831

Closed ukemi closed 10 months ago

ukemi commented 11 months ago

As part of the migration of MGI loads to the GOC we will no longer create IEA annotations in-house. Currently we use three methods to create IEAs: SPKW2GO, InterPro2GO and EC2GO. After the migration to the GOC, we will instead import the IEA annotations that are available in the GOA mouse files: goa_mouse.gaf and goa_mouse_isoform.gaf.

As part of our QC, we compared our current production files with the ones we would be receiving from the GOC via the GOA GAF files, and we noticed that there were annotations missing from the new file. We think this might point to a problem with the pipelines at GOA. As an example, the MGI gene Npdc1 (MGI:1099802; UniProtKB:Q64322) has two annotations in the MGI database. One of the annotations is from InterPro2GO and the other is from SPKW2GO. https://www.informatics.jax.org/go/marker/MGI:1099802

When I look at the source file from UniProt, I don't see either of these annotations. I do see other annotations from those methods: https://docs.google.com/spreadsheets/d/1LwwN3RgyGsDQfdggczJ34Qu78XV-WtkB1JtdBOHPZw4

If I look at the UniProt record for the protein: https://www.uniprot.org/uniprotkb/Q64322/entry

I see that it is associated with the keyword 'membrane' and it is associated with InterPro domain IPR009635. Both of these map to 'membrane' (GO:0016020) in the mapping files:

InterPro:IPR009635 Neural proliferation differentiation control-1 > GO:membrane ; GO:0016020
UniProtKB-KW:KW-0472 Membrane > GO:membrane ; GO:0016020

The only annotation I see in the incoming file is an annotation to a method we don't yet use at MGI:

UniProtKB Q64322 Npdc1 located_in GO:0016020 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0162 C Neural proliferation differentiation and control protein 1 Npdc1|Npdc-1 protein taxon:10090 20230911 UniProt
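One way to check which IEA methods are present for a protein in a GAF file is to read the With/From column of its IEA lines. The following is a minimal sketch, assuming tab-delimited lines in the standard GAF 2.x column order; the function names `parse_gaf_line` and `iea_sources` are made up for illustration, and the example line is the annotation above with its tabs restored.

```python
# Column names follow the GAF 2.x column order; the snake_case keys are our own labels.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence_code", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]


def parse_gaf_line(line):
    """Split one tab-delimited GAF line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so the optional trailing columns are always present.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    return dict(zip(GAF_COLUMNS, fields))


def iea_sources(lines, object_id):
    """Return (go_id, with_from) for every IEA line of one gene product."""
    hits = []
    for line in lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        row = parse_gaf_line(line)
        if row["db_object_id"] == object_id and row["evidence_code"] == "IEA":
            hits.append((row["go_id"], row["with_from"]))
    return hits


# The Npdc1 line from this issue, rebuilt with tab separators.
example = "\t".join([
    "UniProtKB", "Q64322", "Npdc1", "located_in", "GO:0016020",
    "GO_REF:0000044", "IEA", "UniProtKB-SubCell:SL-0162", "C",
    "Neural proliferation differentiation and control protein 1",
    "Npdc1|Npdc-1", "protein", "taxon:10090", "20230911", "UniProt",
    "", "",
])

if __name__ == "__main__":
    print(iea_sources([example], "Q64322"))
```

Passing the lines of goa_mouse.gaf instead of the single example line would list every IEA source reported for the accession.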

Is there a reason that the InterPro2GO and the SPKW2GO annotations are not in the mouse annotation file? Do they get suppressed for some reason, or is there a bug in generating them?

Thanks for any help!

@LiNiMGI

alexsign commented 11 months ago

@ukemi the GOA database and QuickGO have all three IEA annotations for Q64322. I'm looking into the data unload. This seems to be not a bug but a feature that removes low-quality, redundant IEA annotations. If MGI needs all IEA annotations, I can modify the public release unload procedure, or use an alternative file/location. In the last release I introduced a new goa_mouse_plus.gaf file, which has canonical + SwissProt isoforms. I can put the full list of GOA mouse annotations there or in another location. Please let me know what you would prefer.

ukemi commented 11 months ago

@alexsign If this is the expected behavior and the 'missing' IEAs are replaced by what is considered a higher quality annotation, then that is fine. We plan to add the missing methods to MGI so we will pick up everything in the linked list above. Once the switch is complete, GOA will be the 'source of truth' for the IEAs and we will rely on the ones coming in with the file. I just wanted to make sure that things were working as expected.

alexsign commented 11 months ago

@ukemi The GOA pipeline does not import any IEAs from the other annotation groups. It generates IEAs in-house every 2 months and publishes them in the release files, after some filters, as I mentioned before. If GOA is becoming the 'source of truth', do you think I should change the filter so that the public release files have all IEAs in them? Sorry if I misunderstood something; we can have a chat about it later this week.

ukemi commented 11 months ago

@alexsign It makes sense that you don't import them. When you generate them in-house, do you run EC2GO, InterPro2GO and SPKW2GO? If so, then you should be running the same annotation pipelines that we run here at MGI. I think the decision on what to include should be made by the more global curation group, but it sounds like you have thought about it and put a rational approach into place. I can bring this up on a future annotation call for further discussion. Do you have documentation for the kinds of filters that you run? This is a holiday week here, so I am only working a couple of days.

ukemi commented 11 months ago

Can you confirm that this annotation

UniProtKB Q64322 Npdc1 located_in GO:0016020 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0162 C Neural proliferation differentiation and control protein 1 Npdc1|Npdc-1 protein taxon:10090 20230911 UniProt

is in the file because it is thought to be higher quality than the similar annotations from InterPro2GO and SPKW2GO?

I would have thought your pipeline would generate those in-house and you have maybe filtered them.

alexsign commented 11 months ago

@ukemi yes, you're right: we generate all of them first and then filter before release. In the database we have the same annotations as in
https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=Q64322

But for the MODs' public release data there is a ranking system, which was implemented by Tony long ago. Rank 1 is for manual annotations, and the IEAs are ranked as follows:

2 UniProtKB-EC
3 UniProtKB-SubCell
4 UniProtKB-UniRule
5 Ensembl orthologs
6 UniProtKB-KW
7 InterPro

So the IEA with the lower number wins, or the one that is more specific.
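The "lower number wins" rule can be sketched as follows. This is only an illustration under stated assumptions, not the GOA implementation: it assumes the ranking is applied among annotations of the same gene product to the same GO term, it ignores the "more specific" part of the rule, and the `Annotation` type, the `source` labels and `keep_best_ranked` are hypothetical names. The rank numbers are the ones listed in this thread.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    object_id: str  # e.g. a UniProtKB accession
    go_id: str
    source: str     # hypothetical label naming the method behind the annotation


# Ranks as listed in the thread: lower number means higher priority.
RANK = {
    "manual": 1,
    "UniProtKB-EC": 2,
    "UniProtKB-SubCell": 3,
    "UniProtKB-UniRule": 4,
    "Ensembl orthologs": 5,
    "UniProtKB-KW": 6,
    "InterPro": 7,
}


def keep_best_ranked(annotations):
    """Keep one annotation per (gene product, GO term): the best-ranked source.

    Assumption: ties and the 'more specific term' rule are not modelled here.
    """
    best = {}
    for ann in annotations:
        key = (ann.object_id, ann.go_id)
        current = best.get(key)
        if current is None or RANK[ann.source] < RANK[current.source]:
            best[key] = ann
    return list(best.values())


# The three IEAs for Npdc1 (Q64322) discussed in this issue.
npdc1 = [
    Annotation("Q64322", "GO:0016020", "InterPro"),
    Annotation("Q64322", "GO:0016020", "UniProtKB-KW"),
    Annotation("Q64322", "GO:0016020", "UniProtKB-SubCell"),
]

if __name__ == "__main__":
    print(keep_best_ranked(npdc1))
```

Under these assumptions only the UniProtKB-SubCell annotation survives for Npdc1, which matches the single line seen in the release file.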

ukemi commented 11 months ago

Thanks @alexsign

This all makes sense!

LiNiMGI commented 10 months ago

@alexsign here is another case: for mouse Defb39 we only see IBA annotations from the GOA mouse file. Just wondering, do you filter out IEAs if there are IBAs? Thanks, Li

alexsign commented 10 months ago

@LiNiMGI Hi Li, can I have the MGI or UniProt ID for this gene?

pgaudet commented 10 months ago

MGI:2672974

LiNiMGI commented 10 months ago

Thanks @alexsign @pgaudet It's Defb39 MGI:2672974 UniProtKB:Q70KL3. Besides the IBA annotations, MGI also has IEA annotations through the SPKW2GO and InterPro2GO mappings (see below), which are not in the GOA mouse file.

MGI has IEAs from InterPro protein domain assignments:
extracellular region GO:0005576 IEA IPR001855
defense response GO:0006952 IEA IPR001855

MGI also has IEAs from SPKW2GO assignments:
extracellular region GO:0005576 IEA KW-0964
defense response GO:0006952 IEA KW-0211
defense response to bacterium GO:0042742 IEA KW-0044

alexsign commented 10 months ago

@LiNiMGI I've checked all the annotations, and they were removed because each IEA has a more specific IBA annotation. I have a check called "Remove parents in the same GO aspect", which was triggered to remove the less specific annotations. It is easy to see if you open https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=Q70KL3 and then look at the GO terms assigned by the following IBAs:
https://www.ebi.ac.uk/QuickGO/term/GO:0002227
https://www.ebi.ac.uk/QuickGO/term/GO:0050829
https://www.ebi.ac.uk/QuickGO/term/GO:0005615
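A check like "Remove parents in the same GO aspect" might look roughly like the sketch below. It is not the GOA code. The `ANCESTORS` table is a small hand-written excerpt assumed for this example rather than loaded from the ontology, restricting removal to IEA lines is also an assumption, and `Annotation` and `remove_parents` are hypothetical names.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    object_id: str
    go_id: str
    evidence: str
    aspect: str  # "C", "F" or "P"


# Hand-written excerpt (assumption): child term -> set of its ancestor terms.
# A real check would derive this from the GO ontology itself.
ANCESTORS = {
    "GO:0005615": {"GO:0005576"},                # extracellular space
    "GO:0050829": {"GO:0042742", "GO:0006952"},  # defense response to Gram-negative bacterium
}


def remove_parents(annotations):
    """Drop IEA annotations whose term is an ancestor of another annotated term
    for the same gene product in the same GO aspect."""
    kept = []
    for ann in annotations:
        has_more_specific = any(
            other.object_id == ann.object_id
            and other.aspect == ann.aspect
            and ann.go_id in ANCESTORS.get(other.go_id, set())
            for other in annotations
        )
        # Assumption: only IEA lines are candidates for removal.
        if ann.evidence == "IEA" and has_more_specific:
            continue
        kept.append(ann)
    return kept


# Defb39 (Q70KL3): the IBA terms linked above plus the IEA terms MGI reports.
defb39 = [
    Annotation("Q70KL3", "GO:0002227", "IBA", "P"),
    Annotation("Q70KL3", "GO:0050829", "IBA", "P"),
    Annotation("Q70KL3", "GO:0005615", "IBA", "C"),
    Annotation("Q70KL3", "GO:0005576", "IEA", "C"),
    Annotation("Q70KL3", "GO:0006952", "IEA", "P"),
    Annotation("Q70KL3", "GO:0042742", "IEA", "P"),
]

if __name__ == "__main__":
    for ann in remove_parents(defb39):
        print(ann.go_id, ann.evidence)
```

With this excerpt, all three IEA terms are dropped and only the IBA annotations remain, which is the pattern seen for Defb39 in the GOA mouse file.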

LiNiMGI commented 10 months ago

Thank you @alexsign for clarifying it, that's very helpful! Li