biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Number of xMHC genes returned by MyGene has decreased #127

Closed dhimmel closed 2 years ago

dhimmel commented 2 years ago

In 2020, I shared a MyGene query for detecting human genes in the xMHC region:

https://mygene.info/v3/query?q=hg38.chr6:25,726,063-33,410,226&fields=entrezgene,symbol,type_of_gene&species=human&size=1000&entrezonly=true

We noticed that the number of genes returned has changed from 318 to 286:

# NCBI gene IDs returned by MyGene in 2020-11
genes_old = {23, 177, 199, 534, 629, 696, 717, 720, 721, 780, 1041, 1192, 1302, 1388, 1460, 1589, 1616, 1797, 2550, 2794, 2880, 2968, 3006, 3007, 3008, 3009, 3010, 3012, 3013, 3017, 3018, 3024, 3077, 3105, 3106, 3107, 3108, 3109, 3111, 3112, 3113, 3115, 3116, 3117, 3118, 3119, 3120, 3122, 3123, 3127, 3128, 3133, 3134, 3135, 3139, 3303, 3304, 3305, 3833, 4049, 4050, 4277, 4340, 4439, 4758, 4795, 4855, 5089, 5460, 5514, 5696, 5698, 5863, 5987, 6015, 6046, 6048, 6222, 6257, 6293, 6499, 6568, 6890, 6891, 6892, 6941, 6992, 7124, 7148, 7407, 7718, 7726, 7738, 7741, 7745, 7746, 7916, 7917, 7918, 7919, 7920, 7922, 7923, 7932, 7936, 7940, 8294, 8329, 8330, 8331, 8332, 8334, 8335, 8336, 8339, 8340, 8341, 8342, 8343, 8344, 8345, 8346, 8347, 8348, 8350, 8351, 8352, 8353, 8354, 8355, 8356, 8357, 8358, 8359, 8360, 8361, 8362, 8363, 8364, 8365, 8366, 8367, 8368, 8369, 8449, 8705, 8859, 8870, 8968, 8969, 8970, 9277, 9278, 9374, 9656, 9753, 10050, 10107, 10211, 10246, 10255, 10279, 10384, 10385, 10471, 10473, 10475, 10537, 10554, 10665, 10786, 10866, 10919, 11074, 11118, 11119, 11120, 11270, 23564, 26212, 26529, 26530, 26531, 26692, 26707, 26716, 26797, 26801, 28973, 29113, 29777, 30834, 50854, 54535, 54718, 55937, 56244, 56658, 57176, 57819, 57827, 58496, 58530, 63940, 63943, 64288, 79692, 79897, 79969, 80317, 80345, 80352, 80736, 80737, 80739, 80740, 80741, 80742, 80862, 80863, 80864, 81696, 81697, 81797, 84547, 85235, 85236, 89870, 94026, 114821, 116511, 135644, 135656, 170679, 170680, 170954, 202658, 203068, 221527, 221545, 221613, 222696, 222698, 253018, 255626, 257202, 259197, 259215, 282890, 285830, 285834, 346157, 346171, 352962, 352990, 387032, 387036, 387055, 389376, 394263, 401247, 401250, 401251, 407002, 414760, 414764, 414765, 414777, 414778, 442179, 442184, 442185, 442186, 442191, 442194, 493812, 651302, 677820, 692092, 692199, 692233, 100126314, 100129195, 100133205, 100302242, 100422934, 100507173, 100507362, 100507436, 100507463, 100507547, 100507679, 100616218, 
100616230, 100616237, 101928743, 101929006, 101929111, 102060414, 102465500, 102465501, 102465537, 102466190, 102466745, 102466754, 102725019, 102725068, 104533120, 105374988, 105375009, 105375013, 105375014, 105375015, 106478956, 106478957, 106480429, 110599563, 113523636}
# NCBI gene IDs returned by MyGene on 2022-04-19
genes_new = {23, 177, 199, 534, 629, 696, 717, 720, 721, 780, 1192, 1302, 1388, 1460, 1589, 1616, 1797, 2550, 2794, 2880, 3006, 3007, 3008, 3009, 3010, 3013, 3017, 3018, 3024, 3077, 3105, 3106, 3107, 3108, 3109, 3111, 3112, 3113, 3115, 3117, 3118, 3119, 3120, 3122, 3123, 3127, 3128, 3133, 3134, 3135, 3139, 3303, 3304, 3305, 3833, 4049, 4050, 4277, 4340, 4439, 4758, 4795, 4855, 5089, 5460, 5514, 5696, 5698, 5863, 5987, 6046, 6048, 6222, 6257, 6293, 6499, 6568, 6890, 6891, 6892, 6941, 6992, 7124, 7148, 7407, 7718, 7726, 7738, 7741, 7745, 7746, 7916, 7917, 7918, 7919, 7920, 7922, 7923, 7936, 7940, 8294, 8329, 8330, 8331, 8332, 8334, 8335, 8336, 8339, 8340, 8341, 8342, 8343, 8344, 8345, 8346, 8347, 8348, 8350, 8351, 8352, 8353, 8354, 8355, 8356, 8357, 8358, 8359, 8360, 8361, 8362, 8363, 8364, 8365, 8366, 8367, 8368, 8369, 8449, 8705, 8859, 8968, 8969, 8970, 9277, 9278, 9374, 9656, 9753, 10050, 10107, 10211, 10246, 10255, 10279, 10384, 10385, 10471, 10473, 10475, 10537, 10554, 10665, 10786, 10866, 10919, 11074, 11118, 11119, 11120, 11270, 23564, 26212, 26530, 26531, 26692, 26716, 26797, 26801, 28973, 29113, 30834, 50854, 54535, 54718, 55937, 56244, 56658, 57176, 57827, 58496, 58530, 63940, 63943, 64288, 65944, 79692, 79897, 79969, 80317, 80345, 80352, 80736, 80737, 80739, 80740, 80742, 80862, 80863, 81696, 81697, 81797, 84547, 85235, 85236, 89870, 114821, 116511, 135644, 135656, 170679, 170954, 202658, 203068, 221527, 221545, 221613, 222696, 222698, 253018, 255626, 257202, 259197, 259215, 282890, 285830, 285834, 346157, 346171, 352962, 352990, 387032, 387036, 387055, 389376, 401247, 401251, 407002, 414760, 414764, 414765, 414777, 414778, 442179, 442184, 442185, 493812, 677820, 692092, 692199, 692233, 100126314, 100129195, 100133205, 100302242, 100422934, 100507173, 100507362, 100507436, 100507463, 100507547, 100507679, 100616218, 101928743, 102060414, 102465500, 102465501, 102466190, 102466745, 102725019, 102725068, 105375009, 105375013, 105379695, 110599563, 113523636}
# genes that have been added to results
>>> genes_new - genes_old
{65944, 105379695}
# genes that have been removed from results
>>> genes_old - genes_new
{1041, 2968, 3012, 3116, 6015, 7932, 8870, 26529, 26707, 29777, 57819, 80741, 80864, 94026, 170680, 394263, 401250, 442186, 442191, 442194, 651302, 100616230, 100616237, 101929006, 101929111, 102465537, 102466754, 104533120, 105374988, 105375014, 105375015, 106478956, 106478957, 106480429}
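
For anyone who wants to reproduce this comparison, here is a minimal sketch of how the query above could be run and diffed from Python. The helper names (`build_xmhc_query_url`, `extract_entrez_ids`, `fetch_xmhc_gene_ids`, `diff_gene_sets`) are made up for illustration. It assumes the v3 query endpoint returns a JSON object with a `hits` list whose entries carry an `entrezgene` field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_xmhc_query_url():
    """Build the xMHC range query URL with the parameters quoted in this issue."""
    params = {
        "q": "hg38.chr6:25,726,063-33,410,226",
        "fields": "entrezgene,symbol,type_of_gene",
        "species": "human",
        "size": 1000,
        "entrezonly": "true",
    }
    return f"{MYGENE_QUERY_URL}?{urlencode(params)}"


def extract_entrez_ids(response):
    """Collect NCBI gene IDs from a parsed query response (assumed 'hits' layout)."""
    return {
        int(hit["entrezgene"])
        for hit in response.get("hits", [])
        if "entrezgene" in hit
    }


def fetch_xmhc_gene_ids():
    """Run the live query; needs network access, so it is not called at import time."""
    with urlopen(build_xmhc_query_url()) as handle:
        return extract_entrez_ids(json.load(handle))


def diff_gene_sets(old, new):
    """Report which gene IDs were added to or removed from the results."""
    return {"added": new - old, "removed": old - new}
```

Calling `diff_gene_sets(genes_old, fetch_xmhc_gene_ids())` would then give the same kind of added/removed summary as the set arithmetic above.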

Taking a look at the first gene no longer returned by the query: https://www.ncbi.nlm.nih.gov/gene/1041 (CDSN), it appears that it should be a gene located in the xMHC.

Any idea what could have caused this change? Wanted to alert you in case some issue has occurred somewhere in the genome stack!

newgene commented 2 years ago

Hi @dhimmel Thanks for reporting this to us!

I believe this is related to Ensembl release 106, which we recently synchronized with. It appears that in the new Ensembl release, the mappings from Ensembl gene IDs to NCBI gene IDs were lost for quite a few genes. Take 1041 from your list as an example: we now have two separate gene records.

http://mygene.info/v3/gene/1041 http://mygene.info/v3/gene/ENSG00000204539

We can verify with this BioMart live query (click "Results" to view the output; see also the screenshot below) that ENSG00000204539 is no longer mapped to NCBI Gene 1041.

image

Because the genomic_pos field in MyGene.info comes from Ensembl, gene object 1041 won't have this field, and so won't be included in a genomic range query, unless Ensembl provides the mapping between ENSG00000204539 and 1041.

I did not check other genes, but I suspect that's the reason these genes are currently missing.
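
To illustrate why a record without genomic_pos drops out of a range query, here is a rough sketch of an overlap check done on the client side. This is not how MyGene.info implements range queries internally. The function names are hypothetical, the coordinates in the example are made up, and it assumes genomic_pos is either a single object or a list of objects with `chr`, `start`, and `end` keys.

```python
def positions(record):
    """Return the genomic_pos entries of a gene record as a list (possibly empty)."""
    pos = record.get("genomic_pos")
    if pos is None:
        return []
    return pos if isinstance(pos, list) else [pos]


def in_region(record, chrom, start, end):
    """True if any genomic_pos entry overlaps chrom:start-end (inclusive)."""
    return any(
        str(p.get("chr")) == chrom and p["start"] <= end and p["end"] >= start
        for p in positions(record)
    )


# Illustrative records only; the coordinates are invented for this example.
with_pos = {"_id": "1041", "genomic_pos": {"chr": "6", "start": 31000000, "end": 31005000}}
without_pos = {"_id": "1041"}  # what the record looks like once the Ensembl mapping is lost

xmhc = ("6", 25726063, 33410226)
hit_with_pos = in_region(with_pos, *xmhc)
hit_without_pos = in_region(without_pos, *xmhc)
```

A record that has lost its genomic_pos can never satisfy the overlap test, whatever its true location is.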

We also received another user report about the missing mappings between Ensembl and NCBI genes, and we have already filed a ticket with the Ensembl helpdesk. Let's see if it can be resolved soon. Otherwise, we can potentially roll back to an earlier version for now.

dhimmel commented 2 years ago

Oh my, this is bad! Let us know when the Ensembl helpdesk gets back to you.

Noting that I independently confirmed with the homo_sapiens_core_106_38 output from https://github.com/related-sciences/ensembl-genes that the ENSG00000204539 to ncbigene:1041 mapping is missing.

I think I'm going to revert my pipelines to Ensembl release 105 while this sorts itself out.

dhimmel commented 2 years ago

A quick look at the lost ncbi xrefs for human genes. Compared to release 105, 106 added ncbi xrefs for 73 genes and removed ncbi xrefs for 2832 genes. Here's a comparison of all NCBI gene xrefs for humans between 105 and 106: ensembl-genes-human-ncbi-xrefs-105-to-106.xlsx.

Python code to generate analysis:

import pandas
# homo_sapiens_core_105_38
url = "https://github.com/related-sciences/ensembl-genes/raw/5581d3dc9085eb947f41572b05118497128ff80b/xref_ncbigene.json.gz"
xrefs_105 = pandas.read_json(url).set_index("ensembl_representative_gene_id").add_suffix("_105")
# homo_sapiens_core_106_38
url = "https://github.com/related-sciences/ensembl-genes/raw/753aa431e0b6ded055843720f7778f6bb17757a5/xref_ncbigene.json.gz"
xrefs_106 = pandas.read_json(url).set_index("ensembl_representative_gene_id").add_suffix("_106")
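# outer join keeps genes that appear in either release
xrefs = xrefs_105.join(xrefs_106, how="outer").convert_dtypes()
xrefs.head()
# number of genes lacking an NCBI gene ID in each release
xrefs[["ncbigene_id_105", "ncbigene_id_106"]].isna().sum()
# writing .xlsx needs an Excel engine such as openpyxl installed
xrefs.to_excel("ensembl-genes-human-ncbi-xrefs-105-to-106.xlsx")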
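
The added and removed counts can be derived from a table shaped like `xrefs`. The following is a small sketch on toy data rather than the real files. It assumes one row per Ensembl gene and the `ncbigene_id_105` / `ncbigene_id_106` column names used above. The Ensembl IDs in the toy frame are placeholders.

```python
import pandas as pd

# Toy stand-in for the joined table; real data comes from the URLs above.
toy_xrefs = pd.DataFrame(
    {
        "ncbigene_id_105": ["1041", None, "23"],
        "ncbigene_id_106": [None, "65944", "23"],
    },
    index=["ENSG_A", "ENSG_B", "ENSG_C"],  # placeholder gene IDs
)

had_105 = toy_xrefs["ncbigene_id_105"].notna()
has_106 = toy_xrefs["ncbigene_id_106"].notna()

# xref present in 106 but not in 105
n_added = int((~had_105 & has_106).sum())
# xref present in 105 but gone in 106
n_removed = int((had_105 & ~has_106).sum())
```

Applying the same two masks to the real `xrefs` frame should give the per-release added and removed counts, provided the index is unique per gene.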

newgene commented 2 years ago

@dhimmel we just released a new version with the fix to this issue. Please confirm if the returned gene hits are expected.

The Ensembl helpdesk replied to us last week saying that they will investigate this issue, so it looks like a fix (or confirmation that the change is expected) will take a while.

We have now modified our algorithm to recover the missing ensemblgene-ncbigene mappings based on NCBI's gene2ensembl mapping file. We only recover 1-1 mappings that are missing from Ensembl's xrefs and whose NCBI gene IDs are not mapped to any other Ensembl gene IDs. The specific ENSG00000204539<->1041 mapping above is now recovered in MyGene.info. Both links below now return exactly the same gene object:

http://mygene.info/v3/gene/1041 http://mygene.info/v3/gene/ENSG00000204539

We already used gene2ensembl to resolve 1-m mappings from Ensembl (that is, if gene2ensembl provides a unique 1-1 mapping, we take the unique mapping from gene2ensembl), so it was relatively easy for us to add an additional check to recover the missing unique mappings. The specific commit is here if you are interested: https://github.com/biothings/mygene.info/commit/00c9b4704e77d2197d17e571579ed91078cb3af0.
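
The recovery rule described above could be sketched roughly as follows. This is only an illustration of the stated conditions, not the code from the linked commit. The function name and the input shapes (a dict of Ensembl gene ID to set of NCBI gene IDs, and a list of (NCBI ID, Ensembl ID) pairs from gene2ensembl) are assumptions, and all IDs other than ENSG00000204539 and 1041 are placeholders.

```python
from collections import defaultdict


def recover_missing_mappings(ensembl_xrefs, gene2ensembl_pairs):
    """Return {ensembl_id: ncbi_id} for unique mappings missing from Ensembl's xrefs.

    ensembl_xrefs: dict of Ensembl gene ID -> set of NCBI gene IDs (from Ensembl).
    gene2ensembl_pairs: iterable of (ncbi_id, ensembl_id) pairs (from NCBI).
    """
    ncbi_to_ens = defaultdict(set)
    ens_to_ncbi = defaultdict(set)
    for ncbi_id, ens_id in gene2ensembl_pairs:
        ncbi_to_ens[ncbi_id].add(ens_id)
        ens_to_ncbi[ens_id].add(ncbi_id)

    # NCBI gene IDs that Ensembl already maps to some Ensembl gene
    ncbi_ids_in_ensembl = set()
    for ids in ensembl_xrefs.values():
        ncbi_ids_in_ensembl.update(ids)

    recovered = {}
    for ncbi_id, ens_ids in ncbi_to_ens.items():
        if len(ens_ids) != 1:
            continue  # not unique on the NCBI side
        (ens_id,) = ens_ids
        if len(ens_to_ncbi[ens_id]) != 1:
            continue  # not unique on the Ensembl side
        if ensembl_xrefs.get(ens_id):
            continue  # Ensembl already provides an xref for this gene
        if ncbi_id in ncbi_ids_in_ensembl:
            continue  # NCBI ID is already mapped to another Ensembl gene
        recovered[ens_id] = ncbi_id
    return recovered


example = recover_missing_mappings(
    {"ENSG00000204539": set(), "ENSG_A": {10}},
    [(1041, "ENSG00000204539"), (10, "ENSG_B"), (20, "ENSG_C"), (20, "ENSG_D")],
)
```

In this toy input only the ENSG00000204539 to 1041 pair is recovered: NCBI gene 10 is already mapped elsewhere by Ensembl, and NCBI gene 20 is not a 1-1 mapping in gene2ensembl.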

dhimmel commented 2 years ago

Sounds like a good workaround using the NCBI-provided mapping! I'll close this issue since it's fixed by MyGene, but let's continue to post here any updates from Ensembl or insights into why their mapping changed.

Have they provided any more details to you?

Ben-Ensembl commented 2 years ago

Hi Chunlei,

Thank you for your patience whilst we continued to investigate this issue.

Hi dhimmel,

In 106 several (NCBI, MIM, Wikigene) Xrefs are missing. This has been fixed for Ensembl 107 which will be released in early July. The release will be announced through the Ensembl mailing lists, social media and blog:

https://www.ensembl.info/

Best wishes

Ben

Ensembl Helpdesk

dhimmel commented 2 years ago

> In 106 several (NCBI, MIM, Wikigene) Xrefs are missing. This has been fixed for Ensembl 107 which will be released in early July.

Thanks @Ben-Ensembl for the info, much appreciated! By the way, is the fix in any of the public Ensembl repositories? I.e. is there a commit we could reference to learn more about the fix and/or the problem? I'm curious not only for this issue but also generally, as a way to keep up to date with the actual source code / data ingestion changes to Ensembl.

Ben-Ensembl commented 2 years ago

No problem, @dhimmel. We were not able to perform a full-fledged root cause analysis because the environment was no longer available, but since the underlying code did not change between Ensembl 105 and 106, we believe the issue was likely caused by the environment conditions when the Ensembl Xref pipeline was run for Ensembl 106.