SuLab / scheduled-bots

GeneWiki Scheduled Bots
MIT License

Double ENSG codes in Wikidata human genes #19

Closed floatingpurr closed 5 years ago

floatingpurr commented 6 years ago

Hello guys. I hope I'm posting in the right place. I was mapping my local Ensembl IDs to Wikidata when I found some duplicate ENSG codes in the Wikidata human gene collection.

For example, Q18035090 and Q30251272 have the same IDs.

Here is the full list.

It seemed strange to me but maybe it's perfectly normal. :wink:

Bye!

stuppie commented 6 years ago

Yes, it's because Entrez and Ensembl disagree on these closely related genes. In the example, the two items have different Entrez Gene IDs, so they are two separate items. The Ensembl mappings to the Entrez IDs map to both, so both items have xrefs to the Ensembl IDs.

Actually, looking at the primary sources, I'm not sure why this is. The data in Wikidata comes from mygene, which lists both Ensembl IDs: https://mygene.info/v3/gene/10168?fields=ensembl But in the Entrez and Ensembl entries themselves I only see 1-to-1 mappings: https://www.ncbi.nlm.nih.gov/gene?cmd=retrieve&dopt=default&list_uids=10168 https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000186448;r=3:44584888-44648471

floatingpurr commented 6 years ago

OK, perfect. A quick recap of what I understood. Situations like this:

<gene1> wdt:P594 <ensgA>
<gene2> wdt:P594 <ensgA>

can legitimately occur, even though they violate the distinct values constraint of P594.
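To make the pattern concrete, here is a minimal sketch of how one might spot such shared values in a local dump of (item, P594 value) pairs. The function name `shared_ensembl_ids` and the sample pairs are hypothetical and only mirror the `<gene1>`/`<ensgA>` placeholders used in the recap; this is not how the bot itself checks anything.

```python
from collections import defaultdict


def shared_ensembl_ids(pairs):
    """Map each Ensembl ID used by more than one item to the sorted list of those items.

    `pairs` is an iterable of (item, ensembl_id) tuples, e.g. rows exported
    from a query over wdt:P594. Both names are placeholders, not real IDs.
    """
    items_by_ensg = defaultdict(set)
    for item, ensg in pairs:
        items_by_ensg[ensg].add(item)
    # Keep only Ensembl IDs that would trip the distinct values constraint.
    return {
        ensg: sorted(items)
        for ensg, items in items_by_ensg.items()
        if len(items) > 1
    }


# Hypothetical sample data following the placeholders in the recap.
pairs = [("gene1", "ensgA"), ("gene2", "ensgA"), ("gene3", "ensgB")]
print(shared_ensembl_ids(pairs))  # {'ensgA': ['gene1', 'gene2']}
```

Using sets means an item that lists the same Ensembl ID twice is not counted as a duplicate.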

As for the one-to-many mappings that differ from the primary sources, I have no idea. In Wikidata, and I guess in mygene as well, about 6% of human genes have one-to-many mappings between Entrez and Ensembl (see here). I don't know whether there are other issues like the one you mentioned among those multi-ENSG genes.

In all those cases, from the Wikidata point of view, the constraints of P594 are violated; this time it is the single value constraint. But this follows from the nature of the data. As you pointed out, the cause is the different logic behind the gene calls. However, most human genes do not seem to have such a (formal) constraint violation (see here).
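For anyone wanting to reproduce that kind of count locally, the following is a rough sketch, assuming the same sort of (item, P594 value) pairs as input. The function name `multi_value_share` and the sample data are made up for illustration, and the resulting share is not meant to match the figure quoted above.

```python
from collections import defaultdict


def multi_value_share(pairs):
    """Return (items with more than one Ensembl ID, their share of all items).

    `pairs` is an iterable of (item, ensembl_id) tuples; an item listed here
    is one that would trip the single value constraint.
    """
    ensgs_by_item = defaultdict(set)
    for item, ensg in pairs:
        ensgs_by_item[item].add(ensg)
    if not ensgs_by_item:
        return [], 0.0
    multi = sorted(item for item, ensgs in ensgs_by_item.items() if len(ensgs) > 1)
    return multi, len(multi) / len(ensgs_by_item)


# Hypothetical sample: one of four items carries two Ensembl IDs.
sample = [
    ("gene1", "ensgA"),
    ("gene1", "ensgB"),
    ("gene2", "ensgC"),
    ("gene3", "ensgD"),
    ("gene4", "ensgE"),
]
print(multi_value_share(sample))  # (['gene1'], 0.25)
```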