SuLab / scheduled-bots

GeneWiki Scheduled Bots
MIT License

Duplicated Ensembl IDs #23

Closed by floatingpurr 5 years ago

floatingpurr commented 5 years ago

Hi guys! I am opening this issue to report a potential problem that I found in the data.

According to this query:

SELECT ?item ?itemLabel ?item2 ?item2Label 
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item2 wdt:P594 ?ensg .
  FILTER (str(?item) > str(?item2))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


Some Ensembl IDs are reused across items, which seems pretty strange.

For example, Q413766 is the Fibronectin 1 protein, and Q14819473 is its encoding gene. Both items have the statement wdt:P594 'ENSG00000115414'. AFAIK, ENSG* identifiers should be reserved for genes.

Is there something to check in the data loading process?
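The self-join in the query above can also be reproduced offline once the (item, Ensembl ID) pairs have been exported. The following is only a minimal sketch: the function name `find_shared_ensembl_ids` and the placeholder third row are assumptions made for illustration, not part of the bots' code.

```python
from collections import defaultdict


def find_shared_ensembl_ids(pairs):
    """Return the Ensembl IDs that are attached to more than one item.

    `pairs` is an iterable of (item_qid, ensembl_id) tuples, e.g. rows
    exported from a query on wdt:P594. Hypothetical helper, not bot code.
    """
    items_by_id = defaultdict(set)
    for item, ensembl_id in pairs:
        items_by_id[ensembl_id].add(item)
    # Keep only the IDs that are shared by at least two distinct items.
    return {
        ensembl_id: sorted(items)
        for ensembl_id, items in items_by_id.items()
        if len(items) > 1
    }


pairs = [
    ("Q413766", "ENSG00000115414"),    # Fibronectin 1 protein (from this issue)
    ("Q14819473", "ENSG00000115414"),  # its encoding gene (from this issue)
    ("Q_PLACEHOLDER", "ENSG_PLACEHOLDER"),  # made-up row with a unique ID
]
shared = find_shared_ensembl_ids(pairs)
print(shared)
```

Grouping by ID reports each shared ID once, which plays the same role as the `FILTER (str(?item) > str(?item2))` clause that keeps the query from returning every pair twice.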

PS: guys at SuLab, please don't hate me too much for my issue submissions 😃

stuppie commented 5 years ago

No problem Andrea, thanks for pointing these out! I haven't looked at all the instances, but at least in this particular case, it looks like it was added by Tobias1984 in 2013 (link). There are 54 cases, so I'm guessing it's a combination of merges and old statements. Will take a look at the others.

stuppie commented 5 years ago

Looks like some of the others are cases in which a gene has two different Entrez IDs but Ensembl calls it the same gene. Example: https://www.wikidata.org/wiki/Q21821399 https://www.wikidata.org/wiki/Q27107877

floatingpurr commented 5 years ago

I see. The latter is the same case as https://github.com/SuLab/scheduled-bots/issues/19

Regarding Fibronectin-like cases, namely proteins with an ENSG*, I tried to get them all with the following query:

SELECT distinct ?item ?itemLabel
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item wdt:P31|wdt:P279 wd:Q8054 .
  FILTER NOT EXISTS {?item wdt:P31|wdt:P279 wd:Q7187}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


It turns out there are 2 proteins (i.e., Fibronectin 1 and Myoglobin), each with both an ENSG and an ENSMUS identifier as Ensembl Gene ID. The user you mentioned inserted those statements in 2013. The solution is to remove those 4 statements.

Regarding CRIP1, someone at 213.96.40.12 marked the item as a protein, but that does not look correct to me, since it is a gene. The solution is to remove the "instance of protein" statement.
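For anyone who wants to re-check this without the query service, the filter in the query above (typed as protein, not typed as gene, yet carrying an Ensembl Gene ID) can be mirrored on exported item data. This is a rough sketch under assumptions: the dictionary layout and the function name `proteins_with_gene_ids` are made up for illustration, and the third item is a placeholder.

```python
PROTEIN = "Q8054"  # protein, as used in the query
GENE = "Q7187"     # gene, as used in the query


def proteins_with_gene_ids(items):
    """Return QIDs typed as protein but not as gene that carry a P594 value.

    `items` maps a QID to a dict with:
      "types": set of QIDs reached via P31 or P279
      "ensembl_gene_ids": list of P594 values
    Hypothetical data layout, chosen only for this sketch.
    """
    return sorted(
        qid
        for qid, data in items.items()
        if data["ensembl_gene_ids"]
        and PROTEIN in data["types"]
        and GENE not in data["types"]
    )


items = {
    "Q413766": {"types": {PROTEIN}, "ensembl_gene_ids": ["ENSG00000115414"]},
    "Q14819473": {"types": {GENE}, "ensembl_gene_ids": ["ENSG00000115414"]},
    # Placeholder item typed as both protein and gene: excluded, like in the query.
    "Q_PLACEHOLDER": {"types": {PROTEIN, GENE}, "ensembl_gene_ids": ["ENSG_PLACEHOLDER"]},
}
suspects = proteins_with_gene_ids(items)
print(suspects)
```

Items typed as both protein and gene are skipped on purpose, matching the `FILTER NOT EXISTS` clause.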

If you agree, I'd proceed as suggested.

stuppie commented 5 years ago

Looks good to me, thanks

floatingpurr commented 5 years ago

You are welcome!

I've just updated those 3 items. Strangely, this query:

SELECT distinct ?item ?itemLabel
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item wdt:P31|wdt:P279 wd:Q8054 .
  FILTER NOT EXISTS {?item wdt:P31|wdt:P279 wd:Q7187}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


still returns the old results. Probably something similar to https://github.com/SuLab/scheduled-bots/issues/22 is going on.

I'm going to close this issue and report the stale-data problem in Phabricator.

Thanks! 🤙