Closed floatingpurr closed 5 years ago
No problem Andrea, thanks for pointing these out! I haven't looked at all the instances, but in this case in particular, it looks like it was added by Tobias1984 in 2013 (link). There are 54 cases, so I'm guessing it's a combination of merges and old statements. I will take a look at the others.
Looks like some of the others are cases in which a gene has two different Entrez IDs but Ensembl calls it the same gene. Example: https://www.wikidata.org/wiki/Q21821399 https://www.wikidata.org/wiki/Q27107877
I see. The latter is the same case as https://github.com/SuLab/scheduled-bots/issues/19
Regarding Fibronectin-like cases, namely proteins with an ENSG*, I tried to get them all with the following query:
SELECT distinct ?item ?itemLabel
WHERE
{
?item wdt:P594 ?ensg .
?item wdt:P31|wdt:P279 wd:Q8054 .
FILTER NOT EXISTS {?item wdt:P31|wdt:P279 wd:Q7187}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
It turned out there are 2 proteins (namely Fibronectin 1 and Myoglobin) with both an ENSG and an ENSMUS identifier as Ensembl Gene ID. The user you mentioned inserted those statements in 2013. The solution is to remove those 4 statements.
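As a rough illustration of the check described above, here is a minimal sketch that flags items whose Ensembl Gene ID values mix species prefixes (for example ENSG next to ENSMUS...). The function names, the simplified ID pattern, and the input shape (pairs of item ID and Ensembl ID, e.g. exported from the query service) are assumptions for illustration. It is not part of any bot code.

```python
import re
from collections import defaultdict

# Simplified pattern (an assumption): ENS + optional species code + G + digits,
# e.g. ENSG... (human, no species code) or ENSMUSG... (mouse).
GENE_ID = re.compile(r"^ENS([A-Z]*)G\d+$")


def species_prefix(ensembl_id):
    """Return the species code of an Ensembl gene ID ('' for human).

    Returns None when the value does not look like a gene ID at all.
    """
    match = GENE_ID.match(ensembl_id)
    return match.group(1) if match else None


def items_with_mixed_species(pairs):
    """Return {item: species codes} for items carrying more than one code."""
    seen = defaultdict(set)
    for item, ensembl_id in pairs:
        seen[item].add(species_prefix(ensembl_id))
    return {item: codes for item, codes in seen.items() if len(codes) > 1}
```

Calling `items_with_mixed_species` on the (item, P594 value) rows returned by the query would list only the items that need a manual look.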
Regarding CRIP1, someone at 213.96.40.12 marked the item as a protein, but that does not look correct to me, since this is a gene. The solution is to remove the "instance of protein" statement.
If you agree, I'd proceed as suggested.
Looks good to me, thanks
You are welcome!
I've just updated those 3 items. Strangely, this query:
SELECT distinct ?item ?itemLabel
WHERE
{
?item wdt:P594 ?ensg .
?item wdt:P31|wdt:P279 wd:Q8054 .
FILTER NOT EXISTS {?item wdt:P31|wdt:P279 wd:Q7187}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
still returns old results. Probably something similar to https://github.com/SuLab/scheduled-bots/issues/22 is going on.
I'm going to close this issue and report the stale data problem on Phabricator.
Thanks! 🤙
Hi guys! I am opening this issue to report a potential problem that I found in the data.
According to this query:
Try it!
There are some Ensembl IDs reused across items. That seems pretty strange.
For example, Q413766 is the Fibronectin 1 protein, and Q14819473 is its encoding gene. Both items share ?item wdt:P594 'ENSG00000115414'. AFAIK, ENSG* should be reserved for genes. Is there something to check in the data loading process?
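To make the kind of check concrete, here is a minimal sketch (not the query linked above) that groups items by Ensembl ID and keeps the IDs attached to more than one item. The function name and the input shape are assumptions. The first two rows come from this issue, and the third is a placeholder.

```python
from collections import defaultdict


def reused_ids(pairs):
    """Return Ensembl IDs that appear on more than one distinct item."""
    items_by_id = defaultdict(set)
    for item, ensembl_id in pairs:
        items_by_id[ensembl_id].add(item)
    # Keep only IDs shared by at least two items; sort for stable output.
    return {
        eid: sorted(items)
        for eid, items in items_by_id.items()
        if len(items) > 1
    }


rows = [
    ("Q413766", "ENSG00000115414"),    # Fibronectin 1 protein (from this issue)
    ("Q14819473", "ENSG00000115414"),  # its encoding gene (from this issue)
    ("Q00000000", "ENSG00000000000"),  # placeholder row, not real data
]
print(reused_ids(rows))
```

Only the shared ID is reported, together with the items that carry it, so each hit can be reviewed by hand.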
PS: guys at SuLab, please don't hate me too much for my issue submissions 😃