Closed stuppie closed 7 years ago
Merged > 1000 genes and proteins https://github.com/SuLab/scheduled-bots/commit/6ccdf23c67af632a46017d63b5f51d2c207be0ab
There are still some left that need manual intervention. Hundreds were caused by someone from Russian Wikipedia importing infoboxes, adding UniProt IDs onto genes and Entrez IDs onto proteins.
Excuse me if I come back to this topic.
I'm trying to import all genes of Homo sapiens with their related proteins and other bio data. If I run a simple query, almost 60k genes are returned. That probably happens because of the bad import you mentioned.
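The query itself isn't shown in the thread; a minimal sketch of one that returns all human genes (assuming instance of (P31) gene (Q7187) and found in taxon (P703) Homo sapiens (Q15978631)) might look like:

```sparql
# Sketch: all items that are instances of gene (Q7187)
# and found in taxon Homo sapiens (Q15978631).
SELECT ?gene ?geneLabel WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 .   # found in taxon: Homo sapiens
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

Run against the Wikidata Query Service, this kind of query is what returns the ~60k items discussed here.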
To get rid of the duplicates, I added the condition ?item wdt:P1057 ?chr . to a new query, to filter on the chromosome on which an entity is localized. In that case 25,722 genes are returned, but if I query for distinct values I get only 25,714 genes. That happens because some genes have more than one chromosome (P1057) statement (can that actually happen?).
I don't know if that is the proper way to fetch the data. I could probably also filter on genomic coordinates (start/end), but then I would miss items like CALM2 (Q17855536).
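The chromosome filter described above can be sketched as follows (same assumed gene/taxon items as before; DISTINCT collapses genes that are localized on more than one chromosome, which explains the 25,722 vs 25,714 gap):

```sparql
# Sketch: human genes restricted to those with a chromosome (P1057)
# statement. Without DISTINCT, a gene with two P1057 values appears twice.
SELECT DISTINCT ?gene WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 ;   # found in taxon: Homo sapiens
        wdt:P1057 ?chr .          # chromosome
}
```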
Later, if I followed the encodes properties, could I get RNA/protein data, or did the bad imports from Russian Wikipedia mess them up too?
I don't know of other good ways to get clean bio data...
Some genes have more than one localization, because the sequence is exactly the same on several chromosomes. These elements are frequently non-protein-coding genes, e.g. microRNAs.
The genes without chromosomal localization are very often pseudogenes/non-coding genes, so if you retrieve all human genes, about half will be such genes. You can filter for all genes without chromosome information: http://tinyurl.com/ycpt2c2d .
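The linked query isn't expanded in the thread; a sketch of such a filter, assuming the same gene/taxon items as above, would use FILTER NOT EXISTS:

```sparql
# Sketch: human genes with no chromosome (P1057) statement,
# i.e. mostly pseudogenes and non-coding genes.
SELECT ?gene WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 .   # found in taxon: Homo sapiens
  FILTER NOT EXISTS { ?gene wdt:P1057 ?chr . }
}
```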
Thank you @sebotic! So getting almost 60k human genes is correct!
As previously mentioned, there are still a few duplicated genes.
Are you also going to add genomic localization to pseudogenes/non-coding genes in the future, where available?
There are only 8 duplicated genes left: http://tinyurl.com/y9elxet8 . It looks like these are linked to Wikipedia pages and need a little work to merge properly. Feel free to take a stab at it!
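The linked duplicate-finding query isn't shown; one way to sketch it is to group human genes by their Entrez Gene ID (P351) and keep the IDs that are attached to more than one item (the gene/taxon items are the same assumptions as above):

```sparql
# Sketch: Entrez Gene IDs (P351) attached to more than one
# human gene item, i.e. candidate duplicates to merge.
SELECT ?entrez (COUNT(?gene) AS ?n) WHERE {
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:Q15978631 ;   # found in taxon: Homo sapiens
        wdt:P351 ?entrez .        # Entrez Gene ID
}
GROUP BY ?entrez
HAVING (COUNT(?gene) > 1)
```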
Thanks for your remark @stuppie. Before doing that, I need to understand the wikidata merging process and its workflow. An unexplored world to me : )
http://tinyurl.com/yc99rp76
Not all of these are duplicates; some (mostly the human ones) are cases where other people have incorrectly added gene data to the protein item.
https://bitbucket.org/sulab/wikidatabots/issues/90/duplicate-human-genes