SuLab / GeneWikiCentral


Duplicate genes #44

Closed stuppie closed 7 years ago

stuppie commented 7 years ago

http://tinyurl.com/yc99rp76

Not all of these are duplicates; some (mostly the human ones) are things other people have incorrectly added to the protein items.

https://bitbucket.org/sulab/wikidatabots/issues/90/duplicate-human-genes

stuppie commented 7 years ago

Merged > 1000 genes and proteins https://github.com/SuLab/scheduled-bots/commit/6ccdf23c67af632a46017d63b5f51d2c207be0ab

There are still some left that need manual intervention. Hundreds were caused by someone from the Russian Wikipedia importing infoboxes, adding UniProt IDs onto genes and Entrez IDs onto proteins.

floatingpurr commented 7 years ago

Excuse me if I come back to this topic.

I'm trying to import all genes of Homo sapiens with their related proteins and other bio data. If I run a simple query, almost 60k genes are returned. This probably happens due to the bad import you've mentioned.
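For context, here is a sketch of such a query (the exact query sits behind a tinyurl link and isn't reproduced here), assuming genes are modeled with instance of (P31) gene (Q7187) and found in taxon (P703) Homo sapiens (Q15978631):

```sparql
# Sketch only: count items modeled as human genes on the Wikidata Query Service.
SELECT (COUNT(DISTINCT ?gene) AS ?genes) WHERE {
  ?gene wdt:P31/wdt:P279* wd:Q7187 ;   # instance of gene (or a subclass, e.g. protein-coding gene)
        wdt:P703 wd:Q15978631 .        # found in taxon: Homo sapiens
}
```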

To get rid of the duplicates, I added the condition ?item wdt:P1057 ?chr . to a new query, to filter on the chromosome on which an entity is localized. In that case 25,722 genes are returned, but if I query for distinct values I get only 25,714 genes. This happens because some genes have more than one chromosome (P1057) statement (can that actually happen?).
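For reference, a sketch of that filtered query; counting rows versus distinct genes makes the effect of multiple P1057 statements visible (the gene/taxon modeling is the same assumption as above):

```sparql
# Sketch: genes restricted to those with at least one chromosome (P1057) statement.
# ?rows counts one result line per (gene, chromosome) pair; ?genes collapses duplicates.
SELECT (COUNT(?gene) AS ?rows) (COUNT(DISTINCT ?gene) AS ?genes) WHERE {
  ?gene wdt:P31/wdt:P279* wd:Q7187 ;   # instance of gene (or subclass)
        wdt:P703 wd:Q15978631 ;        # found in taxon: Homo sapiens
        wdt:P1057 ?chr .               # chromosome
}
```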

I don't know if that is the proper way to fetch the data. I could probably also filter on genomic coordinates (start/end), but if I did that, I would not get items like CALM2 (Q17855536).

Later, if I follow the encodes property, can I get RNA/protein data, or did the bad imports from the Russian Wikipedia mess them up?
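A sketch of what following encodes (P688) from gene items to their products could look like (again assuming the gene/taxon modeling above):

```sparql
# Sketch: follow encodes (P688) from human gene items to the encoded protein/RNA items.
SELECT ?gene ?geneLabel ?product ?productLabel WHERE {
  ?gene wdt:P31/wdt:P279* wd:Q7187 ;   # instance of gene (or subclass)
        wdt:P703 wd:Q15978631 ;        # found in taxon: Homo sapiens
        wdt:P688 ?product .            # encodes
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
```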

I don't know of any other good way to get clean bio data...

sebotic commented 7 years ago

Some genes have more than one localization, as the sequence is completely the same on several chromosomes. These elements are frequently non-protein-coding genes, e.g. microRNAs.

The genes without chromosomal localization are very often pseudogenes/non-coding genes, so if you retrieve all human genes, about half will be such genes. You can filter for all genes without chromosome information: http://tinyurl.com/ycpt2c2d .
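The query behind that link isn't reproduced here, but a filter of that kind would presumably look something like this sketch:

```sparql
# Sketch: human gene items lacking any chromosome (P1057) statement.
SELECT ?gene ?geneLabel WHERE {
  ?gene wdt:P31/wdt:P279* wd:Q7187 ;   # instance of gene (or subclass)
        wdt:P703 wd:Q15978631 .        # found in taxon: Homo sapiens
  FILTER NOT EXISTS { ?gene wdt:P1057 ?chr . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```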

floatingpurr commented 7 years ago

Thank you @sebotic! So getting almost 60k human genes is correct!

As previously mentioned, there are still a few duplicated genes.

Are you going to add genomic localization to pseudogenes/non-coding genes in the future, where available?

stuppie commented 7 years ago

There are only 8 duplicated genes left: http://tinyurl.com/y9elxet8 . It looks like these are linked to Wikipedia pages and need a little work to merge properly. Feel free to take a stab at it!
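The tinyurl query isn't shown here; one way such duplicates can be spotted (an assumption about how the linked query works) is to look for distinct items sharing the same Entrez Gene ID (P351):

```sparql
# Sketch: pairs of distinct human gene items sharing the same Entrez Gene ID (P351).
SELECT ?entrez ?gene1 ?gene2 WHERE {
  ?gene1 wdt:P351 ?entrez ;
         wdt:P703 wd:Q15978631 .      # found in taxon: Homo sapiens
  ?gene2 wdt:P351 ?entrez ;
         wdt:P703 wd:Q15978631 .
  FILTER(STR(?gene1) < STR(?gene2))   # report each pair only once
}
```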

floatingpurr commented 7 years ago

Thanks for your remark, @stuppie. Before doing that, I need to understand the Wikidata merging process and its workflow. An unexplored world to me : )