arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Fix, update and automate generation of detailed_gene and summary_gene tables #913

Closed by pfpjs 5 years ago

pfpjs commented 5 years ago

This fixes the generation of detailed_gene and summary_gene tables from Ensembl BioMart (v95 as of this update). Addresses #902 and #912. Some notes:

brentp commented 5 years ago

can you upload detailed_gene_table_v95 and the summary table somewhere I can download them?

pfpjs commented 5 years ago

There you go: https://drive.google.com/open?id=1q5geCRd0EPqGJcJQ0j2D_V0IO7Z0Q3vq

brentp commented 5 years ago

the tests don't pass for me with these updates. I am looking into it, but any insight would be great.

brentp commented 5 years ago

I see, it has non-ASCII characters, e.g. PKCα, which SQLAlchemy does not like. I'll get this working.
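For anyone hitting the same failure, a quick way to locate the offending entries is to scan the table for characters outside ASCII. This is only a minimal sketch: the helper name `find_non_ascii` and the sample rows are made up for illustration (only PKCα comes from this thread), and it is not the check gemini itself uses.

```python
def find_non_ascii(lines):
    """Yield (line_number, offending_characters) for lines with non-ASCII text."""
    for number, line in enumerate(lines, start=1):
        # Collect each distinct character whose code point is outside ASCII.
        offenders = sorted({ch for ch in line if ord(ch) > 127})
        if offenders:
            yield number, "".join(offenders)


# Hypothetical rows standing in for a gene table.
sample = ["GENE_A\tPKCα\n", "GENE_B\tRNF53\n"]
hits = list(find_non_ascii(sample))
print(hits)
```

To run it on a real file, pass an open handle instead of the list, e.g. `open(path, encoding="utf-8")`.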

pfpjs commented 5 years ago

Whoops, Unicode characters in the HGNC aliases or synonyms are messing it up.

Running the following command: iconv -f UTF-8 -t ASCII//TRANSLIT summary_gene_table_v95 seems to sanitize the file, but it could leave some gene synonyms and aliases mangled.

The best approach would be to convert the HGNC_download file first and then regenerate everything. I'm currently testing that and will let you know how it goes.
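A rough Python equivalent of that sanitizing step might look like the sketch below. It is an assumption-laden illustration, not what iconv does internally and not the fix that ended up in gemini: the `GREEK` mapping and the `to_ascii` helper are hypothetical, and anything that is neither mapped nor decomposable is silently dropped, which is exactly how synonyms can get mangled.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping. Greek letters have no ASCII
# decomposition, so they need explicit replacements.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def to_ascii(text):
    """Fold text to ASCII: map known Greek letters, strip accents, drop the rest."""
    mapped = "".join(GREEK.get(ch, ch) for ch in text)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", mapped)
    # Encoding with "ignore" discards the combining marks and any other
    # character that still has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("PKCα"))
print(to_ascii("Sjögren"))
```

Spelling out the Greek letters keeps an alias like PKCα recognizable, whereas dropping the character would collapse it to PKC and could collide with a different alias.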

brentp commented 5 years ago

I got it fixed inside of gemini. Don't worry about it! Thanks again!

brentp commented 5 years ago

I will have to revert this. The coordinates are for hg38.

pfpjs commented 5 years ago

Major whoops! I think I've fixed it (changed www.ensembl.org to grch37.ensembl.org basically) and opened another PR. Sorry again!
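To make the host choice harder to get wrong, the genome build could be tied to the BioMart host explicitly. This is only a sketch: the two host names are the ones mentioned above, while the `ENSEMBL_HOSTS` mapping, the `biomart_url` helper, and the `/biomart/martservice` path are assumptions for illustration and not the code in the PR.

```python
from urllib.parse import urlunsplit

# Assumed mapping from genome build to Ensembl host.
ENSEMBL_HOSTS = {
    "GRCh37": "grch37.ensembl.org",  # archive site, hg19-style coordinates
    "GRCh38": "www.ensembl.org",     # main site, hg38 coordinates
}


def biomart_url(build):
    """Return the assumed BioMart service URL for a genome build."""
    if build not in ENSEMBL_HOSTS:
        raise ValueError("unknown genome build: %s" % build)
    return urlunsplit(("https", ENSEMBL_HOSTS[build], "/biomart/martservice", "", ""))


print(biomart_url("GRCh37"))
```

Raising on an unknown build, rather than falling back to www.ensembl.org, avoids silently downloading coordinates for the wrong assembly.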

brentp commented 5 years ago

We are going to release 0.30.0 as the next version and then I'll get this in after that. Thanks so much for figuring it out and updating the PR. I just want to get this out and then do smaller updates from here.