cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Extract gene information #30

dhimmel closed this 8 years ago

dhimmel commented 8 years ago

Closes https://github.com/cognoma/cancer-data/issues/23

This downloads the latest Entrez Gene information from their FTP site (updated daily). Obsoleted genes have missing values for the columns that come from Entrez Gene. It's unclear how we want to proceed with respect to making backing/django-genes and cancer-data use the same gene data.
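
To illustrate why obsoleted genes end up with missing values, here is a minimal sketch of a left join between the gene IDs present in the data and the Entrez Gene table. It is not the code from this PR: the function name `annotate_genes`, the inline stand-in table, and the ID `999999999` (a made-up obsolete gene) are assumptions for illustration. The column names `GeneID`, `Symbol`, `description`, `chromosome`, and `type_of_gene` follow the Entrez Gene `gene_info` file format.

```python
import io

import pandas as pd

# Tiny stand-in for the Entrez Gene gene_info table (tab-separated).
# The real file would be downloaded from the Entrez Gene FTP site.
GENE_INFO_TSV = (
    "GeneID\tSymbol\tdescription\tchromosome\ttype_of_gene\n"
    "1\tA1BG\talpha-1-B glycoprotein\t19\tprotein-coding\n"
    "2\tA2M\talpha-2-macroglobulin\t12\tprotein-coding\n"
)

# Hypothetical mapping from gene_info column names to genes.tsv column names.
RENAME = {
    "GeneID": "entrez_gene_id",
    "Symbol": "symbol",
    "description": "description",
    "chromosome": "chromosome",
    "type_of_gene": "gene_type",
}


def annotate_genes(gene_ids, gene_info):
    """Left-join gene IDs onto Entrez Gene annotations.

    IDs absent from gene_info (e.g. obsoleted genes) are kept,
    with NaN in every annotation column.
    """
    genes = pd.DataFrame({"entrez_gene_id": list(gene_ids)})
    info = gene_info.rename(columns=RENAME)[list(RENAME.values())]
    return genes.merge(info, on="entrez_gene_id", how="left")


gene_info = pd.read_csv(
    io.StringIO(GENE_INFO_TSV), sep="\t", dtype={"chromosome": str}
)
# 999999999 is a made-up ID standing in for an obsoleted gene.
annotated = annotate_genes([1, 2, 999999999], gene_info)
print(annotated)
```

A left join keeps every gene that appears in the cancer data even when Entrez Gene no longer lists it, which matches the missing-value behavior described above.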

dhimmel commented 8 years ago

Here's the head of genes.tsv, which would become part of the Cognoma data release:

| entrez_gene_id | symbol | description | chromosome | gene_type | synonyms | aliases | n_mutations | mutation_frequency | mean_expression | mutation | expression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A1BG | alpha-1-B glycoprotein | 19 | protein-coding | A1B ABG GAB HYST2477 | alpha-1B-glycoprotein HEL-S-163pA epididymis secretory sperm binding protein Li 163pA | 30 | 0.004106 | 6.71 | 1 | 1 |
| 2 | A2M | alpha-2-macroglobulin | 12 | protein-coding | A2MD CPAMD5 FWP007 S863-7 | alpha-2-macroglobulin C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5 alpha-2-M | 130 | 0.01779 | 13.34 | 1 | 1 |
| 3 | A2MP1 | alpha-2-macroglobulin pseudogene 1 | 12 | pseudo | A2MP | pregnancy-zone protein pseudogene | 4 | 0.0005475 | | 1 | 0 |
| 9 | NAT1 | N-acetyltransferase 1 | 8 | protein-coding | AAC1 MNAT NAT-1 NATI | arylamine N-acetyltransferase 1 N-acetyltransferase 1 (arylamine N-acetyltransferase) N-acetyltransferase type 1 arylamide acetylase 1 monomorphic arylamine N-acetyltransferase | 17 | 0.002327 | 6.729 | 1 | 1 |
| 10 | NAT2 | N-acetyltransferase 2 | 8 | protein-coding | AAC2 NAT-2 PNAT | arylamine N-acetyltransferase 2 N-acetyltransferase 2 (arylamine N-acetyltransferase) N-acetyltransferase type 2 arylamide acetylase 2 | 26 | 0.003559 | 2.086 | 1 | 1 |

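
A minimal sketch of how a consumer might load a table like genes.tsv with pandas follows. It uses a two-row inline sample taken from the preview rather than the real file, keeps only a subset of the columns, and assumes that `mutation` and `expression` are 0/1 flags marking whether a gene appears in the mutation and expression datasets. That reading is an assumption based on the A2MP1 row, where `expression` is 0 and `mean_expression` is empty.

```python
import io

import pandas as pd

# Inline sample with a subset of the genes.tsv columns (tab-separated).
# Values are copied from the preview; the real file has more columns and rows.
SAMPLE_TSV = (
    "entrez_gene_id\tsymbol\tn_mutations\tmutation_frequency"
    "\tmean_expression\tmutation\texpression\n"
    "1\tA1BG\t30\t0.004106\t6.71\t1\t1\n"
    "3\tA2MP1\t4\t0.0005475\t\t1\t0\n"
)

# Empty fields are parsed as NaN by default.
genes = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t")

# Assumed meaning: keep genes present in both the mutation and expression data.
in_both = genes[(genes["mutation"] == 1) & (genes["expression"] == 1)]

print(genes[["symbol", "mean_expression"]])
print(in_both["symbol"].tolist())
```
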
dhimmel commented 8 years ago

Do not review yet. I will update this once https://github.com/cognoma/genes/pull/1 is resolved.