cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Map HiSeqV2 symbols to entrez gene IDs #8

Closed dhimmel closed 8 years ago

dhimmel commented 8 years ago

This is a temporary solution based on a mapping provided by @jingchunzhu on the Xena Browser Google Group.

The approach successfully converted all genes in the current HiSeqV2 (see version info in download/HiSeqV2.json) into Entrez Gene IDs.

Updates to HiSeqV2 may require a new mapping solution. Since the mapping isn't a permanent item in our processing pipeline, I added it to a mapping/HiSeqV2-genes directory rather than directly to the root directory.
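For readers who want a concrete picture of the conversion described above, here is a minimal sketch, not the code from this PR. The function name `map_symbols_to_entrez` and the toy mapping are hypothetical; the real mapping lives in the mapping/HiSeqV2-genes directory. The sketch also collects symbols that fail to map, which is one way to notice when an updated HiSeqV2 needs a new mapping.

```python
def map_symbols_to_entrez(symbols, symbol_to_entrez):
    """Return (mapped, unmapped): a symbol -> Entrez ID dict and a list of misses."""
    mapped = {}
    unmapped = []
    for symbol in symbols:
        entrez_id = symbol_to_entrez.get(symbol)
        if entrez_id is None:
            # Keep track of symbols the mapping does not cover.
            unmapped.append(symbol)
        else:
            mapped[symbol] = entrez_id
    return mapped, unmapped


# Hypothetical toy mapping for illustration only.
toy_mapping = {"A1BG": 1, "A2M": 2, "TP53": 7157}
mapped, unmapped = map_symbols_to_entrez(["A1BG", "TP53", "NOT_A_GENE"], toy_mapping)
```

An empty `unmapped` list would correspond to the "converted all genes" outcome reported above.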

dhimmel commented 8 years ago

For convenience, here's the direct link to the notebook that processes unc.edu.f72bfbe6-411d-412e-aaab-1a2414e544ec.2146068.rsem.genes.normalized_results to extract the symbol-ID mapping. And here's the diff for 2.TCGA-process.py.
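As a rough sketch of what extracting a symbol-ID mapping from such a file could look like (this is not the linked notebook): it assumes the file is tab-delimited with a header row and that the first column holds identifiers of the form `SYMBOL|ENTREZ_ID`, with `?` standing in for genes without a symbol. The function name `extract_symbol_mapping` and the toy rows are hypothetical, and skipping `?` rows is a choice made for this sketch.

```python
import csv
import io


def extract_symbol_mapping(handle):
    """Build a symbol -> Entrez ID dict from the first column of a TSV file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    mapping = {}
    for row in reader:
        # Assumed format of the first column: "SYMBOL|ENTREZ_ID"
        symbol, _, entrez_id = row[0].partition("|")
        if symbol == "?" or not entrez_id:
            continue  # no usable symbol or no ID on this row
        mapping[symbol] = int(entrez_id)
    return mapping


# Toy stand-in for the real file, for illustration only.
toy_file = io.StringIO(
    "gene_id\tnormalized_count\n"
    "?|100130426\t0.0\n"
    "A1BG|1\t150.3\n"
    "TP53|7157\t980.1\n"
)
toy_result = extract_symbol_mapping(toy_file)
```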

dhimmel commented 8 years ago

I should have mentioned in the commit message that this PR addresses part of #6.

clairemcleod commented 8 years ago

Reviewed, looks good. Based on my understanding of the Google Groups conversation, the Xena datasets should at some point include Entrez IDs natively?

dhimmel commented 8 years ago

> Based on my understanding of the Google Groups conversation, the Xena datasets should at some point include Entrez IDs natively?

@clairemcleod that's what it sounded like. I don't think they're planning to switch to Entrez GeneIDs as their primary identifier, but maybe they would make the gene mapping an official output of their pipeline... that was my interpretation.

@clairemcleod would you like to merge? You can do a squash commit since there is only one commit, and this will avoid an unnecessary merge commit.