Closed dhimmel closed 8 years ago
For convenience, here's the direct link to the notebook that processes `unc.edu.f72bfbe6-411d-412e-aaab-1a2414e544ec.2146068.rsem.genes.normalized_results` to extract the symbol-ID mapping. And here's the diff for `2.TCGA-process.py`.
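For readers without the notebook open, here is a minimal sketch of what extracting a symbol-ID mapping from an RSEM `normalized_results` file might look like. It assumes the file is tab-separated with one header row and a first column of `SYMBOL|ENTREZ_ID` values, with `?` standing in for an unknown symbol. The function name `extract_symbol_to_entrez` and the sample counts are made up for illustration, and the actual notebook may do this differently.

```python
import csv
import io


def extract_symbol_to_entrez(handle):
    """Build a symbol -> Entrez GeneID dict.

    Assumes a tab-separated file with one header row whose first column
    holds 'SYMBOL|ENTREZ_ID' values (an assumption about the file layout).
    """
    mapping = {}
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue
        symbol, sep, entrez = row[0].partition("|")
        # Skip rows with no separator, an unknown symbol ('?'),
        # or a non-numeric ID.
        if not sep or symbol == "?" or not entrez.isdigit():
            continue
        mapping[symbol] = int(entrez)
    return mapping


# Illustrative input only; the counts are invented for this example.
sample = (
    "gene_id\tnormalized_count\n"
    "?|100130426\t0.0\n"
    "A1BG|1\t94.6\n"
    "A2M|2\t5000.1\n"
)
mapping = extract_symbol_to_entrez(io.StringIO(sample))
```

Rows with a `?` symbol are dropped here because they cannot be keyed by symbol. Whether that is the right choice depends on how the downstream dataset labels those genes.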
I should have mentioned in the commit message that this PR addresses part of #6.
Reviewed, looks good. Based on my understanding of the Google Groups conversation, the Xena datasets should at some point include Entrez IDs natively?
> Based on my understanding of the Google Groups conversation, the Xena datasets should at some point include Entrez IDs natively?
@clairemcleod that's what it sounded like. I don't think they're planning to switch to Entrez GeneIDs as their primary identifier, but maybe they would make the gene mapping an official output of their pipeline... that was my interpretation.
@clairemcleod would you like to merge? You can do a squash commit since there is only one commit, and this will avoid an unnecessary merge commit.
This is a temporary solution based on a mapping provided by @jingchunzhu on the Xena Browser Google Group.
The approach successfully converted all genes in the current `HiSeqV2` (see version info in `download/HiSeqV2.json`) into Entrez GeneIDs. Updates to HiSeqV2 may require a new mapping solution. Since the mapping isn't a permanent item in our processing pipeline, I added it to a `mapping/HiSeqV2-genes` directory, rather than directly in the root directory.
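Since a future HiSeqV2 update could introduce symbols the mapping doesn't cover, a check along the following lines could flag when a new mapping is needed. This is only a sketch: `convert_symbols` is a hypothetical helper, not code from this PR, and the symbols and mapping shown are illustrative.

```python
def convert_symbols(symbols, symbol_to_entrez):
    """Split symbols into those with an Entrez GeneID and those without."""
    converted = {}
    unmapped = []
    for symbol in symbols:
        if symbol in symbol_to_entrez:
            converted[symbol] = symbol_to_entrez[symbol]
        else:
            unmapped.append(symbol)
    return converted, unmapped


# Illustrative mapping and symbols; 'NOT_A_GENE' is a deliberate miss.
symbol_to_entrez = {"A1BG": 1, "A2M": 2}
converted, unmapped = convert_symbols(
    ["A1BG", "A2M", "NOT_A_GENE"], symbol_to_entrez
)

if unmapped:
    print(f"{len(unmapped)} symbol(s) lack an Entrez GeneID: {unmapped}")
```

An empty `unmapped` list corresponds to the "converted all genes" outcome described above.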