Closed dhimmel closed 8 years ago
I'm pretty sure django-genes uses Entrez. I agree that they are generally a bit nicer/more stable than symbols.
We might actually be able to use django-genes here if needed in the models.
For `cancer-data`, I think it probably makes sense to work with tables for converting genes, rather than `greenelab/django-genes`, as we would otherwise end up making lots of API calls.
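A table-based conversion could look roughly like the sketch below. It is not code from this repository: the tab-separated layout, the column names `symbol` and `entrez_gene_id`, and the helper names `load_symbol_map` and `convert_symbols` are all assumptions for illustration. The two rows are only there to make the example runnable. A real table would be downloaded once and read from disk.

```python
import csv
import io

# Illustrative mapping table. The column names are assumptions, not the
# layout of any particular Xena or Entrez download.
MAPPING_TSV = """symbol\tentrez_gene_id
TP53\t7157
BRCA1\t672
"""


def load_symbol_map(handle):
    """Return {symbol: entrez_gene_id}, dropping symbols with conflicting IDs."""
    mapping = {}
    ambiguous = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        symbol, gene_id = row["symbol"], int(row["entrez_gene_id"])
        # A symbol seen earlier with a different ID cannot be converted safely.
        if mapping.get(symbol, gene_id) != gene_id:
            ambiguous.add(symbol)
        mapping[symbol] = gene_id
    for symbol in ambiguous:
        del mapping[symbol]
    return mapping


def convert_symbols(symbols, mapping):
    """Split symbols into converted IDs and symbols without an unambiguous ID."""
    converted = {s: mapping[s] for s in symbols if s in mapping}
    unmapped = [s for s in symbols if s not in mapping]
    return converted, unmapped


symbol_map = load_symbol_map(io.StringIO(MAPPING_TSV))
converted, unmapped = convert_symbols(["TP53", "BRCA1", "NOT_A_GENE"], symbol_map)
```

Returning the unmapped symbols separately, instead of silently dropping them, makes it easier to see how much of a dataset a given table fails to cover.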
For conversions such as this, I think it's best to use the mapping that is the inverse of whatever was used by the data creators, due to the ambiguity of gene symbols. I'm waiting to hear back from Xena Browser regarding gene mapping. It's possible they have actually created mapping files specific to their symbols.
I may be misunderstanding what `django-genes` does, but it seems like it may be of the most utility for providing `django-cognoma` or the javascript app with additional gene metadata for a small set of identifiers?
I noticed the `HiSeqV2` metadata includes a `probeMap` attribute with the value `/probeMap/hugo_gencode_v24_gtf`. There's a file located at https://tcga.xenahubs.net/download/probeMap/hugo_gencode_v24_gtf (metadata) whose head is:
id | gene | chrom | chromStart | chromEnd | strand |
---|---|---|---|---|---|
DDX11L1 | DDX11L1 | chr1 | 11869 | 14409 | + |
WASH7P | WASH7P | chr1 | 14404 | 29570 | - |
MIR6859-1 | MIR6859-1 | chr1 | 17369 | 17436 | - |
RP11-34P13.3 | RP11-34P13.3 | chr1 | 29554 | 31109 | + |
MIR1302-2 | MIR1302-2 | chr1 | 30366 | 30503 | + |
FAM138A | FAM138A | chr1 | 34554 | 36081 | - |
Therefore, one possibility for converting symbols in `HiSeqV2` to standardized IDs would be to use the genomic location information available in `probeMap/hugo_gencode_v24_gtf`. I noticed this file also contained the date-naming issue discussed in #4. Therefore, the corruption is potentially reversible.
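The location-based idea can be sketched as a join on coordinates. This is only an illustration under stated assumptions: it assumes the probeMap file is tab-separated with the header shown in its head, and the `ANNOTATION` dictionary stands in for a reference annotation that carries standardized IDs. The IDs in it are placeholders, not real identifiers, and `symbols_to_ids` is a hypothetical helper name.

```python
import csv
import io

# First two rows of the probeMap head, assumed tab-separated.
PROBEMAP_TSV = """id\tgene\tchrom\tchromStart\tchromEnd\tstrand
DDX11L1\tDDX11L1\tchr1\t11869\t14409\t+
WASH7P\tWASH7P\tchr1\t14404\t29570\t-
"""

# Hypothetical reference annotation keyed by location.
# The values are placeholders, NOT real gene identifiers.
ANNOTATION = {
    ("chr1", 11869, 14409, "+"): "PLACEHOLDER_ID_1",
    ("chr1", 14404, 29570, "-"): "PLACEHOLDER_ID_2",
}


def symbols_to_ids(probemap_handle, annotation):
    """Map each probeMap id to the annotation ID at the same location (or None)."""
    result = {}
    for row in csv.DictReader(probemap_handle, delimiter="\t"):
        key = (
            row["chrom"],
            int(row["chromStart"]),
            int(row["chromEnd"]),
            row["strand"],
        )
        result[row["id"]] = annotation.get(key)
    return result


ids_by_symbol = symbols_to_ids(io.StringIO(PROBEMAP_TSV), ANNOTATION)
```

Because the lookup key is the location and not the symbol, a symbol that was mangled into a date would still resolve as long as its coordinates are intact. Exact coordinate matching only works if both tables come from the same annotation release and assembly.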
I've just played around with trying to reproduce the gene names available in `PANCAN_mutation` by mapping locations to the corresponding `hugo_gencode_v24_gtf` Ensembl IDs. For many observations, the Ensembl `gene_id` data aren't matching the original `gene` names in the PANCAN dataset. It seems like this difference may be due to the update from genome assembly GRCh37 to GRCh38 (i.e. mutation data was potentially labeled using GRCh37, but the gtf file seems based on GRCh38; see example below).
As we ultimately try to integrate the datasets, it seems like it will be important to ensure that we are using a standard reference genome version (ideally whatever version HiSeqV2 was mapped against).
For example:
sample_id | chromosome | gene (`PANCAN_mutation`) | gene_id (gtf file) | corresponding gene (via Ensembl) | start (`PANCAN_mutation`) | start (gtf) | end (gtf) |
---|---|---|---|---|---|---|---|
TCGA-D8-A1J8-01 | chr10 | A1CF | ENSG00000228651.1 | RP11-556E13.1 | 52,587,953 | 52,556,702 | 52,755,409 |
A1CF location GRCh37p13: Chromosome 10: 52,559,169-52,645,435 reverse strand. A1CF location GRCh38p5: Chromosome 10: 50,799,409-50,885,675 reverse strand.
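The mismatch in this example can be checked with a simple interval test. The sketch below uses only the coordinates quoted above. The helper name `contains` and the variable names are mine, and treating both interval ends as inclusive is an assumption.

```python
def contains(start, end, position):
    """Return True if position lies within [start, end], ends inclusive."""
    return start <= position <= end


# Mutation start for TCGA-D8-A1J8-01 as reported in PANCAN_mutation.
MUTATION_START = 52_587_953

# A1CF on chromosome 10 under each assembly, as quoted above.
A1CF_GRCH37 = (52_559_169, 52_645_435)
A1CF_GRCH38 = (50_799_409, 50_885_675)

# ENSG00000228651.1 (RP11-556E13.1) coordinates from the gtf file.
GTF_GENE = (52_556_702, 52_755_409)

in_a1cf_grch37 = contains(*A1CF_GRCH37, MUTATION_START)
in_a1cf_grch38 = contains(*A1CF_GRCH38, MUTATION_START)
in_gtf_gene = contains(*GTF_GENE, MUTATION_START)
```

The position falls inside A1CF under GRCh37 but not under GRCh38, and it falls inside RP11-556E13.1 in the gtf coordinates, which is consistent with the mutation data having been labeled against GRCh37.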
@clairemcleod good call. It looks like HiSeqV2 is mapped to hg38 while PANCAN_mutation is mapped to hg19.
We can easily update the mutation file to hg38 using a liftover tool, but it is definitely important that we do.
re: gene symbols: I am not aware of a 1-to-1 conversion between Ensembl IDs and either gene symbols or Entrez IDs. I am also not aware of a fully correct conversion between hg19 and hg38. A lot of contigs and other previously ambiguous regions have been resolved in hg38. hg38 is definitely recommended for new assemblies or alignments, but for annotation I'd recommend that we be more careful and make sure liftover is doing the right thing.
re: GENCODE annotation: if you are merely matching IDs for preliminary checks, GENCODE v19 is the latest update for GRCh37/hg19.
AFAIC #10 and #12 have addressed this issue. We're now operating entirely using Entrez GeneIDs.
Xena datasets (as retrieved in #1) use symbols to identify genes rather than standardized identifiers, such as Entrez GeneIDs, Ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.
Currently, we use the `HiSeqV2` and `TCGA.PANCAN.sampleMap` datasets, which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.