Converting Xena datasets to standard identifiers rather than gene symbols

dhimmel commented 8 years ago

Xena datasets (as retrieved in #1) identify genes by symbol rather than by a standardized identifier such as Entrez GeneIDs, Ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.

Currently, we use the HiSeqV2 and TCGA.PANCAN.sampleMap datasets which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.

cgreene commented 8 years ago

I'm pretty sure django-genes uses Entrez. I agree that they are generally a bit nicer/more stable than symbols.

We might actually be able to use django-genes here if needed in the models.

dhimmel commented 8 years ago

For cancer-data, I think it probably makes sense to work with tables for converting genes, rather than greenelab/django-genes, as we would otherwise end up making lots of API calls.
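As a rough illustration of the table-based approach, here is a minimal sketch in Python. The table contents, column names (`symbol`, `entrez_gene_id`), and function names are assumptions made for the example, not anything Xena or django-genes provides. Symbols that map to more than one ID are dropped rather than guessed at.

```python
import csv
import io

# Toy symbol-to-Entrez table (tab-separated). The column names are assumptions;
# a real table would be read from a downloaded mapping file instead.
MAPPING_TSV = "symbol\tentrez_gene_id\nTP53\t7157\nBRCA1\t672\nEGFR\t1956\n"


def load_symbol_map(handle):
    """Read a two-column TSV into a dict, dropping ambiguous symbols."""
    symbol_to_ids = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        symbol_to_ids.setdefault(row["symbol"], set()).add(int(row["entrez_gene_id"]))
    # Keep only symbols that map to exactly one GeneID.
    return {s: next(iter(ids)) for s, ids in symbol_to_ids.items() if len(ids) == 1}


def convert_symbols(symbols, symbol_map):
    """Split symbols into those with a GeneID and those without one."""
    converted, unmapped = {}, []
    for symbol in symbols:
        if symbol in symbol_map:
            converted[symbol] = symbol_map[symbol]
        else:
            unmapped.append(symbol)
    return converted, unmapped


symbol_map = load_symbol_map(io.StringIO(MAPPING_TSV))
converted, unmapped = convert_symbols(["TP53", "EGFR", "NOT_A_GENE"], symbol_map)
```

Returning the unmapped symbols separately makes it easy to see how much of a dataset would be lost in the conversion.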

For conversions such as this, I think it's best to use the inverse of whatever mapping the data creators used, due to the ambiguity of gene symbols. I'm waiting to hear back from Xena Browser regarding gene mapping. It's possible they have actually created mapping files specific to their symbols.

I may be misunderstanding what django-genes does, but it seems like it may be of the most utility for providing django-cognoma or the javascript app with additional gene metadata for a small set of identifiers?

dhimmel commented 8 years ago

I noticed the HiSeqV2 metadata includes a probeMap attribute with the value /probeMap/hugo_gencode_v24_gtf. There's a file located at https://tcga.xenahubs.net/download/probeMap/hugo_gencode_v24_gtf (metadata) whose head is:

id	gene	chrom	chromStart	chromEnd	strand
DDX11L1	DDX11L1	chr1	11869	14409	+
WASH7P	WASH7P	chr1	14404	29570	-
MIR6859-1	MIR6859-1	chr1	17369	17436	-
RP11-34P13.3	RP11-34P13.3	chr1	29554	31109	+
MIR1302-2	MIR1302-2	chr1	30366	30503	+
FAM138A	FAM138A	chr1	34554	36081	-

Therefore, one possibility for converting symbols in HiSeqV2 to standardized IDs would be to use the genomic location information available in probeMap/hugo_gencode_v24_gtf. I noticed this file also contains the date-naming issue discussed in #4. Therefore, the corruption is potentially reversible.
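A minimal sketch of the location-based idea, assuming we have a reference annotation keyed by (chrom, start, end, strand). The `REFERENCE` dict below is a hypothetical stand-in for such an annotation, and the function names are made up for illustration. The probeMap rows are the first three from the head shown above.

```python
import csv
import io

# First rows of probeMap/hugo_gencode_v24_gtf, copied from the head above.
PROBEMAP_HEAD = (
    "id\tgene\tchrom\tchromStart\tchromEnd\tstrand\n"
    "DDX11L1\tDDX11L1\tchr1\t11869\t14409\t+\n"
    "WASH7P\tWASH7P\tchr1\t14404\t29570\t-\n"
    "MIR6859-1\tMIR6859-1\tchr1\t17369\t17436\t-\n"
)

# Hypothetical stand-in for a reference annotation keyed by genomic location.
# A real one would be built from the matching GENCODE release.
REFERENCE = {
    ("chr1", 11869, 14409, "+"): "ENSG00000223972",
    ("chr1", 14404, 29570, "-"): "ENSG00000227232",
}


def probe_locations(handle):
    """Map each probeMap id to its (chrom, start, end, strand) tuple."""
    locations = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["chrom"], int(row["chromStart"]), int(row["chromEnd"]), row["strand"])
        locations[row["id"]] = key
    return locations


def symbols_to_reference_ids(locations, reference):
    """Look up each symbol's location in the reference; None if absent."""
    return {symbol: reference.get(key) for symbol, key in locations.items()}


locations = probe_locations(io.StringIO(PROBEMAP_HEAD))
ids = symbols_to_reference_ids(locations, REFERENCE)
```

Because the lookup goes through coordinates rather than the symbol text, it would not depend on whether a symbol was mangled, which is why the corruption may be reversible this way. It only works if the reference uses the same assembly and annotation release as the probeMap.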

clairemcleod commented 8 years ago

I've just played around with trying to reproduce the gene names available in PANCAN_mutation by mapping locations to the corresponding hugo_gencode_v24_gtf Ensembl IDs. For many observations, the Ensembl gene_id data aren't matching the original gene names in the PANCAN dataset. It seems like this difference may be due to the update from genome assembly GRCh37 to GRCh38 (i.e. mutation data was potentially labeled using GRCh37, but the gtf file seems based on GRCh38; see example below).

As we ultimately try to integrate the datasets, it will be important to ensure that we are using a single reference genome version (ideally whichever version HiSeqV2 was mapped against).

For example:

sample_id	chromosome	gene (`PANCAN_mutation`)	gene_id (gtf file)	corresponding gene (via Ensembl)	start (`PANCAN_mutation`)	start (gtf)	end (gtf)
TCGA-D8-A1J8-01	chr10	A1CF	ENSG00000228651.1	RP11-556E13.1	52,587,953	52,556,702	52,755,409

A1CF location GRCh37p13: Chromosome 10: 52,559,169-52,645,435 reverse strand. A1CF location GRCh38p5: Chromosome 10: 50,799,409-50,885,675 reverse strand.
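The mismatch can be checked with a few lines of Python. This is only a sketch using the coordinates quoted above; the variable and function names are made up for the example.

```python
# A1CF gene intervals under each assembly, as quoted above.
A1CF_INTERVALS = {
    "GRCh37": ("chr10", 52559169, 52645435),
    "GRCh38": ("chr10", 50799409, 50885675),
}

# Start position of the TCGA-D8-A1J8-01 mutation from PANCAN_mutation.
MUTATION = ("chr10", 52587953)


def contains(interval, position):
    """Return True if the position lies within the interval (inclusive)."""
    chrom, start, end = interval
    pos_chrom, pos = position
    return chrom == pos_chrom and start <= pos <= end


# Which assembly's A1CF interval contains the mutation position?
consistent = {
    assembly: contains(interval, MUTATION)
    for assembly, interval in A1CF_INTERVALS.items()
}
print(consistent)  # {'GRCh37': True, 'GRCh38': False}
```

The position falls inside A1CF only under the GRCh37 coordinates, which is consistent with the mutation data having been labeled against GRCh37.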

gwaybio commented 8 years ago

@clairemcleod good call. It looks like HiSeqV2 is mapped to hg38 while PANCAN_mutation is mapped to hg19.

We can easily update the mutation file to hg38 using a liftover tool, but it is definitely important that we do so.

ypar commented 8 years ago

re: gene symbols or else: I am not aware of a 1-to-1 conversion between Ensembl IDs and either gene symbols or Entrez IDs. I am also not aware of a correct conversion between hg19 and hg38. A lot of contigs and other previously ambiguous regions have been resolved in hg38. hg38 is definitely recommended for new assemblies or alignments, but for annotation I'd recommend that we be more careful and make sure liftover is doing the right thing.

re: gencode annotation: If you are merely matching IDs for preliminary checks, GENCODE v19 is the latest update for GRCh37/hg19.

dhimmel commented 8 years ago

AFAIC #10 and #12 have addressed this issue. We're now operating entirely using Entrez GeneIDs.

cognoma / cancer-data

Converting Xena datasets to standard identifiers rather than gene symbols #6