Gene names converted to dates in Xena's PANCAN_mutation dataset

dhimmel commented 8 years ago

I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

sample	chr	start	end	reference	alt	gene	effect	DNA_VAF	Amino_Acid_Change
TCGA-KK-A8IH-01	chr4	164534558	164534558	G	C	1-Mar	Missense_Mutation	0.320754716981	p.N33K
TCGA-EJ-7125-01	chr16	4829717	4829717	C	A	12-Sep	Missense_Mutation	0.0357142857143	p.R266L
TCGA-CH-5762-01	chr7	55874871	55874871	T	C	14-Sep	Missense_Mutation	0.0251256281407	p.T300A
TCGA-G9-6351-01	chrX	118767429	118767429	C	A	6-Sep	Missense_Mutation	0.0280373831776	p.R328M
TCGA-G9-6342-01	chr5	132098260	132098260	C	A	8-Sep	Missense_Mutation	0.0485436893204	p.M204I

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looks minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted, and thus error-prone and irreproducible.
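To gauge how many rows are affected, a scripted check could flag gene symbols that look like Excel dates. This is a minimal sketch, not the check I actually ran: it assumes a tab-separated table with a `gene` column as in the rows above, and the names `is_date_like` and `find_date_like_rows` are made up for illustration.

```python
import re

# Month abbreviations as they appear in Excel's day-month rendering, e.g. "1-Mar".
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"\d{{1,2}}-(?:{MONTHS})")


def is_date_like(symbol):
    """Return True if a gene symbol looks like an Excel-converted date."""
    return DATE_LIKE.fullmatch(symbol) is not None


def find_date_like_rows(lines, gene_column="gene"):
    """Yield (line_number, gene) for rows whose gene field looks like a date.

    `lines` is a sequence of tab-separated text lines; the first is the header.
    The column name "gene" is an assumption based on the rows shown above.
    """
    header = lines[0].rstrip("\n").split("\t")
    idx = header.index(gene_column)
    for line_number, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) > idx and is_date_like(fields[idx]):
            yield line_number, fields[idx]
```

Running `find_date_like_rows` over the lines of the mutation file would list every suspicious row without opening the file in a spreadsheet.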

dhimmel commented 8 years ago

Mary Goldman from the UCSC Xena Browser team investigated this issue and wrote:

We just checked our files and the gene names that are converted to dates are part of the input MAF data file we got from TCGA DCC, which is from the sequencing center, such as Broad.

As a workaround, we could always remap mutations to genes using genomic location as @clairemcleod began experimenting with. See https://github.com/cognoma/cancer-data/issues/6#issuecomment-233482119.

clairemcleod commented 7 years ago

@dhimmel Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue? I've implemented the liftover procedure @gwaygenomics described above, and will submit that via a pull request soon. Remapping everything seems best from a consistency perspective, but would obviously require more computation/time.

dhimmel commented 7 years ago

> Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue?

I like the comprehensive (not patchwork) mapping approach. I do think it will be important to check for consistency with the Xena mapping. In instances where different genes are called, what happened, why, and who's right?

If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.

Computation time is less of a concern. Are we talking minutes (acceptable) or hours (acceptable but not ideal)?

clairemcleod commented 7 years ago

The liftover for the whole dataset takes maybe 10-15 min. If we remap the whole dataset, the part I am more concerned about (and have not yet come up with an efficient way to do) is mapping a genomic location to an ID. Theoretically we'll have two tables: one with each observed mutation's location and another with IDs and a corresponding location range. Somehow we'll need to merge these tables so that a mutation matches an ID when its observed location falls within that ID's range.
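One possible way to do the range merge is a sorted per-chromosome lookup. The following is only a sketch under assumptions: gene ranges are given as `(chrom, start, end, gene_id)` tuples with inclusive coordinates, genes may overlap, and the names `build_index` and `genes_at` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(gene_ranges):
    """Group (chrom, start, end, gene_id) tuples by chromosome, sorted by start.

    For each chromosome, also store the running maximum of `end`, which lets
    the lookup stop early when no earlier interval can reach the position.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, gene_id in gene_ranges:
        by_chrom[chrom].append((start, end, gene_id))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        max_end = []
        running = float("-inf")
        for _, end, _ in intervals:
            running = max(running, end)
            max_end.append(running)
        index[chrom] = (starts, intervals, max_end)
    return index


def genes_at(index, chrom, pos):
    """Return the sorted gene IDs whose range contains `pos` (inclusive)."""
    if chrom not in index:
        return []
    starts, intervals, max_end = index[chrom]
    # Last interval whose start is <= pos; later intervals cannot contain pos.
    i = bisect_right(starts, pos) - 1
    hits = []
    while i >= 0 and max_end[i] >= pos:
        _, end, gene_id = intervals[i]
        if end >= pos:
            hits.append(gene_id)
        i -= 1
    return sorted(hits)
```

A position that falls in overlapping genes returns several IDs and an intergenic position returns an empty list, so both cases would still need a policy decision.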

The other problem I've been running into is finding a source for a location-to-Entrez-ID mapping. I thought I'd found something useful in UCSC's knownGene and keggEntrez tables, but together these contain only ~5300 unique Entrez IDs. Are there other resources anyone would recommend for this mapping?

> If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.

Unless solutions to the above issues clearly present themselves, this seems like a good path forward. Even an inelegant solution to the location -> ID mapping should work fine on that scale.

clairemcleod commented 7 years ago

@Inquisitive-Geek New plan (courtesy of @dhimmel): Use a combination of chromosome and gene symbol to map observed mutations to Entrez IDs. Hopefully the combination will be sufficient to resolve most ambiguity. To address the date conversion in Issue #4, we can either use a location-based mapping or back out the gene names that Excel could have "translated".
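Backing out the Excel-converted names could look roughly like the sketch below. It is an illustration only: the prefix table is not exhaustive, the function names are hypothetical, and the symbol-to-chromosome lookup passed to `resolve` is assumed to come from whatever gene annotation we settle on.

```python
import re

# Symbol prefixes that Excel could turn into a month (illustrative, not exhaustive).
# "Mar" is ambiguous: both MARCH* and MARC* symbols collapse to the same date.
MONTH_PREFIXES = {
    "Mar": ["MARCH", "MARC"],
    "Sep": ["SEPT"],
    "Dec": ["DEC"],
}

DATE_LIKE = re.compile(r"(\d{1,2})-([A-Z][a-z]{2})")


def candidate_symbols(date_like):
    """Return the gene symbols that could have produced a date such as '1-Mar'."""
    match = DATE_LIKE.fullmatch(date_like)
    if match is None:
        return []
    number, month = match.groups()
    return [f"{prefix}{int(number)}" for prefix in MONTH_PREFIXES.get(month, [])]


def resolve(date_like, chrom, symbol_to_chrom):
    """Pick the single candidate located on `chrom`, or None if not unique.

    `symbol_to_chrom` maps gene symbol -> chromosome and must be supplied by
    the caller from a trusted annotation source.
    """
    matches = [
        symbol
        for symbol in candidate_symbols(date_like)
        if symbol_to_chrom.get(symbol) == chrom
    ]
    return matches[0] if len(matches) == 1 else None
```

For example, `candidate_symbols("1-Mar")` gives two candidates, and the chromosome of the observed mutation is what would decide between them; anything still unresolved could fall back to the location-based mapping.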

cognoma / cancer-data

Gene names converted to dates in Xena's PANCAN_mutation dataset #4