cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Gene names converted to dates in Xena's PANCAN_mutation dataset #4

Open dhimmel opened 8 years ago

dhimmel commented 8 years ago

I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

| sample | chr | start | end | reference | alt | gene | effect | DNA_VAF | RNA_VAF | Amino_Acid_Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TCGA-KK-A8IH-01 | chr4 | 164534558 | 164534558 | G | C | 1-Mar | Missense_Mutation | 0.320754716981 | | p.N33K |
| TCGA-EJ-7125-01 | chr16 | 4829717 | 4829717 | C | A | 12-Sep | Missense_Mutation | 0.0357142857143 | | p.R266L |
| TCGA-CH-5762-01 | chr7 | 55874871 | 55874871 | T | C | 14-Sep | Missense_Mutation | 0.0251256281407 | | p.T300A |
| TCGA-G9-6351-01 | chrX | 118767429 | 118767429 | C | A | 6-Sep | Missense_Mutation | 0.0280373831776 | | p.R328M |
| TCGA-G9-6342-01 | chr5 | 132098260 | 132098260 | C | A | 8-Sep | Missense_Mutation | 0.0485436893204 | | p.M204I |

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looked minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted and thus error-prone and irreproducible.
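To gauge how many rows are affected, one option is to scan the gene column for date-like values. The following is a minimal sketch, assuming a tab-delimited file with `sample` and `gene` columns and corrupted values of the form `<day>-<Mon>` as in the rows above. The function name, the regular expression, and the second example row are illustrative and not part of this repository.

```python
import csv
import io
import re

# Excel renders symbols such as SEPT6 or MARCH1 as "6-Sep" or "1-Mar".
# Assumption: corrupted values always look like "<day>-<Mon>".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def find_date_like_genes(tsv_handle, gene_column="gene"):
    """Return (sample, gene) pairs whose gene value looks like an Excel date."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [
        (row["sample"], row[gene_column])
        for row in reader
        if DATE_LIKE.match(row[gene_column])
    ]


# Tiny in-memory example; the second row is hypothetical.
example = io.StringIO(
    "sample\tchr\tstart\tgene\n"
    "TCGA-KK-A8IH-01\tchr4\t164534558\t1-Mar\n"
    "TCGA-XX-0000-01\tchr17\t7577120\tTP53\n"
)
hits = find_date_like_genes(example)
```

A pattern like this only catches the one date style seen so far; other Excel locales or formats would need additional patterns.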

dhimmel commented 8 years ago

Mary Goldman from the UCSC Xena Browser team investigated this issue and wrote:

> We just checked our files and the gene names that are converted to dates are part of the input MAF data file we got from TCGA DCC, which is from the sequencing center, such as Broad.

As a workaround, we could always remap mutations to genes using genomic location as @clairemcleod began experimenting with. See https://github.com/cognoma/cancer-data/issues/6#issuecomment-233482119.

clairemcleod commented 7 years ago

@dhimmel Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue? I've implemented the liftover procedure @gwaygenomics described above, and will submit that via a pull request soon. Remapping everything seems best from a consistency perspective, but would obviously require more computation/time.

dhimmel commented 7 years ago

> Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue?

I like the comprehensive (not patchwork) mapping approach. I do think it will be important to check for consistency with the Xena mapping. In instances where different genes are called, what happened, why, and who's right?

If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.

Computation time is less of a concern -- are we talking minutes (acceptable) or hours (acceptable but not ideal)?
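The fallback strategy above (keep the Xena call when the symbol is resolvable, use the location-based call otherwise) and the consistency check could be sketched roughly as follows. This assumes each mutation already carries both a Xena symbol and a location-based symbol; `choose_gene`, `disagreements`, `resolvable_symbols`, and the example calls are hypothetical names and data for illustration only.

```python
def choose_gene(xena_symbol, location_symbol, resolvable_symbols):
    """Prefer the Xena call when it resolves; otherwise fall back to location."""
    if xena_symbol in resolvable_symbols:
        return xena_symbol
    return location_symbol


def disagreements(calls, resolvable_symbols):
    """Return calls where a resolvable Xena symbol differs from the location call.

    These are the cases worth inspecting by hand: what happened, why,
    and which call is right.
    """
    return [
        (sample, xena, location)
        for sample, xena, location in calls
        if xena in resolvable_symbols
        and location is not None
        and xena != location
    ]


# Hypothetical set of symbols that resolve to a known gene.
resolvable_symbols = {"TP53", "MARCH1"}

# (sample, Xena symbol, location-based symbol); the last two rows are made up.
calls = [
    ("TCGA-KK-A8IH-01", "1-Mar", "MARCH1"),
    ("TCGA-XX-0000-01", "TP53", "TP53"),
    ("TCGA-XX-0000-02", "TP53", "WRAP53"),
]

final = [choose_gene(x, loc, resolvable_symbols) for _, x, loc in calls]
to_review = disagreements(calls, resolvable_symbols)
```

Date-corrupted symbols never count as disagreements here because they are not resolvable; they simply take the location-based call.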

clairemcleod commented 7 years ago

The liftover for the whole dataset takes maybe 10-15 min. If we remap the whole dataset, the part I am more concerned about (and have not yet found an efficient way to do) is mapping a genomic location to an ID. In theory we'll have two tables: one with each observed mutation's location and another with IDs and their corresponding location ranges. We'll need to join these tables wherever the observed location falls within a range.
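One possible way to do this range join without comparing every mutation to every gene is to sort the gene ranges per chromosome and binary-search them. The sketch below is only an illustration under stated assumptions: coordinates are inclusive and on the same genome build, the function names are made up, and the gene ranges and IDs in the example are placeholders rather than real annotations. Because genes can overlap, a position may map to more than one ID.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(genes):
    """Index (chrom, start, end, gene_id) tuples by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, gene_id in genes:
        by_chrom[chrom].append((start, end, gene_id))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        # Running maximum of end coordinates: lets the backward scan stop
        # as soon as no earlier interval can still cover the position.
        max_end, running = [], float("-inf")
        for _, end, _ in intervals:
            running = max(running, end)
            max_end.append(running)
        index[chrom] = (starts, intervals, max_end)
    return index


def genes_at(index, chrom, pos):
    """Return the sorted IDs of all genes whose range contains pos."""
    if chrom not in index:
        return []
    starts, intervals, max_end = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    hits = []
    while i >= 0 and max_end[i] >= pos:
        _, end, gene_id = intervals[i]
        if end >= pos:
            hits.append(gene_id)
        i -= 1
    return sorted(hits)


# Placeholder gene ranges, not real annotations.
index = build_index([
    ("chr1", 100, 200, "A"),
    ("chr1", 150, 400, "B"),
    ("chr1", 300, 350, "C"),
    ("chr2", 10, 20, "D"),
])
overlapping = genes_at(index, "chr1", 160)
```

Building the index costs one sort per chromosome, and each lookup is a binary search plus a short backward scan, so even an inelegant version like this should be fast enough for the whole dataset.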

The other problem I've been running into is finding a source for the location-to-Entrez ID mapping. I thought I'd found something useful with UCSC's knownGene and keggEntrez tables, but these actually contain only ~5300 unique Entrez IDs. Are there other resources anyone would recommend for this mapping?

> If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.

Unless solutions to the above issues clearly present themselves, this seems like a good path forward. Even an inelegant solution to the location -> ID mapping should work fine on that scale.

clairemcleod commented 7 years ago

@Inquisitive-Geek New plan (courtesy of @dhimmel): Use a combination of chromosome and gene symbol to map observed mutations to Entrez IDs. Hopefully the combination will be sufficient to resolve most ambiguity. To address the date conversion in Issue #4, we can either use a location-based mapping, or back out the gene names that Excel could have "translated".
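Backing out the names could look something like the sketch below, which uses the chromosome to disambiguate dates that more than one symbol could have produced (for example, `1-Mar` could come from MARCH1 or MARC1). The lookup table is deliberately incomplete and only covers the dates seen in this issue; the table, the function name, and the chromosome assignments are illustrative and should be checked against an authoritative gene annotation before use.

```python
# Illustrative, incomplete lookup: Excel date text -> candidate (symbol, chromosome).
# Assumption: chromosome assignments here must be verified against a real annotation.
CANDIDATES = {
    "1-Mar": [("MARCH1", "chr4"), ("MARC1", "chr1")],
    "6-Sep": [("SEPT6", "chrX")],
    "8-Sep": [("SEPT8", "chr5")],
    "12-Sep": [("SEPT12", "chr16")],
    "14-Sep": [("SEPT14", "chr7")],
}


def back_out_symbol(date_text, chrom):
    """Return the original symbol if exactly one candidate matches the chromosome."""
    matches = [
        symbol
        for symbol, candidate_chrom in CANDIDATES.get(date_text, [])
        if candidate_chrom == chrom
    ]
    # Anything ambiguous or unknown is left unresolved rather than guessed.
    return matches[0] if len(matches) == 1 else None


# First affected row from this issue: gene "1-Mar" on chr4.
recovered = back_out_symbol("1-Mar", "chr4")
```

Returning `None` for unknown or ambiguous inputs keeps those rows available for the location-based mapping instead.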