cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Which types of mutation effects should be ignored? #2

Open dhimmel opened 8 years ago

dhimmel commented 8 years ago

The PANCAN_mutation dataset (online doc) contains several types of mutations under the effect column. My processing of the dataset (notebook) yielded the following mutation effects and their frequencies (as counts and percentages):

Effect Count Percent
Missense_Mutation 1,044,846 58.152%
Silent 432,995 24.099%
Nonsense_Mutation 81,092 4.513%
RNA 71,493 3.979%
Frame_Shift_Del 46,941 2.613%
Splice_Site 43,262 2.408%
Frame_Shift_Ins 22,546 1.255%
missense_variant 20,241 1.127%
In_Frame_Del 11,455 0.638%
synonymous_variant 7,907 0.440%
Translation_Start_Site 3,258 0.181%
In_Frame_Ins 3,052 0.170%
stop_gained 1,573 0.088%
3_prime_UTR_variant 1,420 0.079%
Nonstop_Mutation 1,318 0.073%
exon_variant 945 0.053%
EXON 420 0.023%
5_prime_UTR_variant 395 0.022%
splice_acceptor_variant 294 0.016%
splice_region_variant 255 0.014%
3'UTR 211 0.012%
splice_donor_variant 203 0.011%
Intron 148 0.008%
5_prime_UTR_premature_start_codon_gain_variant 110 0.006%
NON_SYNONYMOUS_CODING 95 0.005%
INTRAGENIC 57 0.003%
UTR_3_PRIME 38 0.002%
SYNONYMOUS_CODING 36 0.002%
start_lost 32 0.002%
5'UTR 28 0.002%
UTR_5_PRIME 22 0.001%
stop_lost 19 0.001%
IGR 16 0.001%
stop_retained_variant 7 0.000%
STOP_GAINED 6 0.000%
initiator_codon_variant 2 0.000%
SPLICE_SITE_ACCEPTOR 2 0.000%
SYNONYMOUS_STOP 1 0.000%
5'Flank 1 0.000%
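
For reference, a table like the one above can be produced with a few lines of pandas. This is a minimal sketch, not the notebook's actual code: the function name `effect_frequencies` and the toy rows are made up for illustration, and only the `effect` column name comes from the dataset description.

```python
import pandas as pd


def effect_frequencies(mutation_df: pd.DataFrame, column: str = "effect") -> pd.DataFrame:
    """Count each mutation effect and express it as a percentage of all rows."""
    counts = mutation_df[column].value_counts()  # sorted from most to least frequent
    table = pd.DataFrame({"Count": counts, "Percent": 100 * counts / counts.sum()})
    table.index.name = "Effect"
    return table


# Toy rows for illustration only; these are not values from the real dataset
toy = pd.DataFrame(
    {"effect": ["Missense_Mutation", "Missense_Mutation", "Silent", "Nonsense_Mutation"]}
)
freq = effect_frequencies(toy)
print(freq)
```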

It appears that certain effects are duplicates, such as 5_prime_UTR_variant, 5'UTR, and UTR_5_PRIME, which, if true, points to poor standardization upstream. To improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).
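
If we go the route of our own mapping, it could be a plain dictionary applied to the effect column. The sketch below is hypothetical: only the 5'UTR group is named as a suspected duplicate in this comment, and every other equivalence in `EFFECT_SYNONYMS` is an assumption that someone with the right biology background would need to confirm.

```python
import pandas as pd

# Hypothetical synonym map. The 5'UTR entries follow the suspected duplicates
# named above; all other pairings are unverified assumptions.
EFFECT_SYNONYMS = {
    "5_prime_UTR_variant": "5'UTR",
    "UTR_5_PRIME": "5'UTR",
    "3_prime_UTR_variant": "3'UTR",
    "UTR_3_PRIME": "3'UTR",
    "missense_variant": "Missense_Mutation",
    "synonymous_variant": "Silent",
    "SYNONYMOUS_CODING": "Silent",
    "stop_gained": "Nonsense_Mutation",
    "STOP_GAINED": "Nonsense_Mutation",
}


def standardize_effects(effects: pd.Series) -> pd.Series:
    """Replace known synonyms with one label; unmapped labels pass through unchanged."""
    return effects.replace(EFFECT_SYNONYMS)


standardized = standardize_effects(pd.Series(["UTR_5_PRIME", "Silent", "5'UTR", "RNA"]))
print(standardized.tolist())
```

Labels that are not in the map are left as they are, so the map can grow incrementally as equivalences get confirmed.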

Anyways, we'll have to decide which types of effects to consider as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank — I thought you may have a better understanding than I do of the biology here. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

clairemcleod commented 8 years ago

Are we interested in preserving mutation type as a data field? If I recall correctly, we were talking about having mutation as a binary outcome variable. If this is still the case, I think there are several ways we could get there. The first would be to parse the above set of effects, potentially eliminating some. I think it would be fine to eliminate the silent mutations category, but I am unsure about the others. In the UCSC Xena documentation, they've grouped the mutation effects into four color-coded groups. It seems like this might be based on severity, although I am not familiar enough with the topic to be sure. The groups (from here) are:

Red --> Nonsense_Mutation, frameshift_variant, stop_gained, splice_acceptor_variant, splice_acceptor_variant&intron_variant, splice_donor_variant, splice_donor_variant&intron_variant, Splice_Site, Frame_Shift_Del, Frame_Shift_Ins

Blue --> splice_region_variant, splice_region_variant&intron_variant, missense, non_coding_exon_variant, missense_variant, Missense_Mutation, exon_variant, RNA, Indel, start_lost, start_gained, De_novo_Start_OutOfFrame, Translation_Start_Site, De_novo_Start_InFrame, stop_lost, Nonstop_Mutation, initiator_codon_variant, 5_prime_UTR_premature_start_codon_gain_variant, disruptive_inframe_deletion, inframe_deletion, inframe_insertion, In_Frame_Del, In_Frame_Ins

Green --> synonymous_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, 5'Flank, 3'Flank, 3'UTR, 5'UTR, Silent, stop_retained_variant

Orange --> others, SV, upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_region

A second option would be using the somatic mutation data that is already called at the gene level. Positive mutation calls reflect the effects: nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, and inframe indels. We could also implement this same calling procedure ourselves.
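
A rough sketch of what implementing that calling procedure ourselves might look like is below. The pairing of each described effect with a label from the table above is my assumption, and the `sample` and `gene` column names are hypothetical placeholders.

```python
import pandas as pd

# Assumed correspondence between the effects listed above and the table's labels
CALLED_EFFECTS = {
    "Nonsense_Mutation",
    "Missense_Mutation",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "Splice_Site",
    "Nonstop_Mutation",        # stop codon readthrough
    "Translation_Start_Site",  # change of start codon
    "In_Frame_Del",
    "In_Frame_Ins",
}


def call_gene_mutations(mutation_df: pd.DataFrame) -> pd.DataFrame:
    """Build a binary sample x gene matrix: 1 if the sample has any called effect in the gene."""
    kept = mutation_df[mutation_df["effect"].isin(CALLED_EFFECTS)]
    counts = pd.crosstab(kept["sample"], kept["gene"])
    return (counts > 0).astype(int)


# Toy rows for illustration only
toy = pd.DataFrame(
    {
        "sample": ["S1", "S1", "S1", "S2"],
        "gene": ["KRAS", "KRAS", "TP53", "TP53"],
        "effect": ["Missense_Mutation", "Missense_Mutation", "Silent", "Nonsense_Mutation"],
    }
)
matrix = call_gene_mutations(toy)
print(matrix)
```

Note that a sample whose mutations are all filtered out disappears from the matrix entirely, so it would need to be added back as a row of zeros if it should count as unmutated.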

gwaybio commented 8 years ago

Yes, I agree - I think we can toss Silent mutations.

I also think that keeping it simple would be the way to go. There are other resources that are cleaner and simpler than this TCGA Firehose data, and they may be worth exploring.

cgreene commented 8 years ago

@clairemcleod & @gwaygenomics : If you wanted to provide simple groups that would get people started, how would you combine them? We can always provide the option to drill down to a greater level of detail (e.g. any KRAS G12V mutation), but I agree with you both that a simple initial interface is optimal.

The very granular items will only be useful for mutations that are particularly common.

dhimmel commented 8 years ago

In dhimmel/cancer-data@0239cba786ba775e966434c4f9d01090b30173e6, I changed the download location for UCSC Xena data (and added version tracking). This resolved the unstandardized mutation effect types. The updated version of the frequency table is below (color refers to the Xena characterizations mentioned above):

Effect Count Percent Color
Missense_Mutation 1,132,319 59.504% Blue
Silent 474,679 24.945% Green
Nonsense_Mutation 87,104 4.577% Red
RNA 75,134 3.948% Blue
Frame_Shift_Del 46,991 2.469% Red
Splice_Site 46,477 2.442% Red
Frame_Shift_Ins 21,657 1.138% Red
In_Frame_Del 10,663 0.560% Blue
Translation_Start_Site 3,437 0.181% Blue
In_Frame_Ins 2,685 0.141% Blue
Nonstop_Mutation 1,370 0.072% Blue
3'UTR 211 0.011% Green
Intron 149 0.008% Orange
5'UTR 28 0.001% Green
IGR 16 0.001% Orange
5'Flank 1 0.000% Green

@clairemcleod, nice find with the mutation_bcgsc_gene dataset. This is a gene × sample matrix, which we could transpose to achieve our desired matrix. Unfortunately, this dataset seems to only include 3,219 samples, whereas our processed mutation matrix has 8,499 samples.
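
The transpose itself is a one-liner in pandas. A toy sketch follows; the gene and sample names are invented for illustration.

```python
import pandas as pd

# Toy gene x sample matrix; names and values are for illustration only
gene_by_sample = pd.DataFrame(
    [[1, 0, 0], [0, 1, 1]],
    index=pd.Index(["KRAS", "TP53"], name="gene"),
    columns=pd.Index(["S1", "S2", "S3"], name="sample"),
)

# Transpose so that rows are samples and columns are genes
sample_by_gene = gene_by_sample.T
n_samples = sample_by_gene.shape[0]
print(sample_by_gene)
print(n_samples, "samples")
```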

dhimmel commented 8 years ago

> a simple initial interface is optimal

I went with a simple solution. In dhimmel/cancer-data@ffe66ab26000379adcd7138b8ff39920d4692ef1, I retained only red and blue mutations (according to Xena), meaning orange and green mutations were removed. The only removed mutation effect category that was an appreciable portion of the data was "Silent" -- which I think we're all in agreement should be excluded.
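
In outline, the filter works like the sketch below. It is not the code from the commit: the function name is made up, and the effect-to-color assignments are copied from the Color column of the updated frequency table above.

```python
import pandas as pd

# Effect -> Xena color, as listed in the updated frequency table
EFFECT_COLOR = {
    "Missense_Mutation": "Blue",
    "Silent": "Green",
    "Nonsense_Mutation": "Red",
    "RNA": "Blue",
    "Frame_Shift_Del": "Red",
    "Splice_Site": "Red",
    "Frame_Shift_Ins": "Red",
    "In_Frame_Del": "Blue",
    "Translation_Start_Site": "Blue",
    "In_Frame_Ins": "Blue",
    "Nonstop_Mutation": "Blue",
    "3'UTR": "Green",
    "Intron": "Orange",
    "5'UTR": "Green",
    "IGR": "Orange",
    "5'Flank": "Green",
}


def keep_red_and_blue(mutation_df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose effect is red or blue; green, orange, and unmapped effects are dropped."""
    colors = mutation_df["effect"].map(EFFECT_COLOR)
    return mutation_df[colors.isin(["Red", "Blue"])]


# Toy rows for illustration only
toy = pd.DataFrame({"effect": ["Missense_Mutation", "Silent", "Splice_Site", "Intron"]})
filtered = keep_red_and_blue(toy)
print(filtered["effect"].tolist())
```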

I posted the mutation and expression datasets from this commit to figshare. Mutations were retained for 8,508 samples, 7,706 of which had corresponding expression data.
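
For anyone pairing the two datasets, restricting both to their shared samples could look like this sketch. It assumes both matrices are indexed by sample ID; `align_samples` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def align_samples(mutation_matrix: pd.DataFrame, expression_matrix: pd.DataFrame):
    """Return both matrices restricted to the samples present in each, in the same order."""
    shared = mutation_matrix.index.intersection(expression_matrix.index)
    return mutation_matrix.loc[shared], expression_matrix.loc[shared]


# Toy matrices for illustration only
mutation = pd.DataFrame({"KRAS": [1, 0, 1]}, index=["S1", "S2", "S3"])
expression = pd.DataFrame({"KRAS": [5.2, 7.9]}, index=["S3", "S1"])
mut_aligned, expr_aligned = align_samples(mutation, expression)
print(list(mut_aligned.index))
```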