Which types of mutation effects should be ignored?

dhimmel commented 8 years ago

The PANCAN_mutation dataset (online doc) contains several types of mutations under the effect column. My processing of the dataset (notebook) yielded the following mutation effects and their frequencies (as counts and percentages):

Effect	Count	Percent
Missense_Mutation	1,044,846	58.152%
Silent	432,995	24.099%
Nonsense_Mutation	81,092	4.513%
RNA	71,493	3.979%
Frame_Shift_Del	46,941	2.613%
Splice_Site	43,262	2.408%
Frame_Shift_Ins	22,546	1.255%
missense_variant	20,241	1.127%
In_Frame_Del	11,455	0.638%
synonymous_variant	7,907	0.440%
Translation_Start_Site	3,258	0.181%
In_Frame_Ins	3,052	0.170%
stop_gained	1,573	0.088%
3_prime_UTR_variant	1,420	0.079%
Nonstop_Mutation	1,318	0.073%
exon_variant	945	0.053%
EXON	420	0.023%
5_prime_UTR_variant	395	0.022%
splice_acceptor_variant	294	0.016%
splice_region_variant	255	0.014%
3'UTR	211	0.012%
splice_donor_variant	203	0.011%
Intron	148	0.008%
5_prime_UTR_premature_start_codon_gain_variant	110	0.006%
NON_SYNONYMOUS_CODING	95	0.005%
INTRAGENIC	57	0.003%
UTR_3_PRIME	38	0.002%
SYNONYMOUS_CODING	36	0.002%
start_lost	32	0.002%
5'UTR	28	0.002%
UTR_5_PRIME	22	0.001%
stop_lost	19	0.001%
IGR	16	0.001%
stop_retained_variant	7	0.000%
STOP_GAINED	6	0.000%
initiator_codon_variant	2	0.000%
SPLICE_SITE_ACCEPTOR	2	0.000%
SYNONYMOUS_STOP	1	0.000%
5'Flank	1	0.000%
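
For reference, a table like the one above can be produced with a simple tally. This is a minimal sketch, not the code from the notebook: the function name `effect_frequencies` and the toy input are made up for illustration.

```python
from collections import Counter

def effect_frequencies(effects):
    """Return (effect, count, percent) rows sorted by descending count."""
    counts = Counter(effects)
    total = sum(counts.values())
    return [
        (effect, count, 100 * count / total)
        for effect, count in counts.most_common()
    ]

# Toy input standing in for the `effect` column (not real data).
toy_effects = ["Missense_Mutation"] * 6 + ["Silent"] * 3 + ["Nonsense_Mutation"]

for effect, count, percent in effect_frequencies(toy_effects):
    print(f"{effect}\t{count:,}\t{percent:.3f}%")
```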

It appears that certain effects are duplicates, such as 5_prime_UTR_variant, 5'UTR, and UTR_5_PRIME. If so, the dataset is poorly standardized. To improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).
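
A custom mapping could be as small as a dictionary from variant spellings to one canonical label. The sketch below is hypothetical: only the 5' UTR group is named as a likely duplicate in this post, the 3' UTR group is an assumed analogue, and the choice of canonical label is arbitrary.

```python
# Hypothetical synonym map: variant spellings -> one canonical label.
# Only the 5' UTR group is named in the post; the 3' UTR group is an
# assumed analogue and should be checked before use.
CANONICAL_EFFECT = {
    "5_prime_UTR_variant": "5'UTR",
    "UTR_5_PRIME": "5'UTR",
    "3_prime_UTR_variant": "3'UTR",
    "UTR_3_PRIME": "3'UTR",
}

def standardize_effect(effect):
    """Map a raw effect label to its canonical form; unknown labels pass through."""
    return CANONICAL_EFFECT.get(effect, effect)

print(standardize_effect("UTR_5_PRIME"))
print(standardize_effect("Missense_Mutation"))
```

Labels missing from the map are returned unchanged, so the mapping can grow one confirmed synonym at a time.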

Either way, we'll have to decide which types of effects to consider as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank — I thought you may have a better understanding than I do of the biology here. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

clairemcleod commented 8 years ago

Are we interested in preserving mutation type as a data field? If I recall correctly, we were talking about having mutation as a binary outcome variable. If this is still the case, I think there are several ways we could get there. The first would be to parse the above set of effects, potentially eliminating some. I think it would be fine to eliminate the silent mutations category, but I am unsure about the others. In the UCSC Xena documentation, they've grouped the mutation effects into four color-coded groups. It seems like this might be based on severity, although I am not familiar enough with the topic to be sure. The groups (from here) are:

Red --> Nonsense_Mutation, frameshift_variant, stop_gained, splice_acceptor_variant, splice_acceptor_variant&intron_variant, splice_donor_variant, splice_donor_variant&intron_variant, Splice_Site, Frame_Shift_Del, Frame_Shift_Ins

Blue --> splice_region_variant, splice_region_variant&intron_variant, missense, non_coding_exon_variant, missense_variant, Missense_Mutation, exon_variant, RNA, Indel, start_lost, start_gained, De_novo_Start_OutOfFrame, Translation_Start_Site, De_novo_Start_InFrame, stop_lost, Nonstop_Mutation, initiator_codon_variant, 5_prime_UTR_premature_start_codon_gain_variant, disruptive_inframe_deletion, inframe_deletion, inframe_insertion, In_Frame_Del, In_Frame_Ins

Green --> synonymous_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, 5'Flank, 3'Flank, 3'UTR, 5'UTR, Silent, stop_retained_variant

Orange --> others, SV, upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_region
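
The four groups above can be encoded as a lookup table. This is a sketch, not Xena's own code: the names `XENA_GROUPS` and `xena_color` are made up, the label `upstream_gene_variant` is written with underscores on the assumption that it matches its neighbors, and sending unlisted labels to Orange is an assumption based on "others" appearing in that group.

```python
# Effect labels per color group, copied from the lists above.
XENA_GROUPS = {
    "Red": [
        "Nonsense_Mutation", "frameshift_variant", "stop_gained",
        "splice_acceptor_variant", "splice_acceptor_variant&intron_variant",
        "splice_donor_variant", "splice_donor_variant&intron_variant",
        "Splice_Site", "Frame_Shift_Del", "Frame_Shift_Ins",
    ],
    "Blue": [
        "splice_region_variant", "splice_region_variant&intron_variant",
        "missense", "non_coding_exon_variant", "missense_variant",
        "Missense_Mutation", "exon_variant", "RNA", "Indel", "start_lost",
        "start_gained", "De_novo_Start_OutOfFrame", "Translation_Start_Site",
        "De_novo_Start_InFrame", "stop_lost", "Nonstop_Mutation",
        "initiator_codon_variant",
        "5_prime_UTR_premature_start_codon_gain_variant",
        "disruptive_inframe_deletion", "inframe_deletion",
        "inframe_insertion", "In_Frame_Del", "In_Frame_Ins",
    ],
    "Green": [
        "synonymous_variant", "5_prime_UTR_variant", "3_prime_UTR_variant",
        "5'Flank", "3'Flank", "3'UTR", "5'UTR", "Silent",
        "stop_retained_variant",
    ],
    "Orange": [
        "SV", "upstream_gene_variant", "downstream_gene_variant",
        "intron_variant", "intergenic_region",
    ],
}

# Invert to effect -> color for direct lookup.
EFFECT_TO_COLOR = {
    effect: color
    for color, effects in XENA_GROUPS.items()
    for effect in effects
}

def xena_color(effect):
    """Return the color group for an effect; unlisted labels fall under Orange ("others")."""
    return EFFECT_TO_COLOR.get(effect, "Orange")

print(xena_color("Silent"))
print(xena_color("Frame_Shift_Del"))
```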

A second option would be using the somatic mutation data that is already called at the gene level. Positive mutation calls reflect the effects: nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, and inframe indels. We could also implement this same calling procedure ourselves.

gwaybio commented 8 years ago

Yes, I agree - I think we can toss Silent mutations.

I also think that keeping it simple would be the way to go. There are other resources available that are cleaner/simpler than this data available from TCGA Firehose that may be worth exploring.

cgreene commented 8 years ago

@clairemcleod & @gwaygenomics : If you wanted to provide simple groups that would get people started, how would you combine them? We can always provide the option to drill down to a greater level of detail (e.g. any KRAS G12V mutation), but I agree with you both that a simple initial interface is optimal.

The very granular items will only be useful for mutations that are particularly common.

dhimmel commented 8 years ago

In dhimmel/cancer-data@0239cba786ba775e966434c4f9d01090b30173e6, I changed the download location for UCSC Xena data (and added version tracking). This resolved the unstandardized mutation effect types. The updated version of the frequency table is below (color refers to the Xena characterizations mentioned above):

Effect	Count	Percent	Color
Missense_Mutation	1,132,319	59.504%	Blue
Silent	474,679	24.945%	Green
Nonsense_Mutation	87,104	4.577%	Red
RNA	75,134	3.948%	Blue
Frame_Shift_Del	46,991	2.469%	Red
Splice_Site	46,477	2.442%	Red
Frame_Shift_Ins	21,657	1.138%	Red
In_Frame_Del	10,663	0.560%	Blue
Translation_Start_Site	3,437	0.181%	Blue
In_Frame_Ins	2,685	0.141%	Blue
Nonstop_Mutation	1,370	0.072%	Blue
3'UTR	211	0.011%	Green
Intron	149	0.008%	Orange
5'UTR	28	0.001%	Green
IGR	16	0.001%	Orange
5'Flank	1	0.000%	Green

@clairemcleod, nice find with the mutation_bcgsc_gene dataset. This is a gene × sample matrix, which we could transpose to achieve our desired matrix. Unfortunately, this dataset seems to only include 3,219 samples, whereas our processed mutation matrix has 8,499 samples.

dhimmel commented 8 years ago

> a simple initial interface is optimal

I went with a simple solution. In dhimmel/cancer-data@ffe66ab26000379adcd7138b8ff39920d4692ef1, I retained only red and blue mutations (according to Xena), meaning orange and green mutations were removed. The only removed mutation effect category that was an appreciable portion of the data was "Silent" -- which I think we're all in agreement should be excluded.
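
The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the record layout `(sample_id, gene, effect)`, the sample IDs, and the function name `mutated_pairs` are hypothetical, and only the color assignments are taken from the updated frequency table above.

```python
# Color assignments copied from the updated frequency table in this thread.
EFFECT_COLOR = {
    "Missense_Mutation": "Blue", "Silent": "Green", "Nonsense_Mutation": "Red",
    "RNA": "Blue", "Frame_Shift_Del": "Red", "Splice_Site": "Red",
    "Frame_Shift_Ins": "Red", "In_Frame_Del": "Blue",
    "Translation_Start_Site": "Blue", "In_Frame_Ins": "Blue",
    "Nonstop_Mutation": "Blue", "3'UTR": "Green", "Intron": "Orange",
    "5'UTR": "Green", "IGR": "Orange", "5'Flank": "Green",
}
RETAINED_COLORS = {"Red", "Blue"}

def mutated_pairs(records):
    """Return the set of (sample, gene) pairs with at least one retained mutation."""
    return {
        (sample, gene)
        for sample, gene, effect in records
        if EFFECT_COLOR.get(effect) in RETAINED_COLORS
    }

# Hypothetical records: (sample_id, gene, effect). Sample IDs are made up.
records = [
    ("sample_1", "KRAS", "Missense_Mutation"),
    ("sample_1", "TP53", "Silent"),
    ("sample_2", "TP53", "Nonsense_Mutation"),
    ("sample_2", "TP53", "Frame_Shift_Del"),
]

pairs = mutated_pairs(records)
samples = sorted({sample for sample, _, _ in records})
genes = sorted({gene for _, gene, _ in records})

# Binary sample x gene matrix: 1 if any retained mutation, else 0.
matrix = {s: {g: int((s, g) in pairs) for g in genes} for s in samples}
print(matrix)
```

Multiple retained mutations in the same gene and sample collapse to a single 1, which matches the binary outcome discussed earlier in the thread.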

I posted the mutation and expression datasets from this commit to figshare. Mutations were retained for 8,508 samples, 7,706 of which had corresponding expression data.

cognoma / cancer-data

Which types of mutation effects should be ignored? #2