afarrel closed this 3 years ago
Adding to this - we should probably also consider which mutations are going into this matrix - probably will want to exclude synonymous, silent, RNA, and intergenic for now. Do we know how OT will rank mutations? Is it by frequency per histology/cohort or by functional consequence + frequency? Thoughts @kgaonkar6 and @taylordm? cc @allisonheath
Sorry for the delay on this analysis.
I was wondering which mutation files in the PediatricOpenTargets/OpenPedCan-analysis v5 data release I should use for generating the mutation frequency tables. The oncoprint-landscape module uses the following files, but they are from the AlexsLemonade/OpenPBTA-analysis data release:
maf_consensus=../../data/pbta-snv-consensus-mutation.maf.tsv.gz
fusion_file=../../data/pbta-fusion-putative-oncogenic.tsv
histologies_file=../../data/pbta-histologies.tsv
focal_directory=../focal-cn-file-preparation/results
focal_cnv_file=${focal_directory}/consensus_seg_most_focal_cn_status.tsv.gz
Should we rerun the focal-cn-file-preparation module on the PediatricOpenTargets/OpenPedCan-analysis release data? The focal_cnv_file is also generated using the AlexsLemonade/OpenPBTA-analysis release data.
After figuring out which mutation files to use, I am planning to merge them the way the oncoprint-landscape module does, as follows:
maf_object <- prepare_maf_object(
  maf_df = maf_df,
  cnv_df = cnv_df,
  metadata = metadata,
  fusion_df = fusion_df
)
Then, I will generate gene summary tables for the merged mutation object using maftools::getGeneSummary, which would output a table that contains the number of mutated samples like the following. I will compute mutation frequency as MutatedSamples / total.
Hugo_Symbol | Frame_Shift_Del | Frame_Shift_Ins | In_Frame_Del | In_Frame_Ins | Missense_Mutation | Nonsense_Mutation | Nonstop_Mutation | Splice_Site | Translation_Start_Site | total | MutatedSamples | AlteredSamples |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MUC3A | 3 | 2 | 2 | 1 | 22 | 0 | 0 | 0 | 0 | 30 | 26 | 26 |
MUC5AC | 0 | 0 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | 32 | 24 | 24 |
MUC4 | 0 | 0 | 1 | 2 | 23 | 0 | 0 | 0 | 0 | 26 | 21 | 21 |
ALK | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 0 | 0 | 18 | 18 | 18 |
NBPF10 | 0 | 0 | 0 | 0 | 19 | 0 | 0 | 0 | 0 | 19 | 17 | 17 |
HLA-A | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 0 | 0 | 18 | 17 | 17 |
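A minimal sketch of the frequency calculation, for illustration only. Note that in the table above the total column is the sum of the per-classification variant counts for a gene (for MUC3A, 3 + 2 + 2 + 1 + 22 = 30), not a sample count. The sketch therefore assumes that "total" in the sentence above means the total number of samples in the group, and uses a hypothetical n_samples value that is not taken from any data release.

```r
# Three rows copied from the gene summary table above.
gene_summary <- data.frame(
  Hugo_Symbol    = c("MUC3A", "MUC5AC", "ALK"),
  total          = c(30L, 32L, 18L),  # number of variants in the gene
  MutatedSamples = c(26L, 24L, 18L),  # samples with at least one variant
  stringsAsFactors = FALSE
)

# Hypothetical group size (an assumption for this sketch); in practice it
# would be counted from the histologies file for the group being summarized.
n_samples <- 200L

# Fraction of samples in the group that carry a variant in each gene.
gene_summary$Frequency <- gene_summary$MutatedSamples / n_samples
print(gene_summary)
```

Dividing by the variant total instead would give the fraction of variants per mutated sample, which is a different quantity.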
Regarding the note:
Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.
I wonder if you could clarify the procedure to do this analysis on a mutation level.
For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, is there an analysis module or R package for generating such significance values? I saw that Figure 2 of the maftools paper shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (coverage file and covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.
The consensus maf in v5 is snv-consensus-plus-hotspots.maf.tsv.gz
Thank you for the quick reply! I will use snv-consensus-plus-hotspots.maf.tsv.gz to generate mutation frequency tables for each histology.
@kgaonkar6 I was wondering if I could directly use the following files for this analysis:
analyses/focal-cn-file-preparation/results/consensus_seg_most_focal_cn_status.tsv.gz
analyses/interaction-plots/results/gene_disease_top50.tsv
analyses/focal-cn-file-preparation/results/consensus_seg_focal_cn_recurrent_genes.tsv
My concern is that these files might have been generated using the AlexsLemonade/OpenPBTA-analysis data release. I am not sure if they are compatible with the PediatricOpenTargets/OpenPedCan-analysis data release.
You are right! Those files (and pbta-fusion-putative-oncogenic.tsv) are not yet updated to include the samples added as part of the OpenPedCan analysis, but we will be rerunning them before the next release, so I can keep you posted about that.
To me it looks like the requirement here is just the gene mutation frequency, so should we add another script to the module that generates the frequencies using an updated version of the prepare_maf_object function (making cnv_df optional so we can use it now) and maftools::getGeneSummary, as you suggested?
Thank you for the quick reply and the suggestion!
I will skip the CNV part for now. I am planning to use an empty CNV file as a placeholder, so the original code can be reused for this analysis.
I am planning to update 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.
One other note: we do not want the top 50 genes here, but all genes, as mentioned in the comment above, excluding synonymous, silent, RNA, and intergenic mutations.
It also sounds like this analysis should be done at the cohort+cancer_group level and then at the cancer_group level, separately for primary tumors and relapse/recurrent/progressive tumors, using the independent specimen lists. This analysis has morphed a bit since it was initially written.
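The grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the mutation_frequency helper, the toy data, and the cohort_A/group_X labels are all hypothetical, and the denominator is assumed to be the number of independent specimens in each group. Only the column names (Tumor_Sample_Barcode, Hugo_Symbol, Kids_First_Biospecimen_ID, cohort, cancer_group) come from the thread.

```r
# Toy MAF and histologies tables (hypothetical data).
maf_df <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_3", "BS_4"),
  Hugo_Symbol          = c("ALK", "ALK", "ALK", "TP53", "ALK"),
  stringsAsFactors = FALSE
)
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  cohort       = c("cohort_A", "cohort_A", "cohort_A", "cohort_B", "cohort_B"),
  cancer_group = c("group_X", "group_X", "group_Y", "group_X", "group_X"),
  stringsAsFactors = FALSE
)
# Stand-in for an independent primary specimen list (BS_2 is excluded).
independent_primary_ids <- c("BS_1", "BS_3", "BS_4", "BS_5")

# Hypothetical helper: per-gene mutated-sample frequency within each group.
mutation_frequency <- function(maf_df, histologies_df, keep_ids, group_cols) {
  hist <- histologies_df[histologies_df$Kids_First_Biospecimen_ID %in% keep_ids, ]
  maf  <- maf_df[maf_df$Tumor_Sample_Barcode %in% keep_ids, ]
  # Count each sample at most once per gene.
  maf <- unique(maf[, c("Tumor_Sample_Barcode", "Hugo_Symbol")])
  merged <- merge(maf, hist, by.x = "Tumor_Sample_Barcode",
                  by.y = "Kids_First_Biospecimen_ID")
  # Denominator: number of kept specimens per group.
  totals <- aggregate(hist["Kids_First_Biospecimen_ID"],
                      by = hist[group_cols], FUN = length)
  names(totals)[ncol(totals)] <- "Total_Samples"
  # Numerator: number of kept specimens with a variant in the gene.
  mutated <- aggregate(merged["Tumor_Sample_Barcode"],
                       by = merged[c("Hugo_Symbol", group_cols)], FUN = length)
  names(mutated)[ncol(mutated)] <- "Mutated_Samples"
  out <- merge(mutated, totals, by = group_cols)
  out$Frequency <- out$Mutated_Samples / out$Total_Samples
  out
}

by_cohort_group <- mutation_frequency(maf_df, histologies_df,
                                      independent_primary_ids,
                                      c("cohort", "cancer_group"))
by_group <- mutation_frequency(maf_df, histologies_df,
                               independent_primary_ids, "cancer_group")
print(by_cohort_group)
print(by_group)
```

The same helper would be called a second time with a relapse specimen list to produce the relapse/recurrent/progressive tables separately.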
For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, is there an analysis module or R package for generating such significance values? I saw that Figure 2 of the maftools paper shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (coverage file and covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.
Regarding significance, MutSigCV doesn't perform well on low Ns, and many of these histologies have a low N. I think for now, we are not going to worry about designating significance, but rather try to come up with a file of mutation frequencies, plus the additional annotation from @taylordm 's sample table. For the latter, we should probably just come up with a new ticket for creating the full annotated table.
below is the sample table: OT_SomaticTables_SNV_CNV.xlsx
@jharenza Thank you for the detailed notes. They are very helpful for implementing this analysis.
I will generate the mutation frequency tables accordingly. Then, I will annotate the SNV table of mutation frequencies according to https://github.com/PediatricOpenTargets/ticket-tracker/issues/64.
I will skip the significance part for now.
Sure thing, let me know if you have any questions along the way!
Hi @kgaonkar6. I was wondering if I could use independent-specimens.rnaseq.primary-plus.tsv from the independent-samples module to subset the fusion table.
Although the fusion table is a work in progress at https://github.com/PediatricOpenTargets/ticket-tracker/issues/7, the independent sample determination in the fusion table is related to the filtering of snv-consensus-plus-hotspots.maf.tsv.gz.
In the original code, fusion independent samples are determined by matching sample_ids to the Kids_First_Biospecimen_ID in independent-specimens.wgs.primary.tsv, so the sample_ids with more than 2 rows in histologies_df are removed from snv-consensus-plus-hotspots.maf.tsv.gz to allow unambiguous matching between WGS and RNA-seq samples. The relevant code is listed below:
# in 00-map-to-sample_id.R
# An ambiguous sample_id will have more than 2 rows associated with it in the
# histologies file when looking at tumor samples -- that means we won't be able
# to determine when an WGS/WXS assay maps to an RNA-seq assay for the purpose of
# the oncoprint plot
ambiguous_sample_ids <- histologies_df %>%
  filter(sample_type == "Tumor",
         composition == "Solid Tissue") %>%
  group_by(sample_id) %>%
  tally() %>%
  filter(n > 2) %>%
  pull(sample_id)
ambiguous_biospecimens <- histologies_df %>%
  filter(sample_id %in% ambiguous_sample_ids) %>%
  pull(Kids_First_Biospecimen_ID)
# ...
biospecimens_to_remove <- unique(c(ambiguous_biospecimens,
                                   not_tumor_biospecimens))
# Filter the files!
maf_df <- maf_df %>%
  dplyr::filter(!(Tumor_Sample_Barcode %in% biospecimens_to_remove))
# ...
I found that some sample IDs map to hundreds or even thousands of samples, so I am concerned about removing the ambiguous_biospecimens.
> histologies_df %>%
+   filter(sample_type == "Tumor",
+          composition == "Solid Tissue") %>%
+   group_by(sample_id) %>%
+   tally() %>%
+   filter(n > 2)
# A tibble: 19 x 2
   sample_id     n
   <chr>     <int>
 1 01        11073
 2 02           49
 3 03          470
 4 05            9
 5 06          394
 6 09            5
 7 7316-14       3
 8 7316-1463     4
 9 7316-158      3
10 7316-161      3
11 7316-1765     4
12 7316-178      3
13 7316-3214     3
14 7316-3230     4
15 7316-3231     6
16 7316-85       3
17 7316-87       3
18 A16915        3
19 A18777        3
hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists
Thank you for the quick reply. I will disregard the fusions and CNVs in this analysis.
Sorry for being distracted by the CNVs and fusions. I was trying to figure out the original code and make this module compatible with the full OT data release, so that we will not need to revise the code much when CNVs and fusions are available. Now, I will get the SNV mutation frequency table generated before worrying about CNVs or fusions.
Hi @jharenza. I was wondering whether the Translation_Start_Site Variant_Classification should be considered non-synonymous.
In the original code, only the following Variant_Classifications are considered non-synonymous, and Translation_Start_Site is not included.
read.maf(
  maf = maf_df,
  clinicalData = metadata,
  cnTable = cnv_df,
  removeDuplicatedVariants = FALSE,
  vc_nonSyn = c(
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "Splice_Site",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Missense_Mutation",
    "Fusion",
    "Multi_Hit",
    "Multi_Hit_Fusion",
    "Hom_Deletion",
    "Hem_Deletion",
    "Amp",
    "Del"
  )
)
However, the default non-synonymous classifications in maftools are the following, which additionally include Translation_Start_Site.
if(is.null(vc_nonSyn)){
  vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
                   "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
                   "In_Frame_Ins", "Missense_Mutation")
}
All unique Variant_Classification values in ../../data/snv-consensus-plus-hotspots.maf.tsv.gz are:
3'Flank
3'UTR
5'Flank
5'UTR
Frame_Shift_Del
Frame_Shift_Ins
IGR
In_Frame_Del
In_Frame_Ins
Intron
Missense_Mutation
Nonsense_Mutation
Nonstop_Mutation
RNA
Silent
Splice_Region
Splice_Site
Translation_Start_Site
we can add it as non-silent!
Will do. Thank you for the quick reply!
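As an illustration of the decision above, the sketch below filters a toy MAF table to non-silent variants with Translation_Start_Site included. The vc_nonsilent vector mirrors the maftools default list quoted earlier; the toy rows are hypothetical and this is not the code merged in the PR. Under this list, classifications such as Silent, RNA, IGR, Intron, Splice_Region, and the UTR/flank classes are dropped.

```r
# Non-silent classifications, mirroring the maftools default quoted above
# (includes Translation_Start_Site).
vc_nonsilent <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site",
  "Translation_Start_Site", "Nonsense_Mutation", "Nonstop_Mutation",
  "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation"
)

# Toy MAF rows (hypothetical data).
maf_df <- data.frame(
  Hugo_Symbol = c("ALK", "ALK", "TP53", "MUC4", "MUC4"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Translation_Start_Site", "Intron", "IGR"),
  stringsAsFactors = FALSE
)

# Keep only variants whose classification is in the non-silent list.
nonsilent_maf <- maf_df[maf_df$Variant_Classification %in% vc_nonsilent, ]
print(nonsilent_maf)
```

The same vector could instead be passed as vc_nonSyn to read.maf, the argument shown in the original call above.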
Closed with PR https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/45 merged.
What are the scientific goals of the analysis?
Update oncoprint-landscape module to output a gene mutation frequency TSV per histology (or cohort) for Pediatric Open Targets platform. For this, we will use all genes, not genes of interest.
Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.
What input data are required for this analysis?
Consensus MAFs
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
1 day
Who will complete the analysis (please add a GitHub handle here if relevant)?
@ewafula