d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

Proposed Analysis: Create mutation frequencies for Ped OT platform #8

Closed afarrel closed 3 years ago

afarrel commented 3 years ago

What are the scientific goals of the analysis?

Update oncoprint-landscape module to output a gene mutation frequency TSV per histology (or cohort) for Pediatric Open Targets platform. For this, we will use all genes, not genes of interest.

Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.

What input data are required for this analysis?

Consensus MAFs

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

1 day

Who will complete the analysis (please add a GitHub handle here if relevant)?

@ewafula

jharenza commented 3 years ago

Adding to this - we should probably also consider which mutations are going into this matrix - probably will want to exclude synonymous, silent, RNA, and intergenic for now. Do we know how OT will rank mutations? Is it by frequency per histology/cohort or by functional consequence + frequency? Thoughts @kgaonkar6 and @taylordm? cc @allisonheath

logstar commented 3 years ago

Sorry for the delay on this analysis.

I was wondering which mutation files in the PediatricOpenTargets/OpenPedCan-analysis v5 data release I should use to generate the mutation frequency tables. The oncoprint-landscape module uses the following files, but they are from the AlexsLemonade/OpenPBTA-analysis data release.

maf_consensus=../../data/pbta-snv-consensus-mutation.maf.tsv.gz
fusion_file=../../data/pbta-fusion-putative-oncogenic.tsv
histologies_file=../../data/pbta-histologies.tsv
focal_directory=../focal-cn-file-preparation/results
focal_cnv_file=${focal_directory}/consensus_seg_most_focal_cn_status.tsv.gz

Should we rerun the focal-cn-file-preparation module on the PediatricOpenTargets/OpenPedCan-analysis release data? The focal_cnv_file is also generated from the AlexsLemonade/OpenPBTA-analysis release data.

After figuring out which mutation files to use, I am planning to merge them the same way the oncoprint-landscape module does, as follows:

maf_object <- prepare_maf_object(
  maf_df = maf_df,
  cnv_df = cnv_df,
  metadata = metadata,
  fusion_df = fusion_df
)

(link to the code)

Then, generate gene summary tables for the merged mutation object using maftools::getGeneSummary, which would output a table that contains the number of mutated samples like the following. I will compute mutation frequency as MutatedSamples / total.

Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins Missense_Mutation Nonsense_Mutation Nonstop_Mutation Splice_Site Translation_Start_Site total MutatedSamples AlteredSamples
MUC3A 3 2 2 1 22 0 0 0 0 30 26 26
MUC5AC 0 0 0 0 32 0 0 0 0 32 24 24
MUC4 0 0 1 2 23 0 0 0 0 26 21 21
ALK 0 0 0 0 18 0 0 0 0 18 18 18
NBPF10 0 0 0 0 19 0 0 0 0 19 17 17
HLA-A 0 0 0 0 18 0 0 0 0 18 17 17
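
A minimal sketch of the frequency calculation, under one stated assumption: the `total` column in the table above is the sum of the per-class variant counts for each gene (for MUC3A, 3 + 2 + 2 + 1 + 22 = 30), so this sketch divides MutatedSamples by the number of samples in the histology instead. The sample count `n_samples_in_group` and the helper `mutation_frequency` are hypothetical and are not taken from the module.

```r
# Rows copied from the gene summary table above (only the columns needed here).
gene_summary <- data.frame(
  Hugo_Symbol = c("MUC3A", "MUC5AC", "MUC4", "ALK", "NBPF10", "HLA-A"),
  total = c(30, 32, 26, 18, 19, 18),
  MutatedSamples = c(26, 24, 21, 18, 17, 17),
  stringsAsFactors = FALSE
)

# Hypothetical number of samples in the histology; NOT a value from the thread.
n_samples_in_group <- 100

# Hypothetical helper: frequency = mutated samples / samples in the group.
mutation_frequency <- function(summary_df, n_samples) {
  stopifnot(n_samples > 0, all(summary_df$MutatedSamples <= n_samples))
  summary_df$Total_samples <- n_samples
  summary_df$Frequency <- summary_df$MutatedSamples / n_samples
  summary_df
}

freq_table <- mutation_frequency(gene_summary, n_samples_in_group)
print(freq_table[, c("Hugo_Symbol", "MutatedSamples", "Total_samples", "Frequency")])
```

If "total" was instead meant as the `total` column itself, the denominator would be a variant count rather than a sample count, which is worth confirming before the tables are generated.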

Regarding the note:

> Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.

I wonder if you could clarify the procedure to do this analysis on a mutation level.

For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, is there an analysis module or R package for generating such significance values? I saw that Figure 2 of the maftools paper shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (a coverage file and a covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.

kgaonkar6 commented 3 years ago

The consensus maf in v5 is snv-consensus-plus-hotspots.maf.tsv.gz

logstar commented 3 years ago

> The consensus maf in v5 is snv-consensus-plus-hotspots.maf.tsv.gz

Thank you for the quick reply! I will use snv-consensus-plus-hotspots.maf.tsv.gz to generate mutation frequency tables for each histology.

logstar commented 3 years ago

@kgaonkar6 I was wondering if I could directly use the following files for this analysis:

My concern is that these files might be generated using AlexsLemonade/OpenPBTA-analysis data release. I am not sure if they are compatible with PediatricOpenTargets/OpenPedCan-analysis data release.

kgaonkar6 commented 3 years ago

You are right! Those files (and pbta-fusion-putative-oncogenic.tsv) have not yet been updated to include the samples added as part of the OpenPedCan analysis, but we will be rerunning them before the next release. I can keep you posted about that.

To me it looks like the requirement here is just the gene mutation frequency, so should we add another script to the module that generates the frequencies using an updated version of the prepare_maf_object function (making cnv_df optional so we can use it now) and maftools::getGeneSummary, as you suggested?

logstar commented 3 years ago

> You are right! Those files and (pbta-fusion-putative-oncogenic.tsv) are not updated to include samples added as part of OpenPedCan analysis yet but we will be rerunning before the next release. So I can keep you posted about that.
>
> To me it looks like the requirement here is just the gene mutation frequency so should we add another script in the module to just generate the frequencies using an updated version of prepare_maf_object function ( make cnv_df optional so we can use it now ) and maftools::getGeneSummary as you suggested ?

Thank you for the quick reply and the suggestion!

I will skip the CNV part for now. I am planning to use an empty CNV file as a placeholder, so the original code can be reused for this analysis.

I am planning to update 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.

jharenza commented 3 years ago

> I am planning to update the 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.

One other note: we do not want to limit this to the top 50 genes, but to include all genes, as mentioned in the comment above, excluding

> synonymous, silent, RNA, and intergenic

It also sounds like this analysis should be done on a cohort+cancer_group and then cancer_group level, separately for primary tumors and relapse/recurrent/progressive tumors using the independent specimen lists. This analysis has morphed a bit since initially written.
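
The grouping described above could be sketched roughly as follows, in base R on made-up data. Everything here is an assumption for illustration: the toy IDs and group labels, the helper name `frequency_by_group`, the mapping of "synonymous, silent, RNA, and intergenic" onto the MAF classes `Silent`, `RNA`, and `IGR`, and the choice of denominator (all independent specimens in the group). It is not the module's implementation.

```r
# Toy histology table; all IDs and labels are invented for illustration.
histologies <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  cohort = c("A", "A", "A", "B", "B"),
  cancer_group = c("X", "X", "Y", "X", "X"),
  stringsAsFactors = FALSE
)
# Toy independent specimen list (for example, the primary tumor list).
independent_ids <- c("BS_1", "BS_2", "BS_3", "BS_4")
# Toy MAF rows with only the columns used below.
maf <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  Hugo_Symbol = c("ALK", "ALK", "MUC4", "MUC4", "ALK", "ALK"),
  Variant_Classification = c("Missense_Mutation", "Frame_Shift_Del", "Silent",
                             "Missense_Mutation", "IGR", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-group gene mutation frequency among independent specimens.
frequency_by_group <- function(maf, histologies, independent_ids, group_cols,
                               excluded = c("Silent", "RNA", "IGR")) {
  hist_sub <- histologies[histologies$Kids_First_Biospecimen_ID %in% independent_ids, ]
  hist_sub$group <- do.call(paste, c(unname(as.list(hist_sub[group_cols])), sep = "|"))
  keep <- maf$Tumor_Sample_Barcode %in% hist_sub$Kids_First_Biospecimen_ID &
    !(maf$Variant_Classification %in% excluded)
  # One row per (sample, gene), so a sample with several hits is counted once.
  sample_gene <- unique(maf[keep, c("Tumor_Sample_Barcode", "Hugo_Symbol")])
  sample_gene <- merge(sample_gene,
                       hist_sub[, c("Kids_First_Biospecimen_ID", "group")],
                       by.x = "Tumor_Sample_Barcode",
                       by.y = "Kids_First_Biospecimen_ID")
  counts <- as.data.frame(
    table(group = sample_gene$group, Hugo_Symbol = sample_gene$Hugo_Symbol),
    responseName = "Mutated_samples", stringsAsFactors = FALSE
  )
  counts <- counts[counts$Mutated_samples > 0, ]
  totals <- table(hist_sub$group)  # denominator: independent specimens per group
  counts$Total_samples <- as.integer(totals[counts$group])
  counts$Frequency <- counts$Mutated_samples / counts$Total_samples
  counts <- counts[order(counts$group, counts$Hugo_Symbol), ]
  rownames(counts) <- NULL
  counts
}

# cohort + cancer_group level, then cancer_group level alone.
by_cohort_and_group <- frequency_by_group(maf, histologies, independent_ids,
                                          c("cohort", "cancer_group"))
by_group <- frequency_by_group(maf, histologies, independent_ids, "cancer_group")
print(by_cohort_and_group)
print(by_group)
```

Running the same helper once with the primary list and once with the relapse/recurrent/progressive list would give the two separate sets of tables.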

> For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, are there any analysis module or R package for generating such significance values? I saw the maftools paper Figure 2 shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (coverage file and covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.

Regarding significance, MutSigCV doesn't perform well on low Ns, and many of these histologies have a low N. I think for now, we are not going to worry about designating significance, but rather try to come up with a file of mutation frequencies, plus the additional annotation from @taylordm's sample table. For the latter, we should probably just create a new ticket for building the full annotated table.

Below is the sample table: OT_SomaticTables_SNV_CNV.xlsx

logstar commented 3 years ago

@jharenza Thank you for the detailed notes. They are very helpful for implementing this analysis.

I will generate the mutation frequency tables accordingly. Then, I will annotate the SNV table of mutation frequencies according to https://github.com/PediatricOpenTargets/ticket-tracker/issues/64.

I will skip the significance part for now.

jharenza commented 3 years ago

Sure thing, let me know if you have any questions along the way!

logstar commented 3 years ago

Hi @kgaonkar6. I was wondering if I could use independent-specimens.rnaseq.primary-plus.tsv from the independent-samples module to subset the fusion table.

Although the fusion table is a work in progress at https://github.com/PediatricOpenTargets/ticket-tracker/issues/7, the independent sample determination in the fusion table is related to the filtering of snv-consensus-plus-hotspots.maf.tsv.gz.

In the original code, fusion independent samples are determined by matching sample_ids to the Kids_First_Biospecimen_ID in independent-specimens.wgs.primary.tsv, so the sample_ids with more than 2 rows in histologies_df are removed from snv-consensus-plus-hotspots.maf.tsv.gz to allow unambiguous matching between WGS and RNA-seq samples. The relevant code is listed below:

# in 00-map-to-sample_id.R
# An ambiguous sample_id will have more than 2 rows associated with it in the
# histologies file when looking at tumor samples -- that means we won't be able
# to determine when an WGS/WXS assay maps to an RNA-seq assay for the purpose of
# the oncoprint plot
ambiguous_sample_ids <- histologies_df %>%
  filter(sample_type == "Tumor",
         composition == "Solid Tissue") %>%
  group_by(sample_id) %>%
  tally() %>%
  filter(n > 2) %>%
  pull(sample_id)

ambiguous_biospecimens <- histologies_df %>%
  filter(sample_id %in% ambiguous_sample_ids) %>%
  pull(Kids_First_Biospecimen_ID)
# ...
biospecimens_to_remove <- unique(c(ambiguous_biospecimens,
                                   not_tumor_biospecimens))

# Filter the files!
maf_df <- maf_df %>%
  dplyr::filter(!(Tumor_Sample_Barcode %in% biospecimens_to_remove))
# ...

I found that some sample IDs map to hundreds or even thousands of samples, so I am concerned about removing the ambiguous_biospecimens.

> histologies_df %>%
+     filter(sample_type == "Tumor",
+            composition == "Solid Tissue") %>%
+     group_by(sample_id) %>%
+     tally() %>%
+     filter(n > 2)
# A tibble: 19 x 2
   sample_id     n
   <chr>     <int>
 1 01        11073
 2 02           49
 3 03          470
 4 05            9
 5 06          394
 6 09            5
 7 7316-14       3
 8 7316-1463     4
 9 7316-158      3
10 7316-161      3
11 7316-1765     4
12 7316-178      3
13 7316-3214     3
14 7316-3230     4
15 7316-3231     6
16 7316-85       3
17 7316-87       3
18 A16915        3
19 A18777        3

jharenza commented 3 years ago

hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists

logstar commented 3 years ago

> hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists

Thank you for the quick reply. I will disregard the fusions and CNVs in this analysis.

Sorry for being distracted by the CNVs and fusions. I was trying to understand the original code and make this module compatible with the full OT data release, so that we will not need to revise the code much when CNVs and fusions are available. Now, I will get the SNV mutation frequency table generated before worrying about CNVs or fusions.

logstar commented 3 years ago

Hi @jharenza. I was wondering whether the Translation_Start_Site Variant_Classification should be considered non-synonymous.

In the original code, only the following Variant_Classifications are considered non-synonymous, and Translation_Start_Site is not included.

    read.maf(
      maf = maf_df,
      clinicalData = metadata,
      cnTable = cnv_df,
      removeDuplicatedVariants = FALSE,
      vc_nonSyn = c(
        "Frame_Shift_Del",
        "Frame_Shift_Ins",
        "Splice_Site",
        "Nonsense_Mutation",
        "Nonstop_Mutation",
        "In_Frame_Del",
        "In_Frame_Ins",
        "Missense_Mutation",
        "Fusion",
        "Multi_Hit",
        "Multi_Hit_Fusion",
        "Hom_Deletion",
        "Hem_Deletion",
        "Amp",
        "Del"
      )
    )

However, the default non-synonymous classifications in maftools are the following, which additionally include Translation_Start_Site.

  if(is.null(vc_nonSyn)){
    vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
                     "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
                     "In_Frame_Ins", "Missense_Mutation")
  }

All unique Variant_Classification values in ../../data/snv-consensus-plus-hotspots.maf.tsv.gz are:

3'Flank
3'UTR
5'Flank
5'UTR
Frame_Shift_Del
Frame_Shift_Ins
IGR
In_Frame_Del
In_Frame_Ins
Intron
Missense_Mutation
Nonsense_Mutation
Nonstop_Mutation
RNA
Silent
Splice_Region
Splice_Site
Translation_Start_Site

jharenza commented 3 years ago

we can add it as non-silent!
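
For reference, a small sketch of how the Variant_Classification values listed above would split if the maftools default non-silent set (which already contains Translation_Start_Site) were used. The variable names are hypothetical, and treating everything outside that set as excluded is an assumption: the thread only names synonymous, silent, RNA, and intergenic mutations explicitly.

```r
# maftools default non-silent classes, as quoted above.
non_silent_classes <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
  "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
  "In_Frame_Ins", "Missense_Mutation"
)

# Unique Variant_Classification values reported above for the v5 consensus MAF.
observed_classes <- c(
  "3'Flank", "3'UTR", "5'Flank", "5'UTR", "Frame_Shift_Del", "Frame_Shift_Ins",
  "IGR", "In_Frame_Del", "In_Frame_Ins", "Intron", "Missense_Mutation",
  "Nonsense_Mutation", "Nonstop_Mutation", "RNA", "Silent", "Splice_Region",
  "Splice_Site", "Translation_Start_Site"
)

# Classes that would count toward mutation frequency, and those that would not.
kept_classes <- intersect(observed_classes, non_silent_classes)
dropped_classes <- setdiff(observed_classes, non_silent_classes)

cat("Kept:   ", paste(kept_classes, collapse = ", "), "\n")
cat("Dropped:", paste(dropped_classes, collapse = ", "), "\n")
```

Note that under this set the UTR, flank, intron, and Splice_Region classes would also be left out of the counts, in addition to Silent, RNA, and IGR.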

logstar commented 3 years ago

> we can add it as non-silent!

Will do. Thank you for the quick reply!

logstar commented 3 years ago

Closed with PR https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/45 merged.