MSKCC-Epi-Bio / gnomeR

Package to wrangle and visualize genomic data in R
https://mskcc-epi-bio.github.io/gnomeR/

Create text binary matrix which retains specific mutation type in matrix #263

Open karissawhiting opened 1 year ago

karissawhiting commented 1 year ago

@edrill suggested this. This can be useful for oncoprints and more in-depth mutation specific analyses.

The relevant information seems to be in the following MAF columns:

table(gnomeR::mutations$mutationType, gnomeR::mutations$variantType)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#>                         
#>                          DEL DNP INS ONP SNP
#>   Frame_Shift_Del         89   0   0   0   0
#>   Frame_Shift_Ins          0   0  38   0   0
#>   In_Frame_Del            23   0   0   0   0
#>   Missense_Mutation        0   5   0   0 479
#>   Nonsense_Mutation        0   1   1   0  57
#>   Nonstop_Mutation         0   0   0   1   0
#>   Splice_Site              6   0   0   0  24
#>   Translation_Start_Site   0   0   0   0   1

@edrill - what type of information from the above do you maintain in your matrix? Also, does this only apply to mutations or fusions/CNA as well?

@michaelcurry1123 - I'm thinking this could be a separate new function that doesn't rely on the other version of the binary matrix, but I'm open to other ideas. Not sure what to call it yet. I think we should add a check of possible levels (e.g. missense, splice, etc...)

edrill commented 1 year ago

The method I use to make matrices for Oncoprints works directly from an aggregated alteration file (combining maf, cna and fusion files). Can share code if of interest.

karissawhiting commented 1 year ago

@edrill Yes please share code if you can. Thank you

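michaelcurry1123 commented 1 year ago

library(dplyr)

# Keep the first mutation recorded for each sample/gene pair, then spread
# genes into columns; cells hold the mutation type, or "None" when absent.
mut2 <- gnomeR::mutations %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  tidyr::pivot_wider(id_cols = "sampleId", names_from = "hugoGeneSymbol",
                     values_from = "mutationType", values_fill = "None")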

This is the basis of the code. It is very similar to create_gene_binary, and it should be possible to use its internal functions to create this text matrix. @karissawhiting what are your thoughts?

karissawhiting commented 1 year ago

@michaelcurry1123 I wonder what happens when (if) you have two types of mutations on the same gene...

For the numeric binary matrix it wouldn't matter but I could see that being a problem here. Maybe we create a vector for that cell and throw a warning?

michaelcurry1123 commented 1 year ago

Can they have two of the same type of mutation (e.g. two fusions on the same gene), or would it be a mutation, fusion or CNA? @karissawhiting

karissawhiting commented 1 year ago

You could have two types of mutations on the same gene and we'd want to represent both in an oncoprint. If there are two of the same type of mutation (rare I think?) we would just count 1 (like presence/absence)
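A minimal sketch of the rule described above, on made-up data: repeated mutations of the same type count once, and a sample/gene pair with more than one distinct type is flagged with a warning. The toy values and the "Multiple Mutations" label are assumptions for illustration; only the column names follow gnomeR::mutations as used in this thread.

```r
library(dplyr)

# Toy long-format mutation data (hypothetical values).
toy_mut <- tibble::tribble(
  ~sampleId, ~hugoGeneSymbol, ~mutationType,
  "S1",      "TP53",          "Missense_Mutation",
  "S1",      "TP53",          "Missense_Mutation",  # same type twice: counted once
  "S2",      "TP53",          "Missense_Mutation",
  "S2",      "TP53",          "Frame_Shift_Del",    # two different types
  "S2",      "KRAS",          "Missense_Mutation"
)

collapsed <- toy_mut %>%
  # Presence/absence per type: drop exact repeats of sample/gene/type.
  distinct(sampleId, hugoGeneSymbol, mutationType) %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  summarise(
    n_types = n(),
    alteration = if (n() > 1) "Multiple Mutations" else mutationType[1],
    .groups = "drop"
  )

# Tell the user when a cell could not be represented by a single type.
n_multi <- sum(collapsed$n_types > 1)
if (n_multi > 0) {
  warning(n_multi, " sample/gene pair(s) have more than one mutation type; ",
          "filter the input beforehand to show a specific type.")
}
```

Here S1/TP53 stays "Missense_Mutation", while S2/TP53 becomes "Multiple Mutations" and triggers the warning.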

michaelcurry1123 commented 1 year ago

ok no problem did that!

edrill commented 1 year ago

Chiming in with my 2 cents --> What about the case where there is a copy number or fusion and a mutation? We definitely would want to show that on the oncoprint. Or if there are 2 mutations that aren't the same type, e.g. missense and frameshift - it is not possible to show both - how do you decide what to show? I usually look for those instances and replace with "Multiple muts."

edrill commented 1 year ago

I use this general code to keep all mutations/alterations in: dplyr::summarise( type_pre = paste(sort(alteration), collapse = ";"), )
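For context, here is how that summarise() call might sit in a full pipeline. This is only a sketch: the toy table, its values, and the grouping columns are assumptions standing in for an aggregated alteration file, which is not shown in this thread.

```r
library(dplyr)

# Hypothetical aggregated alteration table (mutations, CNA and fusions stacked).
toy_alt <- tibble::tribble(
  ~sampleId, ~hugoGeneSymbol, ~alteration,
  "S1",      "TP53",          "MISSENSE",
  "S1",      "TP53",          "DELETION",
  "S1",      "KRAS",          "MISSENSE",
  "S2",      "TP53",          "FUSION"
)

type_long <- toy_alt %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  # Keep every alteration for the sample/gene pair, sorted and ";"-separated.
  summarise(type_pre = paste(sort(alteration), collapse = ";"), .groups = "drop")
```

Sorting before pasting means the same combination of alterations always produces the same string, regardless of row order in the input.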

karissawhiting commented 1 year ago

@edrill Thanks for the input! Yes, we are definitely going to have separate columns for mutation/fus/CNA so that all can be shown on the oncoprint. We were going back and forth on how to display multiple mutation types in one sample on the same gene. I like the idea of having a "multiple mutations" annotation in the matrix and throwing a warning when this comes up in someone's data, telling them to filter the data beforehand themselves if they want one mutation type displayed over the other.

Esther, we also want to create a data check/dictionary of the possible values the mutation type column can accept (e.g. missense, truncating, etc.). Do you have a list of these anywhere?

Thanks!

edrill commented 1 year ago

OK - I was assuming you were starting out with the oncoPrint() function from the ComplexHeatmap package, which requires e.g. "DELETION; MISSENSE;" in the same cell to show both on the oncoprint.
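To illustrate the cell format mentioned above, a small base R helper can split a ";"-separated cell back into its individual alteration types. The function name split_alterations is hypothetical; something of this shape could probably be passed as the get_type argument of ComplexHeatmap::oncoPrint(), but that wiring is an assumption and is not shown here.

```r
# Split one matrix cell such as "DELETION; MISSENSE;" into its alteration types.
# split_alterations is a hypothetical helper name, not part of any package.
split_alterations <- function(x) {
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  parts[nzchar(parts)]  # drop empty pieces left by trailing separators
}

cell_types <- split_alterations("DELETION; MISSENSE;")
single_type <- split_alterations("MISSENSE")
```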

In terms of mutation type values, I don't have a comprehensive list. But I just looked back at my two projects with largest sample sizes and these were the categories included:

3'Flank, 3'UTR, 5'Flank, Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Intron, Missense_Mutation, Nonsense_Mutation, Nonstop_Mutation, nonsynonymous_SNV, Silent, Splice_Region, Splice_Site, Translation_Start_Site
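The data check discussed above could look roughly like the following. This is a sketch, not gnomeR code: check_mutation_types and allowed_mutation_types are hypothetical names, and the allowed values are simply the categories listed in this comment, which were described as not comprehensive.

```r
# Categories taken from the list above (not a comprehensive dictionary).
allowed_mutation_types <- c(
  "3'Flank", "3'UTR", "5'Flank", "Frame_Shift_Del", "Frame_Shift_Ins",
  "In_Frame_Del", "In_Frame_Ins", "Intron", "Missense_Mutation",
  "Nonsense_Mutation", "Nonstop_Mutation", "nonsynonymous_SNV", "Silent",
  "Splice_Region", "Splice_Site", "Translation_Start_Site"
)

# Hypothetical helper: stop with an informative error when a mutation type
# column contains a value outside the dictionary.
check_mutation_types <- function(x, allowed = allowed_mutation_types) {
  unknown <- setdiff(unique(x), allowed)
  if (length(unknown) > 0) {
    stop("Unrecognized mutation type(s): ",
         paste(unknown, collapse = ", "), call. = FALSE)
  }
  invisible(x)
}

# Valid input passes silently.
check_mutation_types(c("Missense_Mutation", "Splice_Site"))

# Invalid input raises an error; here the message is captured for inspection.
res <- tryCatch(
  check_mutation_types(c("Missense_Mutation", "missense")),
  error = function(e) conditionMessage(e)
)
```

The comparison is case-sensitive, so "missense" is rejected even though "Missense_Mutation" is allowed; whether to normalize case first is a design choice left open here.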

michaelcurry1123 commented 1 year ago

library(dplyr)

# Mutations: keep each distinct type once per sample/gene, then label
# sample/gene pairs with more than one type as "Multiple Mutations".
mut2 <- gnomeR::mutations %>%
  group_by(sampleId, hugoGeneSymbol, mutationType) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  summarise(alteration = paste(mutationType, collapse = ",")) %>%
  ungroup() %>%
  mutate(alteration = ifelse(grepl(",", alteration), "Multiple Mutations", alteration))

# CNA: same de-duplication, but multiple values stay comma-separated.
cna2 <- gnomeR::cna %>%
  group_by(sampleId, hugoGeneSymbol, alteration) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  summarise(alteration = paste(alteration, collapse = ","))

# Fusions: keyed on the first partner gene (site1HugoSymbol).
fus2 <- gnomeR::sv %>%
  group_by(sampleId, site1HugoSymbol, variantClass) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  group_by(sampleId, site1HugoSymbol) %>%
  summarise(alteration = paste(variantClass, collapse = ",")) %>%
  rename(hugoGeneSymbol = site1HugoSymbol)

# Stack all three sources, merge per sample/gene, and pivot to one column per gene.
allgene <- bind_rows(cna2, fus2, mut2) %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  summarise(alteration = paste(alteration, collapse = ",")) %>%
  ungroup() %>%
  tidyr::pivot_wider(id_cols = c("sampleId"), names_from = "hugoGeneSymbol",
                     values_from = "alteration", values_fill = NA_character_)

Here is some code I came up with that handles multiple mutations in the mutation file and then combines everything together. It's a very rough draft, so if this isn't quite on the right track, let me know!

karissawhiting commented 1 year ago

I think this code looks good. To recap, we discussed:

michaelcurry1123 commented 1 year ago

@karissawhiting I think this code gets us some of the way there: it is wide and addresses multiple fusions and CNAs. I will have to look into the other stuff, though. I also have some questions about the .del and .fus endings; it might be easier to chat through.

library(dplyr)

# Mutations: one column per gene (no suffix).
mut2 <- gnomeR::mutations %>%
  group_by(sampleId, hugoGeneSymbol, mutationType) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  summarise(alteration = paste(mutationType, collapse = ",")) %>%
  ungroup() %>%
  mutate(alteration = ifelse(grepl(",", alteration), "Multiple Mutations", alteration)) %>%
  tidyr::pivot_wider(id_cols = c("sampleId"), names_from = "hugoGeneSymbol",
                     values_from = "alteration", values_fill = NA_character_)

# CNA: one column per gene, suffixed with ".cna".
cna2 <- gnomeR::cna %>%
  group_by(sampleId, hugoGeneSymbol, alteration) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  group_by(sampleId, hugoGeneSymbol) %>%
  summarise(alteration = paste(alteration, collapse = ",")) %>%
  ungroup() %>%
  mutate(alteration = ifelse(grepl(",", alteration), "Multiple CNAs", alteration)) %>%
  tidyr::pivot_wider(id_cols = c("sampleId"), names_from = "hugoGeneSymbol",
                     names_glue = "{hugoGeneSymbol}.cna",
                     values_from = "alteration", values_fill = NA_character_)

# Fusions: one column per first partner gene, suffixed with ".fus".
fus2 <- gnomeR::sv %>%
  group_by(sampleId, site1HugoSymbol, variantClass) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  group_by(sampleId, site1HugoSymbol) %>%
  summarise(alteration = paste(variantClass, collapse = ",")) %>%
  rename(hugoGeneSymbol = site1HugoSymbol) %>%
  ungroup() %>%
  mutate(alteration = ifelse(grepl(",", alteration), "Multiple Fusions", alteration)) %>%
  tidyr::pivot_wider(id_cols = c("sampleId"), names_from = "hugoGeneSymbol",
                     names_glue = "{hugoGeneSymbol}.fus",
                     values_from = "alteration", values_fill = NA_character_)

# Join the three wide tables on sampleId.
allgene <- Reduce(function(x, y) full_join(x, y, by = "sampleId"), list(cna2, fus2, mut2))

michaelcurry1123 commented 1 year ago

@karissawhiting right now the fusion and cna files have .fus and .cna at the end. We were going to keep that as is and change it if we needed to add .amp or .del to the suffix.

hfuchs5 commented 1 year ago

Go with long format instead of wide; we can then make another internal function to pivot wide if needed.
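One way this suggestion could be sketched: keep a single long table with one row per sample/gene/alteration type, and pivot to wide only on request. Everything below is illustrative; the alterationType column, the toy values, and the function name pivot_alterations_wide are assumptions, and the column suffixes just mirror the .cna/.fus convention used earlier in the thread.

```r
library(dplyr)

# Hypothetical long-format result: one row per sample/gene/alteration type.
toy_long <- tibble::tribble(
  ~sampleId, ~hugoGeneSymbol, ~alterationType, ~alteration,
  "S1",      "TP53",          "mut",           "Missense_Mutation",
  "S1",      "ERBB2",         "cna",           "Amplification",
  "S2",      "TP53",          "mut",           "Multiple Mutations",
  "S2",      "ALK",           "fus",           "Fusion"
)

# Hypothetical internal helper: mutation columns keep the bare gene name,
# other alteration types get a ".<type>" suffix.
pivot_alterations_wide <- function(long) {
  long %>%
    mutate(column = ifelse(alterationType == "mut",
                           hugoGeneSymbol,
                           paste0(hugoGeneSymbol, ".", alterationType))) %>%
    tidyr::pivot_wider(id_cols = "sampleId", names_from = "column",
                       values_from = "alteration", values_fill = NA_character_)
}

wide <- pivot_alterations_wide(toy_long)
```

Keeping the long table as the primary output avoids committing to a column-naming scheme early; suffixes such as .amp or .del would only need to be decided inside the pivot helper.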