PoisonAlien / maftools

Summarize, Analyze and Visualize MAF files from TCGA or in-house studies.
http://bioconductor.org/packages/release/bioc/html/maftools.html
MIT License
442 stars 218 forks source link

Save Silent Variants for downstream analysis #955

Closed AnandamidaCBD closed 1 year ago

AnandamidaCBD commented 1 year ago

Describe the issue First of all, I wanna thanks the developers for such a marvellous tool. I need to point that I am really naive in bioinformatics, so I will be really sorry if I am making silly mistakes.

I did generate merged MAFs (with silent variants) with a shared pipeline. As I saw, you just got to specify a variants vector in the argument _vcnonSyn of read-maf function. I had multiple MAF files (from different VC) per sample, so I generate a consensus MAF for this dataset. We owned this raw data and I succeeded creating CONSENSUS with and without silent variants (file sizes are different).

However, here begins my issue. We received a really large consensus MAF from a different dataset, but I do not have the raw data. I tried to run this command but the output silent MAF file has the same size as the original "large consensus MAF" (and the downstream signature analysis exhibit the same results for both MAF):

write_main_maf <- function(maf, filename) {

  data.table::fwrite(x = data.table::rbindlist(list(maf@data, maf@maf.silent), use.names = TRUE, fill = TRUE),
                     file = paste(filename,'.maf', sep=''),
                     sep='\t',
                     quote = FALSE, row.names = FALSE)
}

xx<- read.maf(maf = xx, vc_nonSyn = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site","Nonsense_Mutation",
                                      "Nonstop_Mutation", "In_Frame_Del","In_Frame_Ins", "Missense_Mutation","Silent","Start_Codon_Del", "Stop_Codon_Ins",
                                      "3'UTR", "5'UTR", "mature_miRNA_variant", "exon_variant", "non_coding_exon_variant", 
                                      "non_coding_transcript_exon_variant", "non_coding_transcript_variant", 
                                      "nc_transcript_variant","Amp","Del"))

setwd("~/Borja/data_2/MIBC_WES_data")
write_main_maf(xx, "MIBC_CONSENSUS2_S")```

Previously, I used this code for merging, write and add silent variants to MAFs:

```library(knitr)
library(tidyverse)
library(maftools)

vc_Variants=c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site","Nonsense_Mutation",
               "Nonstop_Mutation", "In_Frame_Del","In_Frame_Ins", "Missense_Mutation","Silent","Start_Codon_Del", "Stop_Codon_Ins",
               "3'UTR", "5'UTR", "mature_miRNA_variant", "exon_variant", "non_coding_exon_variant", 
               "non_coding_transcript_exon_variant", "non_coding_transcript_variant", 
               "nc_transcript_variant","Amp","Del")

nonsyn_readMAF <- function(file, TCGA.check=F) {
  read.maf(file, 
           isTCGA = TCGA.check,
           verbose = F,
           vc_nonSyn=vc_Variants)
}

nonsyn_mergeMAF <- function(files, TCGA.check=F, folder="") {

  filenames <- paste(folder, files, sep="")

  merge_mafs(filenames,
             vc_nonSyn=vc_Variants,
             isTCGA = TCGA.check)
}

write_main_maf <- function(maf, filename) {

  data.table::fwrite(x = data.table::rbindlist(list(maf@data, maf@maf.silent), use.names = TRUE, fill = TRUE),
                     file = paste(filename,'.maf', sep=''),
                     sep='\t',
                     quote = FALSE, row.names = FALSE)
}

outersect <- function(x, y) {

  sort(c(x[!x%in%y],
         y[!y%in%x]))
}

# folder containing the MAF files
## Balbas et al.
setwd("~/Borja/data/NMIBC_WES_data/Balbas_et_al/MAFs/")
# set up pattern
pattern <- "*.maf"

# list files for each variant caller
muse.files <- list.files("./muse/", pattern = pattern)
mutect2.files <- list.files("./mutect2/", pattern = pattern)
somaticSniper.files <- list.files("./somatic_sniper/", pattern = pattern)
strelka.files <- list.files("./strelka/", pattern = pattern)
varscan2.files <- list.files("./varscan/", pattern = pattern)

# merge MAFs
Balbas.muse <- nonsyn_mergeMAF(muse.files, TCGA.check = F, folder = './muse/')
Balbas.mutect2 <- nonsyn_mergeMAF(mutect2.files, TCGA.check = F, folder = './mutect2/')
Balbas.somaticSniper <- nonsyn_mergeMAF(somaticSniper.files, TCGA.check = F, folder = './somatic_sniper/')
Balbas.strelka <- nonsyn_mergeMAF(strelka.files, TCGA.check = F, folder = './strelka/')
Balbas.varscan2 <- nonsyn_mergeMAF(varscan2.files, TCGA.check = F, folder = './varscan/')

# set directory
setwd("../Consensus/")
# write maf
write_main_maf(Balbas.muse, "Balbas.muse")
write_main_maf(Balbas.mutect2, "Balbas.mutect2")
write_main_maf(Balbas.somaticSniper, "Balbas.somaticSniper")
write_main_maf(Balbas.strelka, "Balbas.strelka")
write_main_maf(Balbas.varscan2, "Balbas.varscan2")

### Generate consensus:
# store data
Balbas_data <- rbind(Balbas.muse@data, Balbas.mutect2@data, Balbas.somaticSniper@data, Balbas.strelka@data, Balbas.varscan2@data)
# define columns that will serve as ID for unique row identification
ID <- c("Hugo_Symbol", "Transcript_ID", "Tumor_Sample_Barcode", "Chromosome", "Start_Position", "Variant_Classification")
# remove first two columns (source MAFs)
Balbas_data <- Balbas_data[,-c(1)]
# add count for duplicates with designated columnsas unique ID
Balbas_data.count <- Balbas_data %>% 
  add_count(Hugo_Symbol, Transcript_ID, Tumor_Sample_Barcode, Chromosome, Start_Position, Variant_Classification)
# filter to result in consensus of 2 and 3 variant callers
Balbas_data.consensus.2 <- Balbas_data.count %>% filter(n > 1)
save(Balbas_data.consensus.2, file = 'Balbas_data.consensus.2.rda')
Balbas_data.consensus.3 <- Balbas_data.count %>% filter(n > 2)
save(Balbas_data.consensus.3, file = 'Balbas_data.consensus.3.rda')

##################################
setwd("~/Borja/data/metadata_form/")
Balbas_metadato= read.csv(file = "Balbas_clinical_data.csv")
setwd("~/Borja/data_2/NMIBC_WES_data/Balbas_et_al/Consensus")

syn_vars = Balbas_data.consensus.3[Variant_Classification %in% "Silent"]
x_1 <- MAF(nonSyn = Balbas_data.consensus.3, syn = syn_vars, clinicalData = Balbas_metadato, verbose = TRUE)
write_main_maf(x_1, "Balbas.consensus.3_SILENT")

syn_vars_2 = Balbas_data.consensus.2[Variant_Classification %in% "Silent"]
xx_1 <- MAF(nonSyn = Balbas_data.consensus.2, syn= syn_vars_2, clinicalData = Balbas_metadato, verbose = TRUE)
write_main_maf(xx_1, "Balbas.consensus.2_SILENT")

Session info Run sessionInfo() and post the output below


R version 4.2.0 (2022-04-22)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /local/ECG/miniconda3/envs/maftool/lib/libopenblasp-r0.3.23.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] maftools_2.14.0 lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2   tidyverse_2.0.0
[12] knitr_1.43     

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-3 pillar_1.9.0       compiler_4.2.0     tools_4.2.0        lattice_0.21-8     lifecycle_1.0.3    gtable_0.3.3       timechange_0.2.0   pkgconfig_2.0.3   
[10] rlang_1.1.1        Matrix_1.5-4.1     cli_3.6.1          rstudioapi_0.14    yaml_2.3.7         xfun_0.39          withr_2.5.0        generics_0.1.3     vctrs_0.6.2       
[19] hms_1.1.3          grid_4.2.0         tidyselect_1.2.0   data.table_1.14.8  glue_1.6.2         R6_2.5.1           DNAcopy_1.72.3     fansi_1.0.4        survival_3.5-5    
[28] tzdb_0.4.0         magrittr_2.0.3     splines_4.2.0      scales_1.2.1       colorspace_2.1-0   utf8_1.2.3         stringi_1.7.6      munsell_0.5.0     

MIBC_CONS Balbas

AnandamidaCBD commented 1 year ago

I am sorry for my horrible aesthetic :/ It's my first issue on github

PoisonAlien commented 1 year ago

Hi, No worries and thanks for the issue. I am bit swamped with work today, I will try to look into it as soon as I can.

PoisonAlien commented 1 year ago

Hi,

Have you been able to solve the issue? Also to note that vc_nonSyn argument is to specify variants with consequences. Your list of variant classifications - vc_nonSyn- contains silent variants as well. Essentially you are considering everything as variants with functional consequences.

AnandamidaCBD commented 1 year ago

I modified and added this line to my code, and then I ran again my pre-designed maftool functions:

xX_data <- xX@data %>%
  select(-Source_MAF) %>%
  add_count(Hugo_Symbol, Transcript_ID, Tumor_Sample_Barcode, Chromosome, Start_Position, Variant_Classification)

I had this error before: Error in as_tibble(): ! Column name Source_MAF must not be duplicated. Use .name_repair to specify repair. Caused by error in repaired_names(): ! Names must be unique. ✖ These names are duplicated:

In summary, the line of code takes the data table from the MAF object xX, removes the "Source_MAF" column, and then adds a new column "n" representing the count of rows for each unique combination of Hugo_Symbol, Transcript_ID, Tumor_Sample_Barcode, Chromosome, Start_Position, and Variant_Classification

AnandamidaCBD commented 1 year ago

Taking advantage of this opportunity, I would like to ask you about the "maf.silent" table.

I am a little bit confused about the difference between the "maf.silent" table and the silent variants I may find in the "Variant.classification" table when I specify them with "vc_nonSyn".

What I mean is, when I use "read.maf(vc_nonSyn = c(silent, etc))" and obtain, for instance, 95k silent variants, I then realized that the number of silent variants shown in the "Variant.Classification" table is significantly lower. However, the total number of variants I got in the "Data" table is considerably higher when I used "vc_nonSyn = c(silent, etc))", as I would expect.

I would like to thank you for your help and patience in advance!

PoisonAlien commented 1 year ago

Hi,

I am having hard time following the issue. I will still try to answer..

  1. My guess for Source_MAF" at locations 133 and 134 - it seems the header line is duplicated. How did you concatenate your mafs (I guess cat command from terminal?
  2. You can define your own set of Variant Classifications as non-synonymous with vc_nonSyn argument. Rest of the mutations will be considered as silent. You are including everything as non synonymous variants with your vc_nonSyndefinition (for example Silent in your list). I would recommend just to use read.maf in default mode and not define your own list of vc_nonSyn.
AnandamidaCBD commented 1 year ago

Apologies for not providing a clear explanation earlier. I managed to resolve the issue I was facing. Initially, I was attempting to extract tables from a MAF object, and after incorporating some new code, I was able to extract the IDs as data frames, which I could then write and save.

One of my main objectives was to obtain silent variants and store them in the MAF file to have a larger number of variants for downstream analysis. I would like to take this opportunity to ask about the MAF.silent table.


MAF_silent <- list.files(path = "~/Borja/data_2/MIBC_WES_data", pattern = "MIBC")
MAF <- list.files(path = "~/Borja/data_2/MIBC_WES_data", pattern = "MIBC")

MAF_silent <- read.maf(maf= MAF_silent, vc_nonSyn = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site","Nonsense_Mutation",
                                                                   "Nonstop_Mutation", "In_Frame_Del","In_Frame_Ins", "Missense_Mutation","Silent","Start_Codon_Del", "Stop_Codon_Ins",
                                                                   "3'UTR", "5'UTR", "mature_miRNA_variant", "exon_variant", "non_coding_exon_variant", 
                                                                   "non_coding_transcript_exon_variant", "non_coding_transcript_variant", 
                                                                   "nc_transcript_variant","Amp","Del"))

> sum(MAF_silent@variant.classification.summary$Silent)
[1] 63792
> sum(MAF_silent@variant.classification.summary$total)
[1] 247231

MAF <- read.maf(MAF)
>
-Reading
-Validating
--Removed 63792 duplicated variants
-Silent variants: 74723 
-Summarizing
--Possible FLAGS among top ten genes:
  TTN
  MUC16
  SYNE1
  HMCN1
-Processing clinical data
--Missing clinical data

-Finished in 20.2s elapsed (27.9s cpu) 

> sum(MAF@variant.classification.summary$total)
[1] 172508 

When I read the MAF data without considering silent variants and defining nonSyn argument, I noticed discrepancies in the numbers. The total number of variants in MAF_silent increased, as I expected (247,231 vs. 172,508). However, I am puzzled by the fact that there are 63,792 silent variants in MAF_silent (Variant.Classification), whereas a larger number of variants (74,723) are shown in the same MAF without defining the nonSyn argument.

I would greatly appreciate any insights or explanations for this discrepancy. Thank you in advance for your time and assistance!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.