PoisonAlien / maftools

Summarize, Analyze and Visualize MAF files from TCGA or in-house studies.
http://bioconductor.org/packages/release/bioc/html/maftools.html
MIT License
450 stars 222 forks source link

tcgaCompare: definition of mutational load #366

Closed ishida-md closed 4 years ago

ishida-md commented 5 years ago

Dear PoisonAlien

I have a question regarding tcgaCompare(). In tcga_cohort.txt.gz, I see a table of mutation counts for each sample. How are they calculated? Looking at the original maf, I guess these numbers are # of missense + # of nonsense muts.

tcgaCompare() plots sum of # INS/DELS/missense/nonsense counts for a given cohort. I am afraid that tcgaCompare() inflates the mutation counts for the cohort to be analyzed if my guess is correct.

PoisonAlien commented 5 years ago

Hi, Yes youre correct. Mutations for TCGA cohort consist of both synonymous and non-synonymous variants. These variants/counts are derived from TCGA MC3 project.

Could you give a bit more detail on inflated mutation counts? Total mutations burden included all sort of non-synonymous variants if I understand correctly.

ishida-md commented 5 years ago

Thank you for your response.

I could be wrong, but TCGA cohort seems to count only missense + nonsense. In contrary, for a given cohort, tcgaCompare() counts any mutations (indels, splice site, missense...).

Since the number of indels/splice site muts are relatively low compared to missense muts, this discrepancy does not hurt much. Nonetheless, it feels a little odd to use different standards for a given cohort and TCGA.

PoisonAlien commented 5 years ago

I am not completely sure either. I will have to read more. Please do let me know if you find any reference. Another way is to make an argument to the function on what kind of variants should be included in comparison - which I think helps in this case.

ishida-md commented 5 years ago

As you may be aware, there is no standard definition for TMB. Assuming tcgaCompare() is designed for exome data, counting only missense muts would be the common practice according to the review article below (table 1).

Chan TA, Yarchoan M, Jaffee E, et al. Development of tumor mutation burden as an immunotherapy biomarker: utility for the oncology clinic. Ann Oncol 2019; 30:44–56.

carladosanjos commented 5 years ago

Dear PoisonAlien,

Is there a way to set y axis as 'number of mutations/Mb' (with corresponding values of median mutations) instead of 'log10(variant per sample)?

PoisonAlien commented 5 years ago

Unfortunately it can't be changed for now, but I am happy to rename it. I'll update it soon.

Edit: Done. You will have to re-install from GitHub for the update.

carladosanjos commented 5 years ago

Dear PoisonAlien,

Many thanks for this - I did update it and it is working (almost) fine. The data.table is finally showing the Median_Mutations in Mb and the ylab plot is is displaying "Mutations per MB". However the y axis scale is still displaying the log10 values, so the median lines do not correspond to the median in MB; I mean, for instance the SKCM median line (red) should be around 8 (not 1). Can the y axis values be edited accordingly?

Thanks again!

PoisonAlien commented 5 years ago

Ah!! I see. I'm sorry, I misread the issue. I will correct it and let you know..

PoisonAlien commented 5 years ago

Hi @carladosanjos , Can you check this again, and let me know if this helps?

tomsuntom commented 5 years ago

Hi PoisonAlien, first want to say amazing package you have developed!! It is odd that per the Alexandrov paper figure https://www.nature.com/articles/nature12477/figures/1 lung squamous and lung adenocarcinoma essentially have ~10 mutations/Mb, very similar to melanoma. However, tcgaCompare says melanoma is 8.29 mutations/mb, but lung squamous only 4.3 and lung adeno 3.8 mutations/mb. This is very discrepant from the Alexandrov paper. Is there something wrong with the tcgaCompare algorithm?

PoisonAlien commented 5 years ago

Hi, Thank you and I am glad that you find the package useful. Regarding the discrepancies, I believe its due to the source from which the tcga mutation load were estimated. For example, for maftools I use the all non-synonymous variants from TCGA MC3 project. However, I am not sure on how they estimated TMB..

clersdom commented 3 years ago

Hi @PoisonAlien Having read the previous comments, it should be assumed that TMB calculation strategy withtcgaCompare is consistent between TCGA and the input data given, and that TMB is calculated taking into account only non-synonymous variants for the TCGA cohorts as well as for our input variants if we give as input a curated list of pathogenic non-synonymous variants? Juts wondering if TMB calculation will exclude any synonymous variant if given as input? Many thanks!

PoisonAlien commented 3 years ago

Hi @clersdom

It will only use non-synonymous variants (stored in data slots of MAF) and will not consider any synonymous variants (stored in maf.silent slot).

TCGA TMBs are estimated using below non-synonymous variants:

"Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
"Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
"In_Frame_Ins", "Missense_Mutation"

Edit: Remember that it also depends on how you define non-synonymous variants. By default read.maf will consider above categories as non-syn variants but, if you have your own list of non-syn variant classifications (i.e, using vc_nonSyn argument in read.maf), it might affect a bit.

clersdom commented 3 years ago

Great thanks @PoisonAlien That is very helpful.

majeedmj commented 3 years ago

Dear PoisonAlien, Many thanks for this amazing package ! i have some question please help . As per Alexandrov et al paper (https://www.nature.com/articles/nature12477) scale in Y axis showing 0 - 1000 , but when i used below code i am getting (-2 to +3 ) when i mentioned - capture_size = 50 or any value . My professor wondering , how is it possible -2 mutation ? can you please explain why scale is not exactly as of Alexandrov et al paper ? how can i get y- axis with range 0 - 1000 ? For your reference i have attached both figures (Paper and tcgacomapre() analysis ) above and below respectively. it would be great and highly appreciable

Alexandrov TMB

TMB

if you could help I used below code : tcgaCompare(maf = maf, cohortName = "MESO", tcga_cohorts = c("SKCM", "LUSC", "LUAD", "BLCA", "COAD", "DLBC", "STAD", "ESCA", "HNSC", "READ", "CESC", "LIHC", "UCEC", "OV", "KIRP", "GBM", "KIRC", "UCS", "SARC", "BRCA", "CHOL", "PAAD", "LGG", "ACC","PRAD", "KICH", "TGCT", "THYM", "LAML", "UVM", "THCA", "PCPG"), col = c("#1A5276", "red"), medianCol = "black", bg_col = c("#F8F9F9", "#808080"), capture_size = 50)

PoisonAlien commented 3 years ago

Hello @majeedmj

The y-axis is on a log10 scale. so -1 would correspond to 0.1 and -2 to 0.01 You can convert to natural scale as below,

> 10^-2
[1] 0.01
> 10^-1
[1] 0.1
> 10^0
[1] 1
> 10^1
[1] 10
> 10^2
[1] 100
majeedmj commented 3 years ago

Dear PoisonAlien, Thank you for your kind reply. i have another small query plz help : Question -1: Whether number of mutation showing are for concerned to exon region only or for whole genome ? Question -2: TMB issue param

As per above figure ; whether all four figures are same (Y- scale ; Number of mutation). first figure : i just kept default parameters -----> (Y - scale shows 0 to 5) second figure : i just added capture size =50 and logscale = True ----> (Y - scale shows -2 to 3) Third figure : i just added capture size =50 and logscale = False ----> (Y - scale shows 0 to 500 ... and also Y axis label still shows log10 ....why? ) Fouth figure : Alexandrov paper (whether all above figures Mutation number are same as this paper (Y scales))

can you confirm whether all figures are exactly same as that of Alexandrov paper ?

I really appreciate for your kind reply .

majeedmj commented 3 years ago

Dear PoisonAlien,

Can you please give suggestion on above query ? Thanks.

PoisonAlien commented 3 years ago

Hi, The log10 label in the third fig, must have been a bug. I will correct it. I don't know what you mean by exactly the same, because it depends on the tcga variants that are used, capture sizes, and if any hypermutated samples are removed by them, etc. But in terms of visualization second figure comes close to the paper.

majeedmj commented 3 years ago

Dear PoisonAlien,

In the First figure what it indicates 0 to 5 scale , What exactly is it ?

PoisonAlien commented 3 years ago

Hi, I can not seem to reproduce the first image. What version of the package are you using? Could you please post your sessionInfo and exact commands?

majeedmj commented 3 years ago

Thank you PoisonAlien,

please find below information

maftools version = 2.2.10

code : tcgaCompare(maf = maf, cohortName = "MESO", tcga_cohorts = c("SKCM", "LUSC", "LUAD", "BLCA", "COAD", "DLBC", "STAD", "ESCA", "HNSC", "READ", "CESC", "LIHC", "UCEC", "OV", "KIRP", "GBM", "KIRC", "UCS", "SARC", "BRCA", "CHOL", "PAAD", "LGG", "ACC","PRAD", "KICH", "TGCT", "THYM", "LAML", "UVM", "THCA", "PCPG"), col = c("#1A5276", "#1A5276"), medianCol = "black", bg_col = c("#F8F9F9", "#808080"))

Session Info 👍 :

sessionInfo() R version 3.6.2 (2019-12-12) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 19.04

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] DT_0.17 dplyr_1.0.4 SummarizedExperiment_1.16.1 [4] DelayedArray_0.12.3 BiocParallel_1.20.1 matrixStats_0.58.0
[7] Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[10] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0
[13] TCGAbiolinks_2.14.1 maftools_2.2.10

loaded via a namespace (and not attached): [1] readxl_1.3.1 backports_1.2.1 aroma.light_3.16.0
[4] BiocFileCache_1.10.2 plyr_1.8.6 selectr_0.4-2
[7] splines_3.6.2 ggplot2_3.3.3 sva_3.34.0
[10] digest_0.6.27 htmltools_0.5.1.1 foreach_1.5.1
[13] magrittr_2.0.1 memoise_2.0.0 doParallel_1.0.16
[16] openxlsx_4.2.3 limma_3.42.2 Biostrings_2.54.0
[19] readr_1.4.0 annotate_1.64.0 wordcloud_2.6
[22] R.utils_2.10.1 askpass_1.1 prettyunits_1.1.1
[25] jpeg_0.1-8.1 colorspace_2.0-0 blob_1.2.1
[28] rvest_0.3.6 rappdirs_0.3.3 ggrepel_0.9.1
[31] haven_2.3.1 xfun_0.21 crayon_1.4.1
[34] RCurl_1.98-1.2 jsonlite_1.7.2 genefilter_1.68.0
[37] zoo_1.8-8 survival_3.2-10 iterators_1.0.13
[40] glue_1.4.2 survminer_0.4.8 gtable_0.3.0
[43] zlibbioc_1.32.0 XVector_0.26.0 car_3.0-10
[46] abind_1.4-5 scales_1.1.1 DESeq_1.38.0
[49] DBI_1.1.1 edgeR_3.28.1 rstatix_0.7.0
[52] ggthemes_4.2.4 Rcpp_1.0.6 xtable_1.8-4
[55] progress_1.2.2 foreign_0.8-81 bit_4.0.4
[58] km.ci_0.5-2 htmlwidgets_1.5.3 httr_1.4.2
[61] RColorBrewer_1.1-2 ellipsis_0.3.1 pkgconfig_2.0.3
[64] XML_3.99-0.3 R.methodsS3_1.8.1 dbplyr_2.1.0
[67] locfit_1.5-9.4 tidyselect_1.1.0 rlang_0.4.10
[70] AnnotationDbi_1.48.0 munsell_0.5.0 cellranger_1.1.0
[73] tools_3.6.2 cachem_1.0.4 cli_2.3.0
[76] downloader_0.4 generics_0.1.0 RSQLite_2.2.3
[79] broom_0.7.4 stringr_1.4.0 fastmap_1.1.0
[82] knitr_1.31 bit64_4.0.5 zip_2.1.1
[85] survMisc_0.5.5 purrr_0.3.4 packrat_0.5.0
[88] nlme_3.1-152 EDASeq_2.20.0 R.oo_1.24.0
[91] postlogic_0.1.0.1 xml2_1.3.2 biomaRt_2.42.1
[94] rstudioapi_0.13 compiler_3.6.2 curl_4.3
[97] png_0.1-7 parsetools_0.1.3 testthat_3.0.2
[100] ggsignif_0.6.0 tibble_3.0.6 geneplotter_1.64.0
[103] stringi_1.5.3 GenomicFeatures_1.38.2 forcats_0.5.1
[106] lattice_0.20-41 Matrix_1.3-2 KMsurv_0.1-5
[109] vctrs_0.3.6 purrrogress_0.1.1 pillar_1.4.7
[112] lifecycle_1.0.0 data.table_1.13.6 bitops_1.0-6
[115] rtracklayer_1.46.0 R6_2.5.0 latticeExtra_0.6-29
[118] hwriter_1.3.2 ShortRead_1.44.3 gridExtra_2.3
[121] rio_0.5.16 pkgcond_0.1.0 codetools_0.2-18
[124] assertthat_0.2.1 openssl_1.4.3 GenomicAlignments_1.22.1 [127] Rsamtools_2.2.3 GenomeInfoDbData_1.2.2 mgcv_1.8-33
[130] hms_1.0.0 testextra_0.1.0.1 grid_3.6.2
[133] tidyr_1.1.2 carData_3.0-4 ggpubr_0.4.0

PoisonAlien commented 3 years ago

Ah! Could you please update the package? 2.2.10 is quite old, the current BC version is 2.6.05.

majeedmj commented 3 years ago

Thank you ,

How can i get precise number of mutations per megabase

PoisonAlien commented 3 years ago

This is why you will have to update the package. It returns all the details with some stats. edit: Could you please use the discussions feature if its not about bug reports?

majeedmj commented 3 years ago

Dear PoisonAlien, I really appreciate your quick reply ! Thank you . Yes sure i will use Discussions. I got Number median Mutaion by saving to a variable tcgacompare () function.