Closed ishida-md closed 4 years ago
Hi, Yes youre correct. Mutations for TCGA cohort consist of both synonymous and non-synonymous variants. These variants/counts are derived from TCGA MC3 project.
Could you give a bit more detail on inflated mutation counts? Total mutations burden included all sort of non-synonymous variants if I understand correctly.
Thank you for your response.
I could be wrong, but TCGA cohort seems to count only missense + nonsense. In contrary, for a given cohort, tcgaCompare() counts any mutations (indels, splice site, missense...).
Since the number of indels/splice site muts are relatively low compared to missense muts, this discrepancy does not hurt much. Nonetheless, it feels a little odd to use different standards for a given cohort and TCGA.
I am not completely sure either. I will have to read more. Please do let me know if you find any reference. Another way is to make an argument to the function on what kind of variants should be included in comparison - which I think helps in this case.
As you may be aware, there is no standard definition for TMB. Assuming tcgaCompare() is designed for exome data, counting only missense muts would be the common practice according to the review article below (table 1).
Chan TA, Yarchoan M, Jaffee E, et al. Development of tumor mutation burden as an immunotherapy biomarker: utility for the oncology clinic. Ann Oncol 2019; 30:44–56.
Dear PoisonAlien,
Is there a way to set y axis as 'number of mutations/Mb' (with corresponding values of median mutations) instead of 'log10(variant per sample)?
Unfortunately it can't be changed for now, but I am happy to rename it. I'll update it soon.
Edit: Done. You will have to re-install from GitHub for the update.
Dear PoisonAlien,
Many thanks for this - I did update it and it is working (almost) fine. The data.table is finally showing the Median_Mutations in Mb and the ylab plot is is displaying "Mutations per MB". However the y axis scale is still displaying the log10 values, so the median lines do not correspond to the median in MB; I mean, for instance the SKCM median line (red) should be around 8 (not 1). Can the y axis values be edited accordingly?
Thanks again!
Ah!! I see. I'm sorry, I misread the issue. I will correct it and let you know..
Hi @carladosanjos , Can you check this again, and let me know if this helps?
Hi PoisonAlien, first want to say amazing package you have developed!! It is odd that per the Alexandrov paper figure https://www.nature.com/articles/nature12477/figures/1 lung squamous and lung adenocarcinoma essentially have ~10 mutations/Mb, very similar to melanoma. However, tcgaCompare says melanoma is 8.29 mutations/mb, but lung squamous only 4.3 and lung adeno 3.8 mutations/mb. This is very discrepant from the Alexandrov paper. Is there something wrong with the tcgaCompare algorithm?
Hi,
Thank you and I am glad that you find the package useful. Regarding the discrepancies, I believe its due to the source from which the tcga mutation load were estimated. For example, for maftools
I use the all non-synonymous variants from TCGA MC3 project. However, I am not sure on how they estimated TMB..
Hi @PoisonAlien
Having read the previous comments, it should be assumed that TMB calculation strategy withtcgaCompare
is consistent between TCGA and the input data given, and that TMB is calculated taking into account only non-synonymous variants for the TCGA cohorts as well as for our input variants if we give as input a curated list of pathogenic non-synonymous variants? Juts wondering if TMB calculation will exclude any synonymous variant if given as input? Many thanks!
Hi @clersdom
It will only use non-synonymous variants (stored in data
slots of MAF) and will not consider any synonymous variants (stored in maf.silent
slot).
TCGA TMBs are estimated using below non-synonymous variants:
"Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
"Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
"In_Frame_Ins", "Missense_Mutation"
Edit: Remember that it also depends on how you define non-synonymous variants. By default read.maf
will consider above categories as non-syn variants but, if you have your own list of non-syn variant classifications (i.e, using vc_nonSyn
argument in read.maf
), it might affect a bit.
Great thanks @PoisonAlien That is very helpful.
Dear PoisonAlien, Many thanks for this amazing package ! i have some question please help . As per Alexandrov et al paper (https://www.nature.com/articles/nature12477) scale in Y axis showing 0 - 1000 , but when i used below code i am getting (-2 to +3 ) when i mentioned - capture_size = 50 or any value . My professor wondering , how is it possible -2 mutation ? can you please explain why scale is not exactly as of Alexandrov et al paper ? how can i get y- axis with range 0 - 1000 ? For your reference i have attached both figures (Paper and tcgacomapre() analysis ) above and below respectively. it would be great and highly appreciable
if you could help
I used below code :
tcgaCompare(maf = maf, cohortName = "MESO", tcga_cohorts = c("SKCM", "LUSC", "LUAD", "BLCA", "COAD", "DLBC", "STAD", "ESCA", "HNSC", "READ", "CESC", "LIHC", "UCEC", "OV", "KIRP", "GBM", "KIRC", "UCS", "SARC", "BRCA", "CHOL", "PAAD", "LGG", "ACC","PRAD", "KICH", "TGCT", "THYM", "LAML", "UVM", "THCA", "PCPG"), col = c("#1A5276", "red"), medianCol = "black", bg_col = c("#F8F9F9", "#808080"), capture_size = 50)
Hello @majeedmj
The y-axis is on a log10 scale. so -1
would correspond to 0.1 and -2
to 0.01
You can convert to natural scale as below,
> 10^-2
[1] 0.01
> 10^-1
[1] 0.1
> 10^0
[1] 1
> 10^1
[1] 10
> 10^2
[1] 100
Dear PoisonAlien, Thank you for your kind reply. i have another small query plz help : Question -1: Whether number of mutation showing are for concerned to exon region only or for whole genome ? Question -2:
As per above figure ; whether all four figures are same (Y- scale ; Number of mutation). first figure : i just kept default parameters -----> (Y - scale shows 0 to 5) second figure : i just added capture size =50 and logscale = True ----> (Y - scale shows -2 to 3) Third figure : i just added capture size =50 and logscale = False ----> (Y - scale shows 0 to 500 ... and also Y axis label still shows log10 ....why? ) Fouth figure : Alexandrov paper (whether all above figures Mutation number are same as this paper (Y scales))
can you confirm whether all figures are exactly same as that of Alexandrov paper ?
I really appreciate for your kind reply .
Dear PoisonAlien,
Can you please give suggestion on above query ? Thanks.
Hi,
The log10
label in the third fig, must have been a bug. I will correct it.
I don't know what you mean by exactly the same
, because it depends on the tcga variants that are used, capture sizes, and if any hypermutated samples are removed by them, etc.
But in terms of visualization second figure comes close to the paper.
Dear PoisonAlien,
In the First figure what it indicates 0 to 5 scale , What exactly is it ?
Hi, I can not seem to reproduce the first image. What version of the package are you using? Could you please post your sessionInfo and exact commands?
Thank you PoisonAlien,
please find below information
maftools version = 2.2.10
code : tcgaCompare(maf = maf, cohortName = "MESO", tcga_cohorts = c("SKCM", "LUSC", "LUAD", "BLCA", "COAD", "DLBC", "STAD", "ESCA", "HNSC", "READ", "CESC", "LIHC", "UCEC", "OV", "KIRP", "GBM", "KIRC", "UCS", "SARC", "BRCA", "CHOL", "PAAD", "LGG", "ACC","PRAD", "KICH", "TGCT", "THYM", "LAML", "UVM", "THCA", "PCPG"), col = c("#1A5276", "#1A5276"), medianCol = "black", bg_col = c("#F8F9F9", "#808080"))
Session Info 👍 :
sessionInfo() R version 3.6.2 (2019-12-12) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 19.04
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DT_0.17 dplyr_1.0.4 SummarizedExperiment_1.16.1
[4] DelayedArray_0.12.3 BiocParallel_1.20.1 matrixStats_0.58.0
[7] Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[10] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0
[13] TCGAbiolinks_2.14.1 maftools_2.2.10
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.2.1 aroma.light_3.16.0
[4] BiocFileCache_1.10.2 plyr_1.8.6 selectr_0.4-2
[7] splines_3.6.2 ggplot2_3.3.3 sva_3.34.0
[10] digest_0.6.27 htmltools_0.5.1.1 foreach_1.5.1
[13] magrittr_2.0.1 memoise_2.0.0 doParallel_1.0.16
[16] openxlsx_4.2.3 limma_3.42.2 Biostrings_2.54.0
[19] readr_1.4.0 annotate_1.64.0 wordcloud_2.6
[22] R.utils_2.10.1 askpass_1.1 prettyunits_1.1.1
[25] jpeg_0.1-8.1 colorspace_2.0-0 blob_1.2.1
[28] rvest_0.3.6 rappdirs_0.3.3 ggrepel_0.9.1
[31] haven_2.3.1 xfun_0.21 crayon_1.4.1
[34] RCurl_1.98-1.2 jsonlite_1.7.2 genefilter_1.68.0
[37] zoo_1.8-8 survival_3.2-10 iterators_1.0.13
[40] glue_1.4.2 survminer_0.4.8 gtable_0.3.0
[43] zlibbioc_1.32.0 XVector_0.26.0 car_3.0-10
[46] abind_1.4-5 scales_1.1.1 DESeq_1.38.0
[49] DBI_1.1.1 edgeR_3.28.1 rstatix_0.7.0
[52] ggthemes_4.2.4 Rcpp_1.0.6 xtable_1.8-4
[55] progress_1.2.2 foreign_0.8-81 bit_4.0.4
[58] km.ci_0.5-2 htmlwidgets_1.5.3 httr_1.4.2
[61] RColorBrewer_1.1-2 ellipsis_0.3.1 pkgconfig_2.0.3
[64] XML_3.99-0.3 R.methodsS3_1.8.1 dbplyr_2.1.0
[67] locfit_1.5-9.4 tidyselect_1.1.0 rlang_0.4.10
[70] AnnotationDbi_1.48.0 munsell_0.5.0 cellranger_1.1.0
[73] tools_3.6.2 cachem_1.0.4 cli_2.3.0
[76] downloader_0.4 generics_0.1.0 RSQLite_2.2.3
[79] broom_0.7.4 stringr_1.4.0 fastmap_1.1.0
[82] knitr_1.31 bit64_4.0.5 zip_2.1.1
[85] survMisc_0.5.5 purrr_0.3.4 packrat_0.5.0
[88] nlme_3.1-152 EDASeq_2.20.0 R.oo_1.24.0
[91] postlogic_0.1.0.1 xml2_1.3.2 biomaRt_2.42.1
[94] rstudioapi_0.13 compiler_3.6.2 curl_4.3
[97] png_0.1-7 parsetools_0.1.3 testthat_3.0.2
[100] ggsignif_0.6.0 tibble_3.0.6 geneplotter_1.64.0
[103] stringi_1.5.3 GenomicFeatures_1.38.2 forcats_0.5.1
[106] lattice_0.20-41 Matrix_1.3-2 KMsurv_0.1-5
[109] vctrs_0.3.6 purrrogress_0.1.1 pillar_1.4.7
[112] lifecycle_1.0.0 data.table_1.13.6 bitops_1.0-6
[115] rtracklayer_1.46.0 R6_2.5.0 latticeExtra_0.6-29
[118] hwriter_1.3.2 ShortRead_1.44.3 gridExtra_2.3
[121] rio_0.5.16 pkgcond_0.1.0 codetools_0.2-18
[124] assertthat_0.2.1 openssl_1.4.3 GenomicAlignments_1.22.1
[127] Rsamtools_2.2.3 GenomeInfoDbData_1.2.2 mgcv_1.8-33
[130] hms_1.0.0 testextra_0.1.0.1 grid_3.6.2
[133] tidyr_1.1.2 carData_3.0-4 ggpubr_0.4.0
Ah! Could you please update the package? 2.2.10 is quite old, the current BC version is 2.6.05.
Thank you ,
How can i get precise number of mutations per megabase
This is why you will have to update the package. It returns all the details with some stats. edit: Could you please use the discussions feature if its not about bug reports?
Dear PoisonAlien, I really appreciate your quick reply ! Thank you . Yes sure i will use Discussions. I got Number median Mutaion by saving to a variable tcgacompare () function.
Dear PoisonAlien
I have a question regarding tcgaCompare(). In tcga_cohort.txt.gz, I see a table of mutation counts for each sample. How are they calculated? Looking at the original maf, I guess these numbers are # of missense + # of nonsense muts.
tcgaCompare() plots sum of # INS/DELS/missense/nonsense counts for a given cohort. I am afraid that tcgaCompare() inflates the mutation counts for the cohort to be analyzed if my guess is correct.