BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Discrepancy of mutational landscape reported initially by TCGA-LAML and the TCGAbiolinks download? #585

Open ChristianRohde opened 1 year ago

ChristianRohde commented 1 year ago

Hi,

I used TCGAbiolinks to download mutations from the TCGA-LAML cohort. This is my code:

```r
# Query the open-access masked somatic mutation calls for TCGA-LAML
query <- TCGAbiolinks::GDCquery(
  project = "TCGA-LAML",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
TCGAbiolinks::GDCdownload(query)
maf <- TCGAbiolinks::GDCprepare(query)

# Build the MAF object first, then plot it, so MAF_TCGA holds the MAF
# rather than the return value of oncoplot()
MAF_TCGA <- maftools::read.maf(maf, isTCGA = TRUE)
maftools::oncoplot(MAF_TCGA, removeNonMutated = FALSE)
```

The first issue is that I get 131 samples based on the unique Tumor_Sample_Barcode values. However, if I load the data into maftools with isTCGA = TRUE, only 124 samples remain. I am not sure how to explain this. Anyway, this is the result:

[Image: oncoplot of the MAF downloaded with TCGAbiolinks]
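One possible explanation for the 131 vs. 124 difference, stated as an assumption rather than something verified against this download: the maftools documentation says that `isTCGA = TRUE` keeps only the first 12 characters of `Tumor_Sample_Barcode`, so several aliquots from the same patient would collapse into one sample. The sketch below shows the effect on made-up barcodes; the barcode values are hypothetical and not taken from TCGA-LAML.

```r
# Toy barcodes, invented for illustration (not from the real download).
# The first two share the same 12-character patient prefix.
barcodes <- c(
  "TCGA-AB-0001-03A-01D-0001-09",
  "TCGA-AB-0001-03B-01D-0002-09",
  "TCGA-AB-0002-03A-01D-0003-09"
)

# Count unique full-length barcodes (aliquot level)
n_aliquots <- length(unique(barcodes))

# Count unique 12-character prefixes (patient level)
n_patients <- length(unique(substr(barcodes, 1, 12)))

print(c(aliquots = n_aliquots, patients = n_patients))
```

Running the same two counts on `maf$Tumor_Sample_Barcode` from the real download would show whether truncation alone accounts for the seven missing samples.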

The MAF contains very few cases with FLT3 mutations in particular. NPM1 and DNMT3A mutations are at the top of the oncoplot, but I would still have expected much higher case numbers given the 2013 NEJM publication (https://pubmed.ncbi.nlm.nih.gov/23634996/, Figure 1B):

[Image: screenshot of Figure 1B from the 2013 NEJM publication]

This is a bit disturbing. However, I still have an old MAF on my computer, which I downloaded in 2015 (if the timestamp is correct) from TCGA's old web page, called "genome.wustl.edu_LAML.IlluminaGA_DNASeq.Level_2.2.13.0.somatic.maf". It includes 197 samples, and the isTCGA parameter does not change the number of samples in maftools. For FLT3, this file contains 39 In_Frame_Ins and 17 Missense_Mutation entries. Furthermore, I can subset the file with the sample IDs from TCGAbiolinks and get an overlap of 119 samples; in these 119 samples, 19 In_Frame_Ins and 11 Missense_Mutation entries for FLT3 remain. The oncoplot from this old MAF looks much more like what I would have expected:

[Image: oncoplot of genome.wustl.edu_LAML.IlluminaGA_DNASeq.Level_2.2.13.0.somatic.maf]
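For reference, the subsetting and counting step described above can be sketched roughly as follows. This is a minimal base R illustration on an invented table; the rows, the `patient` helper column, and the `new_patients` vector are hypothetical stand-ins for the old MAF and the TCGAbiolinks sample IDs, and matching on the first 12 barcode characters is an assumption about how the IDs line up.

```r
# Invented stand-in for the old MAF (real files have many more columns)
old_maf <- data.frame(
  Hugo_Symbol = c("FLT3", "FLT3", "NPM1", "FLT3"),
  Variant_Classification = c("In_Frame_Ins", "Missense_Mutation",
                             "Frame_Shift_Ins", "In_Frame_Ins"),
  Tumor_Sample_Barcode = c("TCGA-AB-0001-03A-01D-0001-09",
                           "TCGA-AB-0002-03A-01D-0002-09",
                           "TCGA-AB-0002-03A-01D-0002-09",
                           "TCGA-AB-0003-03A-01D-0003-09"),
  stringsAsFactors = FALSE
)

# Hypothetical patient IDs present in the TCGAbiolinks download
new_patients <- c("TCGA-AB-0001", "TCGA-AB-0002")

# Reduce barcodes to patient level and keep only the shared patients
old_maf$patient <- substr(old_maf$Tumor_Sample_Barcode, 1, 12)
shared <- old_maf[old_maf$patient %in% new_patients, ]
n_shared <- length(unique(shared$patient))

# Tabulate FLT3 variants by classification within the shared patients
flt3 <- shared[shared$Hugo_Symbol == "FLT3", ]
flt3_counts <- table(flt3$Variant_Classification)
print(flt3_counts)
```

The same tabulation applied to the new MAF would give a per-classification comparison of FLT3 calls between the two files for the shared samples.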

Is there something wrong with my approach to download the data from TCGAbiolinks? Is it possible to change some TCGAbiolinks::GDCquery() command parameters to extract the mutations as reported initially?

Best, Christian