BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

GDCquery error when querying mutation data #498

Closed: komalsrathi closed this issue 2 years ago

komalsrathi commented 2 years ago

I used to query the TCGA mutation file using:

query <- GDCquery(project = "TCGA-GBM",
                  data.category = "Simple Nucleotide Variation",
                  access = "open",
                  legacy = FALSE,
                  data.type = "Masked Somatic Mutation",
                  workflow.type = "MuTect2 Variant Aggregation and Masking")

This is giving me the following error:

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
Error in GDCquery(project = "TCGA-GBM", data.category = "Simple Nucleotide Variation",  : 
  Please set a valid workflow.type argument from the list below:
  => Aliquot Ensemble Somatic Variant Merging and Masking

I also tried the following but got the same error:

maf <- GDCquery_Maf(tumor = "GBM", pipelines = "mutect2")

============================================================================
 For more information about MAF data please read the following GDC manual and web pages:
 GDC manual: https://gdc-docs.nci.nih.gov/Data/PDF/Data_UG.pdf
 https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/
 https://gdc.cancer.gov/about-gdc/variant-calling-gdc
============================================================================
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
Error in GDCquery(paste0("TCGA-", tumor), data.category = "Simple Nucleotide Variation",  : 
  Please set a valid workflow.type argument from the list below:
  => Aliquot Ensemble Somatic Variant Merging and Masking

Session Info:

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TCGAbiolinks_2.22.4

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.4.3        Biobase_2.52.0              httr_1.4.2                 
 [4] tidyr_1.1.4                 bit64_4.0.5                 jsonlite_1.7.3             
 [7] R.utils_2.11.0              assertthat_0.2.1            BiocManager_1.30.16        
[10] highr_0.9                   stats4_4.1.3                BiocFileCache_2.0.0        
[13] blob_1.2.2                  GenomeInfoDbData_1.2.6      progress_1.2.2             
[16] pillar_1.6.5                RSQLite_2.2.9               lattice_0.20-45            
[19] glue_1.6.1                  downloader_0.4              digest_0.6.29              
[22] GenomicRanges_1.44.0        XVector_0.32.0              rvest_1.0.2                
[25] colorspace_2.0-2            plyr_1.8.6                  Matrix_1.4-0               
[28] R.oo_1.24.0                 XML_3.99-0.8                pkgconfig_2.0.3            
[31] biomaRt_2.48.3              zlibbioc_1.38.0             purrr_0.3.4                
[34] scales_1.1.1                tzdb_0.2.0                  tibble_3.1.6               
[37] KEGGREST_1.32.0             generics_0.1.1              TCGAbiolinksGUI.data_1.12.0
[40] IRanges_2.26.0              ggplot2_3.3.5               ellipsis_0.3.2             
[43] cachem_1.0.6                SummarizedExperiment_1.22.0 BiocGenerics_0.38.0        
[46] cli_3.1.1                   magrittr_2.0.2              crayon_1.4.2               
[49] memoise_2.0.1               R.methodsS3_1.8.1           fansi_1.0.2                
[52] xml2_1.3.3                  tools_4.1.3                 data.table_1.14.2          
[55] prettyunits_1.1.1           hms_1.1.1                   lifecycle_1.0.1            
[58] matrixStats_0.61.0          stringr_1.4.0               S4Vectors_0.30.2           
[61] munsell_0.5.0               DelayedArray_0.18.0         AnnotationDbi_1.54.1       
[64] Biostrings_2.60.2           compiler_4.1.3              GenomeInfoDb_1.28.4        
[67] rlang_1.0.0                 grid_4.1.3                  RCurl_1.98-1.5             
[70] rstudioapi_0.13             rappdirs_0.3.3              bitops_1.0-7               
[73] gtable_0.3.0                DBI_1.1.2                   curl_4.3.2                 
[76] R6_2.5.1                    knitr_1.37                  dplyr_1.0.7                
[79] fastmap_1.1.0               bit_4.0.4                   utf8_1.2.2                 
[82] rprojroot_2.0.2             filelock_1.0.2              readr_2.1.2                
[85] stringi_1.7.6               parallel_4.1.3              Rcpp_1.0.8                 
[88] vctrs_0.3.8                 png_0.1-7                   dbplyr_2.1.1               
[91] tidyselect_1.1.1            xfun_0.29   

Can you please suggest an alternative way to obtain the MuTect2 output for the TCGA GBM data?

674040463 commented 2 years ago

Did you solve your problem? I ran into the same issue.

komalsrathi commented 2 years ago

No, I haven't found a solution.

674040463 commented 2 years ago

Hi, the problem may be the directory used by GDCprepare. I found some solutions here: https://support.bioconductor.org/p/9143021/

tiagochst commented 2 years ago

@komalsrathi It seems GDC changed the data. The output description is here: https://github.com/NCI-GDC/gdc-maf-tool The code below should work with the latest version of TCGAbiolinks (2.23.12), which can be installed with:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
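
Before reinstalling, it may help to confirm which version is currently installed. The following is a minimal sketch using base R only; `has_min_version` is a hypothetical helper name introduced here for illustration, not part of TCGAbiolinks.

```r
# Minimum version mentioned in this thread.
required <- package_version("2.23.12")

# Hypothetical helper: TRUE when `pkg` is installed at or above `minimum`.
has_min_version <- function(pkg, minimum) {
  # packageVersion() raises an error when the package is not installed,
  # so treat that case as "no version available".
  installed <- tryCatch(utils::packageVersion(pkg), error = function(e) NULL)
  !is.null(installed) && installed >= minimum
}

if (!has_min_version("TCGAbiolinks", required)) {
  message("TCGAbiolinks is missing or older than ", required,
          "; consider updating with the install command above.")
}
```

If the check still reports an older version after installing, restarting the R session is a reasonable first thing to try, since an already loaded namespace is not replaced in place.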

I am removing GDCquery_Maf from the package.

query <- GDCquery(
     project = "TCGA-CHOL", 
     data.category = "Simple Nucleotide Variation", 
     access = "open", 
     legacy = FALSE, 
     data.type = "Masked Somatic Mutation", 
     workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)
GDCdownload(query)
maf <- GDCprepare(query)

Also, there is a column with the callers:

[Screenshot: MAF column listing the callers]

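
Since the original question was about MuTect2 output, one possible approach is to subset the prepared MAF to variants that list MuTect2 among their callers. This is only a sketch under assumptions: it assumes the column is named `callers` and holds semicolon-separated caller names (inspect `unique(maf$callers)` to confirm the actual format), `filter_by_caller` is a hypothetical helper, and the toy data frame below is made up to stand in for the output of `GDCprepare(query)`. The result is not necessarily identical to the old MuTect2-only MAF.

```r
# Hypothetical helper: keep rows whose caller list contains `caller`.
# Assumes `column` holds semicolon-separated names, e.g. "muse;mutect2".
filter_by_caller <- function(maf, caller = "mutect2", column = "callers") {
  stopifnot(is.data.frame(maf), column %in% names(maf))
  # Split each entry on ";" and look for an exact, case-insensitive match.
  caller_sets <- strsplit(tolower(as.character(maf[[column]])), ";", fixed = TRUE)
  keep <- vapply(
    caller_sets,
    function(x) tolower(caller) %in% trimws(x),
    logical(1)
  )
  maf[keep, , drop = FALSE]
}

# Toy data standing in for GDCprepare(query); gene names and values are invented.
toy_maf <- data.frame(
  Hugo_Symbol = c("GENE_A", "GENE_B", "GENE_C"),
  callers = c("muse;mutect2;varscan2", "varscan2", "mutect2"),
  stringsAsFactors = FALSE
)

mutect2_calls <- filter_by_caller(toy_maf, "mutect2")
```

With real data, the call would be `filter_by_caller(maf, "mutect2")` on the object returned by `GDCprepare(query)`, provided the column assumptions above hold.
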
komalsrathi commented 2 years ago

@tiagochst Thank you so much!! Closing as it works now.