BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
289 stars 110 forks source link

TCGAanalyze_DEA for SKCM #529

Closed MartinBaGar closed 2 years ago

MartinBaGar commented 2 years ago

Hello ! First of all, thank you very much for your great tool, I haven't been able to use it successfully yet but it seems very helpful. I have been using R since 2 month for an internship , its the first time for me so I hope I won't ask stupid question.

I wanted to perform a DEA analyze for the TCGA-SKCM project but I'm stuck at the TCGAanalyze_DEA function that indicates "Error in edgeR::DGEList(counts = TOC, group = tumorType) : Length of 'group' must equal number of columns in 'counts'". I have been using the TCGAbiolinks vignette, the TCGAworflow vignette and the Case Studies to try to understand but I couldn't make it run properly. Also, I have run the Case Study n.1 on my device and it works perfectly so I think its related to the TCGA-SKCM project.

Here is my script:

query.skcm.full <- GDCquery(
  project = "TCGA-SKCM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  sample.type = c("Primary Tumor","Solid Tissue Normal")
) 

GDCdownload(
    query = query.skcm.full,
    files.per.chunk = 100
)

skcm.exp <- GDCprepare(
    query = query.skcm.full, 
    save = TRUE, 
    save.filename = "skcmExp.rda"
)

# get subtype information
infomation.subtype <- TCGAquery_subtype(tumor = "SKCM")

# get clinical data
information.clinical <- GDCquery_clinic(project = "TCGA-SKCM",type = "clinical") 

# Which samples are Primary Tumor
samples.primary.tumour <- skcm.exp$barcode[skcm.exp$shortLetterCode == "TP"]

# which samples are solid tissue normal
samples.solid.tissue.normal <- skcm.exp$barcode[skcm.exp$shortLetterCode == "NT"]

dataPrep <- TCGAanalyze_Preprocessing(skcm.exp)

dataNorm <- TCGAanalyze_Normalization(
    tabDF = dataPrep,
    geneInfo = geneInfoHT,
)                

dataFilt <- TCGAanalyze_Filtering(
    tabDF = dataNorm,
    method = "quantile", 
    qnt.cut =  0.25
)   

dataDEGs <- TCGAanalyze_DEA(
    mat1 = dataFilt[,samples.solid.tissue.normal],
    mat2 = dataFilt[,samples.primary.tumour],
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01 ,
    logFC.cut = 2,
    method = "glmLRT",
    pipeline = "edgeR"
)  

Here is my session info:

R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale: [1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8 LC_MONETARY=French_France.utf8 LC_NUMERIC=C
[5] LC_TIME=French_France.utf8

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] DT_0.23 dplyr_1.0.9 TCGAWorkflow_1.19.1 TCGAbiolinks_2.24.3
[5] maftools_2.12.0 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[9] GenomeInfoDb_1.32.2 IRanges_2.30.0 S4Vectors_0.34.0 BiocGenerics_0.42.0
[13] MatrixGenerics_1.8.1 matrixStats_0.62.0 edgeR_3.38.4 limma_3.52.2

loaded via a namespace (and not attached): [1] Hmisc_4.7-0 Rsamtools_2.12.0 foreach_1.5.2
[4] crayon_1.5.1 MASS_7.3-58 nlme_3.1-158
[7] backports_1.4.1 minet_3.54.0 GOSemSim_2.22.0
[10] rlang_1.0.4 XVector_0.36.0 filelock_1.0.2
[13] BiocParallel_1.30.3 rjson_0.2.21 c3net_1.1.1.1
[16] CNEr_1.32.0 bit64_4.0.5 glue_1.6.2
[19] pathview_1.36.0 poweRlaw_0.70.6 parallel_4.2.1
[22] AnnotationDbi_1.58.0 motifStack_1.40.0 DOSE_3.22.0
[25] tidyselect_1.1.2 XML_3.99-0.10 tidyr_1.2.0
[28] ggpubr_0.4.0 GenomicAlignments_1.32.0 xtable_1.8-4
[31] TCGAWorkflowData_1.20.0 magrittr_2.0.3 evaluate_0.16
[34] ggplot2_3.3.6 cli_3.3.0 zlibbioc_1.42.0
[37] hwriter_1.3.2.1 rstudioapi_0.13 MultiAssayExperiment_1.22.0
[40] rpart_4.1.16 fastmatch_1.1-3 ensembldb_2.20.2
[43] RJSONIO_1.3-1.6 treeio_1.20.2 shiny_1.7.2
[46] xfun_0.31 clue_0.3-61 cluster_2.1.3
[49] caTools_1.18.2 tidygraph_1.2.1 KEGGREST_1.36.3
[52] tibble_3.1.8 interactiveDisplayBase_1.34.0 ggrepel_0.9.1
[55] biovizBase_1.44.0 ape_5.6-2 TFMPvalue_0.0.8
[58] Biostrings_2.64.0 png_0.1-7 reshape_0.8.9
[61] bitops_1.0-7 ggforce_0.3.3 plyr_1.8.7
[64] AnnotationFilter_1.20.0 pracma_2.3.8 pillar_1.8.0
[67] gplots_3.1.3 GlobalOptions_0.1.2 cachem_1.0.6
[70] GenomicFeatures_1.48.3 GetoptLong_1.0.5 clusterProfiler_4.4.4
[73] vctrs_0.4.1 ellipsis_0.3.2 generics_0.1.3
[76] ELMER.data_2.20.0 tools_4.2.1 foreign_0.8-82
[79] munsell_0.5.0 tweenr_1.0.2 fgsea_1.22.0
[82] DelayedArray_0.22.0 fastmap_1.1.0 compiler_4.2.1
[85] abind_1.4-5 httpuv_1.6.5 rtracklayer_1.56.1
[88] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 gt_0.6.0 Gviz_1.40.1
[91] plotly_4.10.0 GenomeInfoDbData_1.2.8 gridExtra_2.3
[94] DNAcopy_1.70.0 lattice_0.20-45 deldir_1.0-6
[97] utf8_1.2.2 later_1.3.0 BiocFileCache_2.4.0
[100] jsonlite_1.8.0 scales_1.2.0 graph_1.74.0
[103] tidytree_0.4.0 pbapply_1.5-0 carData_3.0-5
[106] lazyeval_0.2.2 promises_1.2.0.1 car_3.1-0
[109] doParallel_1.0.17 RCircos_1.2.2 latticeExtra_0.6-30
[112] R.utils_2.12.0 checkmate_2.1.0 rmarkdown_2.15
[115] rGADEM_2.44.0 pander_0.6.5 dichromat_2.0-0.1
[118] downloader_0.4 BSgenome_1.64.0 igraph_1.3.4
[121] survival_3.3-1 yaml_2.3.5 plotrix_3.8-2
[124] htmltools_0.5.3 memoise_2.0.1 VariantAnnotation_1.42.1
[127] BiocIO_1.6.0 locfit_1.5-9.6 graphlayouts_0.8.0
[130] viridisLite_0.4.0 BSgenome.Hsapiens.UCSC.hg19_1.4.3 digest_0.6.29
[133] assertthat_0.2.1 mime_0.12 rappdirs_0.3.3
[136] RSQLite_2.2.15 yulab.utils_0.0.5 TCGAbiolinksGUI.data_1.16.0
[139] data.table_1.14.2 blob_1.2.3 R.oo_1.25.0
[142] TFBSTools_1.34.0 splines_4.2.1 Formula_1.2-4
[145] RaggedExperiment_1.20.0 AnnotationHub_3.4.0 ProtGenerics_1.28.0
[148] RCurl_1.98-1.8 broom_1.0.0 hms_1.1.1
[151] colorspace_2.0-3 base64enc_0.1-3 BiocManager_1.30.18
[154] shape_1.4.6 EDASeq_2.30.0 aplot_0.1.6
[157] ELMER_2.20.0 nnet_7.3-17 Rcpp_1.0.9
[160] circlize_0.4.15 enrichplot_1.16.1 fansi_1.0.3
[163] tzdb_0.3.0 ChIPseeker_1.32.0 R6_2.5.1
[166] grid_4.2.1 lifecycle_1.0.1 ShortRead_1.54.0
[169] curl_4.3.2 ggsignif_0.6.3 DO.db_2.9
[172] Matrix_1.4-1 qvalue_2.28.0 org.Hs.eg.db_3.15.0
[175] RColorBrewer_1.1-3 iterators_1.0.14 stringr_1.4.0
[178] htmlwidgets_1.5.4 polyclip_1.10-0 biomaRt_2.52.0
[181] purrr_0.3.4 shadowtext_0.1.2 gridGraphics_0.5-1
[184] seqLogo_1.62.0 rvest_1.0.2 ComplexHeatmap_2.12.1
[187] htmlTable_2.4.1 patchwork_1.1.1 KEGGgraph_1.56.0
[190] RTCGAToolbox_2.26.0 codetools_0.2-18 GO.db_3.15.0
[193] gtools_3.9.3 prettyunits_1.1.1 dbplyr_2.2.1
[196] R.methodsS3_1.8.2 gtable_0.3.0 DBI_1.1.3
[199] aroma.light_3.26.0 ggfun_0.0.6 httr_1.4.4
[202] KernSmooth_2.23-20 stringi_1.7.8 progress_1.2.2
[205] reshape2_1.4.4 farver_2.1.1 annotate_1.74.0
[208] viridis_0.6.2 ggthemes_4.2.4 Rgraphviz_2.40.0
[211] ggtree_3.4.2 xml2_1.3.3 boot_1.3-28
[214] restfulr_0.0.15 interp_1.1-3 ade4_1.7-19
[217] readr_2.1.2 ggplotify_0.1.0 BiocVersion_3.15.2
[220] bit_4.0.4 scatterpie_0.1.7 jpeg_0.1-9
[223] ggraph_2.0.5 pkgconfig_2.0.3 DirichletMultinomial_1.38.0
[226] rstatix_0.7.0 knitr_1.39

Thank you very much in advance ! Of coursen tell me if you need more information.

Best, Martin

tiagochst commented 2 years ago

Hi Martin,

Since samples.solid.tissue.normal is only one sample. You need to set drop=FALSE, dataFilt[,samples.solid.tissue.normal,drop = FALSE], because dataFilt[,samples.solid.tissue.normal,], will return a vector instead of a matrix.

library(TCGAbiolinks)
query.skcm.full <- GDCquery(
  project = "TCGA-SKCM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  sample.type = c("Primary Tumor","Solid Tissue Normal")
) 

GDCdownload(
    query = query.skcm.full,
    files.per.chunk = 100
)

skcm.exp <- GDCprepare(
    query = query.skcm.full, 
    save = TRUE, 
    save.filename = "skcmExp.rda"
)

table(skcm.exp$definition)

# Which samples are Primary Tumor
samples.primary.tumour <- skcm.exp$barcode[skcm.exp$shortLetterCode == "TP"]

# which samples are solid tissue normal
samples.solid.tissue.normal <- skcm.exp$barcode[skcm.exp$shortLetterCode == "NT"]

dataPrep <- TCGAanalyze_Preprocessing(skcm.exp)

dataNorm <- TCGAanalyze_Normalization(
    tabDF = dataPrep,
    geneInfo = geneInfoHT,
)                

dataFilt <- TCGAanalyze_Filtering(
    tabDF = dataNorm,
    method = "quantile", 
    qnt.cut =  0.25
)   

dataDEGs <- TCGAanalyze_DEA(
    mat1 = dataFilt[,samples.solid.tissue.normal,drop = FALSE],
    mat2 = dataFilt[,samples.primary.tumour],
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01 ,
    logFC.cut = 2,
    method = "glmLRT",
    pipeline = "edgeR"
)  
head(dataDEGs)
MartinBaGar commented 2 years ago

Thank you so much for your quick answer and for the solution, it's working perfectly ! I love your tool !