TCGAanalyze_DEA for SKCM

MartinBaGar commented 2 years ago

Hello ! First of all, thank you very much for your great tool, I haven't been able to use it successfully yet but it seems very helpful. I have been using R since 2 month for an internship , its the first time for me so I hope I won't ask stupid question.

I wanted to perform a DEA analyze for the TCGA-SKCM project but I'm stuck at the TCGAanalyze_DEA function that indicates "Error in edgeR::DGEList(counts = TOC, group = tumorType) : Length of 'group' must equal number of columns in 'counts'". I have been using the TCGAbiolinks vignette, the TCGAworflow vignette and the Case Studies to try to understand but I couldn't make it run properly. Also, I have run the Case Study n.1 on my device and it works perfectly so I think its related to the TCGA-SKCM project.

Here is my script:

query.skcm.full <- GDCquery(
  project = "TCGA-SKCM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  sample.type = c("Primary Tumor","Solid Tissue Normal")
) 

GDCdownload(
    query = query.skcm.full,
    files.per.chunk = 100
)

skcm.exp <- GDCprepare(
    query = query.skcm.full, 
    save = TRUE, 
    save.filename = "skcmExp.rda"
)

# get subtype information
infomation.subtype <- TCGAquery_subtype(tumor = "SKCM")

# get clinical data
information.clinical <- GDCquery_clinic(project = "TCGA-SKCM",type = "clinical") 

# Which samples are Primary Tumor
samples.primary.tumour <- skcm.exp$barcode[skcm.exp$shortLetterCode == "TP"]

# which samples are solid tissue normal
samples.solid.tissue.normal <- skcm.exp$barcode[skcm.exp$shortLetterCode == "NT"]

dataPrep <- TCGAanalyze_Preprocessing(skcm.exp)

dataNorm <- TCGAanalyze_Normalization(
    tabDF = dataPrep,
    geneInfo = geneInfoHT,
)                

dataFilt <- TCGAanalyze_Filtering(
    tabDF = dataNorm,
    method = "quantile", 
    qnt.cut =  0.25
)   

dataDEGs <- TCGAanalyze_DEA(
    mat1 = dataFilt[,samples.solid.tissue.normal],
    mat2 = dataFilt[,samples.primary.tumour],
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01 ,
    logFC.cut = 2,
    method = "glmLRT",
    pipeline = "edgeR"
)

Here is my session info:

R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale: [1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8 LC_MONETARY=French_France.utf8 LC_NUMERIC=C
[5] LC_TIME=French_France.utf8

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] DT_0.23 dplyr_1.0.9 TCGAWorkflow_1.19.1 TCGAbiolinks_2.24.3
[5] maftools_2.12.0 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[9] GenomeInfoDb_1.32.2 IRanges_2.30.0 S4Vectors_0.34.0 BiocGenerics_0.42.0
[13] MatrixGenerics_1.8.1 matrixStats_0.62.0 edgeR_3.38.4 limma_3.52.2

loaded via a [1] Hmisc_4.7-0 [4] crayon_1.5.1 [7] backports_1.4.1 [10] rlang_1.0.4 [13] BiocParallel_1.30.3 [16] CNEr_1.32.0 [19] pathview_1.36.0 [22] AnnotationDbi_1.58.0 [25] tidyselect_1.1.2 [28] ggpubr_0.4.0 [31] TCGAWorkflowData_1.20.0 [34] ggplot2_3.3.6 [37] hwriter_1.3.2.1 [40] rpart_4.1.16 [43] RJSONIO_1.3-1.6 [46] xfun_0.31 [49] caTools_1.18.2 [52] tibble_3.1.8 [55] biovizBase_1.44.0 [58] Biostrings_2.64.0 [61] bitops_1.0-7 [64] AnnotationFilter_1.20.0 [67] gplots_3.1.3 [70] GenomicFeatures_1.48.3 [73] vctrs_0.4.1 [76] ELMER.data_2.20.0 [79] munsell_0.5.0 [82] DelayedArray_0.22.0 [85] abind_1.4-5 [88] TxDb.Hsapiens.UC [91] plotly_4.10.0 [94] DNAcopy_1.70.0 [97] utf8_1.2.2 [100] jsonlite_1.8.0 [103] tidytree_0.4.0 [106] lazyeval_0.2.2 [109] doParallel_1.0.17 [112] R.utils_2.12.0 [115] rGADEM_2.44.0 [118] downloader_0.4 [121] survival_3.3-1 [124] htmltools_0.5.3 [127] BiocIO_1.6.0 [130] viridisLite_0.4.0 [133] assertthat_0.2.1 [136] RSQLite_2.2.15 [139] data.table_1.14.2 [142] TFBSTools_1.34.0 [145] RaggedExperiment_1.20.0 [148] RCurl_1.98-1.8 [151] colorspace_2.0-3 [154] shape_1.4.6 [157] ELMER_2.20.0 [160] circlize_0.4.15 [163] tzdb_0.3.0 [166] grid_4.2.1 [169] curl_4.3.2 [172] Matrix_1.4-1 [175] RColorBrewer_1.1-3 [178] htmlwidgets_1.5.4 [181] purrr_0.3.4 [184] seqLogo_1.62.0 [187] htmlTable_2.4.1 [190] RTCGAToolbox_2.26.0 [193] gtools_3.9.3 [196] R.methodsS3_1.8.2 [199] aroma.light_3.26.0 [202] KernSmooth_2.23-20 [205] reshape2_1.4.4 [208] viridis_0.6.2 [211] ggtree_3.4.2 [214] restfulr_0.0.15 [217] readr_2.1.2 [220] bit_4.0.4 [223] ggraph_2.0.5 [226] rstatix_0.7.0 namespace (and not attached): Rsamtools_2.12.0 foreach_1.5.2
MASS_7.3-58 nlme_3.1-158
minet_3.54.0 GOSemSim_2.22.0
XVector_0.36.0 filelock_1.0.2
rjson_0.2.21 c3net_1.1.1.1
bit64_4.0.5 glue_1.6.2
poweRlaw_0.70.6 parallel_4.2.1
motifStack_1.40.0 DOSE_3.22.0
XML_3.99-0.10 tidyr_1.2.0
GenomicAlignments_1.32.0 xtable_1.8-4
magrittr_2.0.3 evaluate_0.16
cli_3.3.0 zlibbioc_1.42.0
rstudioapi_0.13 MultiAssayExperiment_1.22.0
fastmatch_1.1-3 ensembldb_2.20.2
treeio_1.20.2 shiny_1.7.2
clue_0.3-61 cluster_2.1.3
tidygraph_1.2.1 KEGGREST_1.36.3
interactiveDisplayBase_1.34.0 ggrepel_0.9.1
ape_5.6-2 TFMPvalue_0.0.8
png_0.1-7 reshape_0.8.9
ggforce_0.3.3 plyr_1.8.7
pracma_2.3.8 pillar_1.8.0
GlobalOptions_0.1.2 cachem_1.0.6
GetoptLong_1.0.5 clusterProfiler_4.4.4
ellipsis_0.3.2 generics_0.1.3
tools_4.2.1 foreign_0.8-82
tweenr_1.0.2 fgsea_1.22.0
fastmap_1.1.0 compiler_4.2.1
httpuv_1.6.5 rtracklayer_1.56.1
SC.hg19.knownGene_3.2.2 gt_0.6.0 Gviz_1.40.1
GenomeInfoDbData_1.2.8 gridExtra_2.3
lattice_0.20-45 deldir_1.0-6
later_1.3.0 BiocFileCache_2.4.0
scales_1.2.0 graph_1.74.0
pbapply_1.5-0 carData_3.0-5
promises_1.2.0.1 car_3.1-0
RCircos_1.2.2 latticeExtra_0.6-30
checkmate_2.1.0 rmarkdown_2.15
pander_0.6.5 dichromat_2.0-0.1
BSgenome_1.64.0 igraph_1.3.4
yaml_2.3.5 plotrix_3.8-2
memoise_2.0.1 VariantAnnotation_1.42.1
locfit_1.5-9.6 graphlayouts_0.8.0
BSgenome.Hsapiens.UCSC.hg19_1.4.3 digest_0.6.29
mime_0.12 rappdirs_0.3.3
yulab.utils_0.0.5 TCGAbiolinksGUI.data_1.16.0
blob_1.2.3 R.oo_1.25.0
splines_4.2.1 Formula_1.2-4
AnnotationHub_3.4.0 ProtGenerics_1.28.0
broom_1.0.0 hms_1.1.1
base64enc_0.1-3 BiocManager_1.30.18
EDASeq_2.30.0 aplot_0.1.6
nnet_7.3-17 Rcpp_1.0.9
enrichplot_1.16.1 fansi_1.0.3
ChIPseeker_1.32.0 R6_2.5.1
lifecycle_1.0.1 ShortRead_1.54.0
ggsignif_0.6.3 DO.db_2.9
qvalue_2.28.0 org.Hs.eg.db_3.15.0
iterators_1.0.14 stringr_1.4.0
polyclip_1.10-0 biomaRt_2.52.0
shadowtext_0.1.2 gridGraphics_0.5-1
rvest_1.0.2 ComplexHeatmap_2.12.1
patchwork_1.1.1 KEGGgraph_1.56.0
codetools_0.2-18 GO.db_3.15.0
prettyunits_1.1.1 dbplyr_2.2.1
gtable_0.3.0 DBI_1.1.3
ggfun_0.0.6 httr_1.4.4
stringi_1.7.8 progress_1.2.2
farver_2.1.1 annotate_1.74.0
ggthemes_4.2.4 Rgraphviz_2.40.0
xml2_1.3.3 boot_1.3-28
interp_1.1-3 ade4_1.7-19
ggplotify_0.1.0 BiocVersion_3.15.2
scatterpie_0.1.7 jpeg_0.1-9
pkgconfig_2.0.3 DirichletMultinomial_1.38.0
knitr_1.39

Thank you very much in advance ! Of coursen tell me if you need more information.

Best, Martin

tiagochst commented 2 years ago

Hi Martin,

Since samples.solid.tissue.normal is only one sample. You need to set drop=FALSE, dataFilt[,samples.solid.tissue.normal,drop = FALSE], because dataFilt[,samples.solid.tissue.normal,], will return a vector instead of a matrix.

library(TCGAbiolinks)
query.skcm.full <- GDCquery(
  project = "TCGA-SKCM",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  sample.type = c("Primary Tumor","Solid Tissue Normal")
) 

GDCdownload(
    query = query.skcm.full,
    files.per.chunk = 100
)

skcm.exp <- GDCprepare(
    query = query.skcm.full, 
    save = TRUE, 
    save.filename = "skcmExp.rda"
)

table(skcm.exp$definition)

# Which samples are Primary Tumor
samples.primary.tumour <- skcm.exp$barcode[skcm.exp$shortLetterCode == "TP"]

# which samples are solid tissue normal
samples.solid.tissue.normal <- skcm.exp$barcode[skcm.exp$shortLetterCode == "NT"]

dataPrep <- TCGAanalyze_Preprocessing(skcm.exp)

dataNorm <- TCGAanalyze_Normalization(
    tabDF = dataPrep,
    geneInfo = geneInfoHT,
)                

dataFilt <- TCGAanalyze_Filtering(
    tabDF = dataNorm,
    method = "quantile", 
    qnt.cut =  0.25
)   

dataDEGs <- TCGAanalyze_DEA(
    mat1 = dataFilt[,samples.solid.tissue.normal,drop = FALSE],
    mat2 = dataFilt[,samples.primary.tumour],
    Cond1type = "Normal",
    Cond2type = "Tumor",
    fdr.cut = 0.01 ,
    logFC.cut = 2,
    method = "glmLRT",
    pipeline = "edgeR"
)  
head(dataDEGs)

MartinBaGar commented 2 years ago

Thank you so much for your quick answer and for the solution, it's working perfectly ! I love your tool !

BioinformaticsFMRP / TCGAbiolinks

TCGAanalyze_DEA for SKCM #529