Error in obtaining HTSeq - count data for TCGA-PRAD

ArvinZaker commented 2 years ago

Hello,

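I was following the tutorial provided on the TCGAbiolinks website (https://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html#HTSeq_data:_Downstream_analysis_BRCA) to download the HTSeq data from the TCGA Prostate Adenocarcinoma dataset.

The code I ran was the same as in the tutorial, except for the project, which I changed to TCGA-PRAD: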

library(TCGAbiolinks)
query <- GDCquery(
  project = "TCGA-PRAD",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification", 
  workflow.type = "STAR - Counts"
)

samplesDown <- getResults(query,cols=c("cases"))

dataSmTP <- TCGAquery_SampleTypes(
  barcode = samplesDown,
  typesample = "TP"
)

dataSmNT <- TCGAquery_SampleTypes(
  barcode = samplesDown,
  typesample = "NT"
)
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]

query.selected.samples <- GDCquery(
  project = "TCGA-PRAD", 
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification", 
  workflow.type = "STAR - Counts", 
  barcode = c(dataSmTP_short, dataSmNT_short)
)

GDCdownload(
  query = query.selected.samples
)

dataPrep <- GDCprepare(
  query = query.selected.samples, 
  save = TRUE
)

dataPrep <- TCGAanalyze_Preprocessing(
  object = dataPrep, 
  cor.cut = 0.6,
  datatype = "HTSeq - Counts"
)

Running TCGAanalyze_Preprocessing fails with an error saying that "HTSeq - Counts" is not among the available assays:

Error in TCGAanalyze_Preprocessing(object = dataPrep, cor.cut = 0.6, datatype = "HTSeq - Counts") : 
  HTSeq - Counts not found in the assay list: unstranded, stranded_first, stranded_second, tpm_unstrand, fpkm_unstrand, fpkm_uq_unstrand
  Please set the correct datatype argument.

I would like to know what changes are needed to correct this error and extract the HTSeq - count data.

Here is the sessionInfo():

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.10.1
LAPACK: /usr/lib/liblapack.so.3.10.1

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices
[4] utils     datasets  methods  
[7] base     

other attached packages:
[1] TCGAbiolinks_2.24.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9                 
 [2] lattice_0.20-45            
 [3] tidyr_1.2.0                
 [4] prettyunits_1.1.1          
 [5] png_0.1-7                  
 [6] Biostrings_2.64.0          
 [7] assertthat_0.2.1           
 [8] digest_0.6.29              
 [9] utf8_1.2.2                 
[10] BiocFileCache_2.4.0        
[11] plyr_1.8.7                 
[12] R6_2.5.1                   
[13] GenomeInfoDb_1.32.2        
[14] stats4_4.2.1               
[15] RSQLite_2.2.15             
[16] httr_1.4.3                 
[17] ggplot2_3.3.6              
[18] pillar_1.8.0               
[19] zlibbioc_1.42.0            
[20] rlang_1.0.4                
[21] progress_1.2.2             
[22] curl_4.3.2                 
[23] data.table_1.14.2          
[24] rstudioapi_0.13            
[25] blob_1.2.3                 
[26] S4Vectors_0.34.0           
[27] Matrix_1.4-1               
[28] downloader_0.4             
[29] readr_2.1.2                
[30] stringr_1.4.0              
[31] RCurl_1.98-1.7             
[32] bit_4.0.4                  
[33] biomaRt_2.52.0             
[34] munsell_0.5.0              
[35] DelayedArray_0.22.0        
[36] xfun_0.31                  
[37] compiler_4.2.1             
[38] pkgconfig_2.0.3            
[39] BiocGenerics_0.42.0        
[40] tidyselect_1.1.2           
[41] SummarizedExperiment_1.26.1
[42] KEGGREST_1.36.3            
[43] tibble_3.1.7               
[44] GenomeInfoDbData_1.2.8     
[45] IRanges_2.30.0             
[46] matrixStats_0.62.0         
[47] XML_3.99-0.10              
[48] fansi_1.0.3                
[49] dbplyr_2.2.1               
[50] crayon_1.5.1               
[51] dplyr_1.0.9                
[52] tzdb_0.3.0                 
[53] rappdirs_0.3.3             
[54] bitops_1.0-7               
[55] grid_4.2.1                 
[56] jsonlite_1.8.0             
[57] gtable_0.3.0               
[58] lifecycle_1.0.1            
[59] DBI_1.1.3                  
[60] magrittr_2.0.3             
[61] scales_1.2.0               
[62] cli_3.3.0                  
[63] TCGAbiolinksGUI.data_1.16.0
[64] stringi_1.7.8              
[65] cachem_1.0.6               
[66] XVector_0.36.0             
[67] xml2_1.3.3                 
[68] filelock_1.0.2             
[69] ellipsis_0.3.2             
[70] generics_0.1.3             
[71] vctrs_0.4.1                
[72] tools_4.2.1                
[73] bit64_4.0.5                
[74] Biobase_2.56.0             
[75] glue_1.6.2                 
[76] purrr_0.3.4                
[77] hms_1.1.1                  
[78] MatrixGenerics_1.8.1       
[79] fastmap_1.1.0              
[80] AnnotationDbi_1.58.0       
[81] colorspace_2.0-3           
[82] GenomicRanges_1.48.0       
[83] rvest_1.0.2                
[84] memoise_2.0.1              
[85] knitr_1.39

tiagochst commented 2 years ago

GDC no longer provides "HTSeq - Counts", only "STAR - Counts". For TCGA data, you should use the unstranded assay.
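To see which names are valid for the datatype argument, it can help to list the assays stored in the prepared object before calling TCGAanalyze_Preprocessing. The following is only a minimal sketch: it builds a small toy SummarizedExperiment as a stand-in for the object returned by GDCprepare (the gene and sample names and all values are made up), and check_datatype is a hypothetical helper, not part of TCGAbiolinks.

```r
library(SummarizedExperiment)

# Toy stand-in for the GDCprepare() result (assumption: made-up values;
# the real object has one row per gene and one column per sample).
set.seed(1)
counts <- matrix(
  rpois(12, lambda = 50),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)
se <- SummarizedExperiment(
  assays = list(
    unstranded   = counts,      # raw counts
    tpm_unstrand = counts / 10  # placeholder values, not real TPM
  )
)

# List the assays that are actually present in the object.
available <- assayNames(se)
print(available)

# Hypothetical helper: fail early, with the list of valid names,
# if the requested assay is not present.
check_datatype <- function(object, datatype) {
  available <- assayNames(object)
  if (!datatype %in% available) {
    stop(
      "'", datatype, "' not found; available assays: ",
      paste(available, collapse = ", ")
    )
  }
  datatype
}

ok  <- check_datatype(se, "unstranded")
bad <- tryCatch(
  check_datatype(se, "HTSeq - Counts"),
  error = function(e) conditionMessage(e)
)
print(bad)
```

On the real object, assayNames(dataPrep) should show the same names as the error message above (unstranded, stranded_first, and so on).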

tiagochst commented 1 year ago

You can use the unstranded raw counts column.

dataPrep <- TCGAanalyze_Preprocessing(
  object = dataPrep,
  cor.cut = 0.6,
  datatype = "unstranded"
)
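If you only need the raw count matrix itself, one option is to pull the assay directly from the SummarizedExperiment by name. This is a hedged sketch on a toy object, not the TCGAbiolinks workflow: the data below are made up, and with real data you would call assay() on the GDCprepare result before it is overwritten by the preprocessing step.

```r
library(SummarizedExperiment)

# Toy stand-in for the GDCprepare() result (assumption: made-up values).
set.seed(42)
counts <- matrix(
  rpois(20, lambda = 100),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4))
)
se <- SummarizedExperiment(assays = list(unstranded = counts))

# Extract the raw count matrix by assay name.
raw_counts <- assay(se, "unstranded")

# Sanity checks: genes in rows, samples in columns, whole-number counts.
print(dim(raw_counts))
is_whole <- all(raw_counts == round(raw_counts))
print(is_whole)
```

A whole-number check like this is a quick way to confirm that you picked the raw counts rather than one of the normalized assays such as tpm_unstrand.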

(Sent by email in reply to ArvinZaker's original post of Thu, Jul 21, 2022, 2:33 PM.)

ArvinZaker commented 1 year ago

Thank you for the information! The issue is resolved!

BioinformaticsFMRP / TCGAbiolinks #527