BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

error in expandBarcodeInfo in novel TCGA datasets #355

Closed: brdfrdd closed this issue 2 years ago

brdfrdd commented 5 years ago

Hi there,

I am trying to download the more recent datasets: CGCI-BLGSP, CPTAC-3, BEATAML1.0-COHORT, CTSP-DLBCL1, MMRF-COMMPASS and NCICCR-DLBCL.

All of these give the error message: "Error in expandBarcodeInfo(barcodes) : object 'ret' not found"

The query was for data.category="Transcriptome Profiling", data.type="Gene Expression Quantification" and workflow.type="HTSeq - Counts".

My guess is that the newer datasets have incompatible barcodes, or no barcodes of that kind at all?

Hope you can help.

Kind regards!

tiagochst commented 5 years ago

Hello,

I still need to find some time to work on the new projects. For the moment, expandBarcodeInfo only works with TCGA and TARGET.

I remember a user asked about this for CPTAC, but the GDC maintainers were not able to provide me with information about the barcode. I'll try to work on this soon.

Best regards, Tiago
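
To illustrate the limitation described above, here is a minimal base R sketch of guarding a barcode parser against IDs that do not follow the TCGA layout. The helper name looks_like_tcga_barcode is hypothetical and not part of TCGAbiolinks, the regular expression only checks the leading project/TSS/participant prefix, and TARGET barcodes are not covered. The non-TCGA IDs are taken from this thread.

```r
# Hypothetical helper (not a TCGAbiolinks function): TRUE when an ID starts
# with the TCGA "project-TSS-participant" prefix, e.g. "TCGA-02-0001".
looks_like_tcga_barcode <- function(ids) {
  !is.na(ids) & grepl("^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}", ids)
}

ids <- c("TCGA-02-0001-01C-01D-0182-01",  # classic TCGA aliquot barcode
         "C3N-00580-06",                  # CPTAC-3 sample ID from this thread
         NA,                              # pooled or missing sample
         "aq-BA2962D")                    # BEATAML1.0-COHORT aliquot ID

is_tcga   <- looks_like_tcga_barcode(ids)
parseable <- ids[is_tcga]    # safe to hand to a TCGA-style barcode parser
skipped   <- ids[!is_tcga]   # would need project-specific handling
```

A check like this only decides whether TCGA-style parsing applies; it says nothing about how the other projects encode sample information.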

tiagochst commented 5 years ago

I am still working on it, but I think I have made some progress: http://rpubs.com/tiagochst/544330

brdfrdd commented 5 years ago

That looks great! It seems that getResults() now gets the data out. When do you think you can deploy it?

tiagochst commented 5 years ago

It might take some time to push to Bioconductor, since I have to review the documentation and run more tests for more projects/platforms. For the moment, if you want, you can install the GitHub version.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

gamerino commented 5 years ago

Hello! I'm also trying to use CPTAC-3 data from GDC. I have installed the current GitHub version of TCGAbiolinks and downloaded the gene expression quantification files (HTSeq - Counts). But when I ran the GDCprepare function, I got this error:

library(TCGAbiolinks)
directory <- getwd()
projects <- TCGAbiolinks:::getGDCprojects()$project_id
projects <- projects[grepl('^CPTAC', projects, perl = TRUE)]
projects
# [1] "CPTAC-3"
query <- GDCquery(project = projects, data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts")
tryCatch(GDCdownload(query, method = "api", files.per.chunk = 20, directory = directory),
         error = function(e) GDCdownload(query, method = "client", directory = directory))

data <- GDCprepare(query, directory=getwd())
|====================================================|100%                      Completed after 45 s 
Starting to add information to samples
 => Add clinical information to samples
Error in df$submitter_id : object of type 'closure' is not subsettable

I ran GDCprepare line by line and found that the error occurs when getBarcodeInfo() is called, because the length of results$submitter_id differs from the length of samples$submitter_id. As a result, the assignment samples$submitter_id <- str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|")) %>% unlist %>% as.character fails. I think the problem is that the 19th element of samples$submitter_id does not match any of the submitter_id values, so str_extract_all() returns a zero-length character vector (character(0)) at that position, which is then dropped when unlist is applied.

samples$submitter_id <- str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|")) %>%
+         unlist %>% as.character
Error in set(x, j = name, value = value) : 
  Supplied 19 items to be assigned to 20 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

submitter_id
 [1] "C3L-01304" "C3L-01682" "C3N-01024" "C3L-00938" "C3N-00175" "11LU035"  
 [7] "C3N-00733" "C3N-01415" "C3L-01683" "C3L-01312" "C3N-02087" "C3L-00140"
[13] "C3N-01823" "C3N-00580" "C3L-00144" "C3N-01802" "C3N-00340" "C3L-00586"
[19] "C3N-01537"

samples$submitter_id
 [1] "C3N-00580-06"                   "C3N-02087-01"                  
 [3] "C3L-00586-03"                   "C3N-01823-02"                  
 [5] "C3N-01024-06"                   "C3L-01304-01"                  
 [7] "C3L-00938-01"                   "C3N-00733-06"                  
 [9] "C3N-01802-01"                   "C3L-01683-04"                  
[11] "C3L-00144-06"                   "C3N-01537-01"                  
[13] "C3N-01415-01"                   "C3N-00340-03"                  
[15] "C3N-00175-02"                   "C3L-00140-06"                  
[17] "C3L-01312-01"                   NA                              
[19] "2f2e5477-42a4-4906-a943-bf7f80" "C3L-01682-07"

str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|"))
[[1]]
[1] "C3N-00580"

[[2]]
[1] "C3N-02087"

[[3]]
[1] "C3L-00586"

[[4]]
[1] "C3N-01823"

[[5]]
[1] "C3N-01024"

[[6]]
[1] "C3L-01304"

[[7]]
[1] "C3L-00938"

[[8]]
[1] "C3N-00733"

[[9]]
[1] "C3N-01802"

[[10]]
[1] "C3L-01683"

[[11]]
[1] "C3L-00144"

[[12]]
[1] "C3N-01537"

[[13]]
[1] "C3N-01415"

[[14]]
[1] "C3N-00340"

[[15]]
[1] "C3N-00175"

[[16]]
[1] "C3L-00140"

[[17]]
[1] "C3L-01312"

[[18]]
[1] NA

[[19]]
character(0)

[[20]]
[1] "C3L-01682"

str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|")) %>%
+         unlist %>% as.character
 [1] "C3N-00580" "C3N-02087" "C3L-00586" "C3N-01823" "C3N-01024" "C3L-01304"
 [7] "C3L-00938" "C3N-00733" "C3N-01802" "C3L-01683" "C3L-00144" "C3N-01537"
[13] "C3N-01415" "C3N-00340" "C3N-00175" "C3L-00140" "C3L-01312" NA         
[19] "C3L-01682"
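
The length change seen above can be reproduced without any GDC data. The following is a minimal base R sketch, assuming a hand-built list that mimics the shape of the str_extract_all() result (one match, one NA, one empty match, one match); the variable names are illustrative only.

```r
# One list element per sample, as str_extract_all() would return them:
# a match, an NA input, a sample with no match at all, and another match.
hits <- list("C3L-01312", NA_character_, character(0), "C3L-01682")

# unlist() concatenates the elements, so the zero-length one contributes nothing.
flat <- unlist(hits)

n_samples <- length(hits)   # number of samples going in
n_ids     <- length(flat)   # number of IDs coming out
lost_one  <- n_ids < n_samples
```

Note that the NA element survives unlist(); only the character(0) element disappears, which is why 20 samples end up with 19 IDs.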

I fixed the error by converting this zero-length character vector to NA before unlist is applied, as shown here:

tryCatch({
      samples$submitter_id <- str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|"))
      ## added to avoid dropping samples$submitter_id entries that match none of the submitter_id values
      samples$submitter_id[do.call(c,lapply(samples$submitter_id, function(x){length(x)==0}))]=NA
      samples$submitter_id <- samples$submitter_id %>%
        unlist %>% as.character
    }, error = function(e){
      samples$submitter_id <- submitter_id
    })
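
An alternative way to keep one result per sample is to extract only the first match and fill everything else with NA. This is a base R sketch, not the code used in TCGAbiolinks; extract_first_match is a hypothetical helper, and it assumes the candidate IDs contain no regular-expression metacharacters (true for the IDs in this thread).

```r
# Hypothetical helper: for each element of x, return the first candidate ID
# found in it, or NA when x is NA or contains none of the candidates.
extract_first_match <- function(x, candidates) {
  pattern <- paste(candidates, collapse = "|")
  out <- rep(NA_character_, length(x))        # same length as the input
  hit <- !is.na(x) & grepl(pattern, x)
  out[hit] <- regmatches(x[hit], regexpr(pattern, x[hit]))
  out
}

candidates <- c("C3L-01304", "C3N-00580")
sample_ids <- c("C3N-00580-06", NA, "2f2e5477-42a4-4906-a943-bf7f80", "C3L-01304-01")

extracted <- extract_first_match(sample_ids, candidates)
```

Because the output vector is allocated up front, it always has the same length as the input, so assigning it back to a column cannot trigger the "19 items to be assigned to 20 items" error.
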

gamerino commented 5 years ago

Hello! I have been exploring my results on CPTAC-3 data and found that the row names and the sample column of the metadata matrix do not match, as shown below.

head(metaData[,1:4])
DataFrame with 6 rows and 4 columns
                                sample submitter_id
                              <factor>  <character>
C3N-00321-01              C3N-00321-01    C3N-00321
C3L-00088-02;C3L-00088-01 C3N-01261-03    C3N-01261
C3N-01023-06              C3L-00093-06    C3L-00093
C3N-01261-03              C3L-00583-06    C3L-00583
C3L-00093-06              C3N-00646-06    C3N-00646
C3L-00583-06              C3N-00831-01    C3N-00831

I have also found the same issue in your RPubs example.
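
A mismatch like this can be detected with a direct comparison of the row names against the sample column. The sketch below uses a plain data.frame with three rows copied from the output above in place of the real DataFrame; it only flags the problem and does not attempt to decide which of the two labels is correct.

```r
# Toy stand-in for the metadata, using row names and sample values
# taken from the head(metaData) output in this comment.
meta <- data.frame(
  sample       = c("C3N-00321-01", "C3L-00093-06", "C3L-00583-06"),
  submitter_id = c("C3N-00321", "C3L-00093", "C3L-00583"),
  row.names    = c("C3N-00321-01", "C3N-01023-06", "C3N-01261-03"),
  stringsAsFactors = FALSE
)

# TRUE where the row name disagrees with the sample recorded in that row.
mismatched <- rownames(meta) != meta$sample
bad_rows   <- which(mismatched)
```

Running such a check right after GDCprepare is a cheap way to notice that metadata rows were attached to the wrong samples.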

tiagochst commented 5 years ago

@gamerino It should be fixed now. Many thanks for reporting the problem.

Also, for the CPTAC-3 project, besides the pooled samples, for which the package does not add any metadata (I still need to think more about those cases), there is the sample below, which also seems to have some missing information (the sample name is not C3L-).

[Image: Screenshot at 10-45-16]

bradleybio commented 4 years ago

@tiagochst First of all, thank you for writing and maintaining such a useful and excellent package!

Like the other users above, I am eager to use TCGAbiolinks to access the new dataset 'BEATAML1.0-COHORT'. After installing the version on Github ('BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")'), I was able to successfully invoke getResults for that dataset. However, I discovered that currently it is not possible to link files to case IDs from the project. Such linkage is important in order to be able to identify matched files or link BAM files to associated clinical data. For example, the GDC Data Portal lists two BAM files for case ID 2558, which correspond to matched tumor-normal WXS data:

  1. File de2ec342-5202-4e80-aca0-8eece23e28fa_wxs_gdc_realn.bam (UUID 3d0840f8-f187-4ec4-834d-3cdd5dd5212f); aliquot aq-BA2962D
  2. File 866be858-0c5c-49ae-b2e2-1a2a5176030a_wxs_gdc_realn.bam (UUID 23b39ce7-5334-4850-8a36-8c7ce079d166); aliquot aq-BA2578D

When I invoke getResults, I see the two above files, but the column 'cases' holds the aliquot IDs specified above rather than '2558', which is what I expected based on my past experience with using TCGAbiolinks for accessing TCGA data. Currently there is no column holding the case ID, so that it isn't possible to know that the above two files are linked (but please let me know if I'm missing something!). Would it be possible to modify the 'cases' column to hold the case ID?

Please find the specific commands below.

> query = GDCquery (project = "BEATAML1.0-COHORT", data.category = "Sequencing Reads", experimental.strategy = "WXS")
> results = getResults (query) %>% as_tibble() %>% filter (data_format == "BAM")
> results %>% filter (file_name %in% c ("de2ec342-5202-4e80-aca0-8eece23e28fa_wxs_gdc_realn.bam", "866be858-0c5c-49ae-b2e2-1a2a5176030a_wxs_gdc_realn.bam")) %>% as.data.frame
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grDevices utils     datasets  stats     graphics  methods   base     

other attached packages:
[1] TCGAbiolinks_2.15.1 skimr_2.0.2         dplyr_0.8.3         tidyr_1.0.0         readr_1.3.1         tibble_2.1.3        magrittr_1.5       

loaded via a namespace (and not attached):
  [1] pkgcond_0.1.0               colorspace_1.4-1            selectr_0.4-2               ggsignif_0.6.0              hwriter_1.3.2               testextra_0.1.0            
  [7] XVector_0.24.0              GenomicRanges_1.36.1        base64enc_0.1-3             ggpubr_0.2.4                ggrepel_0.8.1               bit64_0.9-7                
 [13] AnnotationDbi_1.46.1        xml2_1.2.2                  codetools_0.2-16            splines_3.6.1               R.methodsS3_1.7.1           doParallel_1.0.15          
 [19] DESeq_1.36.0                geneplotter_1.62.0          knitr_1.26                  zeallot_0.1.0               jsonlite_1.6                Rsamtools_2.0.3            
 [25] km.ci_0.5-2                 broom_0.5.2                 annotate_1.62.0             R.oo_1.23.0                 compiler_3.6.1              httr_1.4.1                 
 [31] backports_1.1.5             assertthat_0.2.1            Matrix_1.2-18               lazyeval_0.2.2              limma_3.40.6                htmltools_0.4.0            
 [37] prettyunits_1.0.2           tools_3.6.1                 gtable_0.3.0                glue_1.3.1                  GenomeInfoDbData_1.2.1      ggthemes_4.2.0             
 [43] ShortRead_1.42.0            Rcpp_1.0.3                  Biobase_2.44.0              vctrs_0.2.0                 Biostrings_2.52.0           nlme_3.1-142               
 [49] rtracklayer_1.44.4          iterators_1.0.12            xfun_0.11                   stringr_1.4.0               testthat_2.3.0              rvest_0.3.5                
 [55] lifecycle_0.1.0             XML_3.98-1.20               edgeR_3.26.8                zoo_1.8-6                   postlogic_0.1.0             zlibbioc_1.30.0            
 [61] scales_1.1.0                aroma.light_3.14.0          hms_0.5.2                   parallel_3.6.1              SummarizedExperiment_1.14.1 RColorBrewer_1.1-2         
 [67] memoise_1.1.0               gridExtra_2.3               KMsurv_0.1-5                ggplot2_3.2.1               downloader_0.4              biomaRt_2.40.5             
 [73] latticeExtra_0.6-28         stringi_1.4.3               RSQLite_2.1.2               genefilter_1.66.0           S4Vectors_0.22.1            foreach_1.4.7              
 [79] GenomicFeatures_1.36.4      BiocGenerics_0.30.0         BiocParallel_1.18.1         repr_1.0.1                  GenomeInfoDb_1.20.0         rlang_0.4.2                
 [85] pkgconfig_2.0.3             matrixStats_0.55.0          bitops_1.0-6                lattice_0.20-38             purrr_0.3.3                 GenomicAlignments_1.20.1   
 [91] bit_1.1-14                  tidyselect_0.2.5            plyr_1.8.4                  R6_2.4.1                    IRanges_2.18.3              generics_0.0.2             
 [97] DelayedArray_0.10.0         DBI_1.0.0                   mgcv_1.8-31                 pillar_1.4.2                survival_3.1-7              RCurl_1.95-4.12            
[103] EDASeq_2.18.0               crayon_1.3.4                purrrogress_0.1.1           survMisc_0.5.5              progress_1.2.2              locfit_1.5-9.1             
[109] grid_3.6.1                  sva_3.32.1                  data.table_1.12.6           blob_1.2.0                  digest_0.6.23               xtable_1.8-4               
[115] R.utils_2.9.0               stats4_3.6.1                munsell_0.5.0               survminer_0.4.6             parsetools_0.1.1

tiagochst commented 4 years ago

@bradleybio Thanks for the suggestion. I just added the case submitter ID to the query results: http://rpubs.com/tiagochst/TCGAbiolinks_case_submitterID
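
Once a case identifier is available next to each file, matched files can be grouped with base R. This is only a sketch on a hand-built table: the column name case_id is an assumption (the actual column name in the query results may differ), the first two rows use the file names, aliquots and case ID quoted above, and the third row is entirely hypothetical.

```r
# Toy stand-in for getResults(query); "case_id" is an assumed column name.
results <- data.frame(
  file_name = c("de2ec342-5202-4e80-aca0-8eece23e28fa_wxs_gdc_realn.bam",
                "866be858-0c5c-49ae-b2e2-1a2a5176030a_wxs_gdc_realn.bam",
                "hypothetical_unmatched_file.bam"),      # made-up row
  cases     = c("aq-BA2962D", "aq-BA2578D", "aq-HYPOTHETICAL"),
  case_id   = c("2558", "2558", "0000"),
  stringsAsFactors = FALSE
)

# Group file names by case, then keep only cases with more than one file,
# for example a matched tumor-normal pair.
files_by_case <- split(results$file_name, results$case_id)
matched       <- files_by_case[lengths(files_by_case) > 1]
```

With the aliquot IDs alone this grouping is not possible, which is the reason for requesting the case ID in the first place.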

bradleybio commented 4 years ago

@tiago Thank you for the amazingly fast and helpful addition! This is perfect.

lay-lei commented 4 years ago

Hello! First of all, thank you for solving the problem. I have installed the current GitHub version of TCGAbiolinks, and I could download the gene expression quantification files (HTSeq - Counts) from NCICCR-DLBCL. But when I tried to obtain the same files (HTSeq - Counts) from BEATAML1.0-COHORT, I ran into trouble: when I ran the GDCprepare function, I got this error.

> TCGAbiolinks:::getProjectSummary("BEATAML1.0-COHORT")
> query <- GDCquery(project = "BEATAML1.0-COHORT", 
                 legacy = FALSE, 
                 experimental.strategy = "RNA-Seq", 
                 data.category = "Transcriptome Profiling", 
                 data.type = "Gene Expression Quantification", 
                 workflow.type = "HTSeq - Counts")

> GDCdownload(query)
> data_1 <- GDCprepare(query)

> GDCdownload(query)
Downloading data for project BEATAML1.0-COHORT
Of the 510 files for download 510 already exist.
All samples have been already downloaded
> data_1 <- GDCprepare(query)
|====================================================|100%                      Completed after 28 s
Starting to add information to samples
 => Add clinical information to samples

Error: Argument 2 must be length 2, not 1
In addition: Warning message:
In `[<-.data.table`(x, j = name, value = value) :
  Supplied 19 items to be assigned to 20 items of column 'submitter_id' (recycled leaving remainder of 1 items).