Closed brdfrdd closed 2 years ago
Hello,
I still need to find some time to work on the new projects. For the moment, expandBarcodeInfo only works with TCGA and TARGET.
I remember a user asking for CPTAC support, but the GDC maintainers were not able to provide me with information about the barcode. I'll try to work on this soon.
Best regards, Tiago
I am still working on it, but I think I made some progress: http://rpubs.com/tiagochst/544330
That looks great! It seems getResults() now gets the data out. When do you think you can deploy it?
It might take some time to push to Bioconductor, since I have to review the documentation and run more tests on more projects/platforms. For the moment, if you want, you can install the GitHub version.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
Hello! I'm also trying to use CPTAC-3 data from GDC. I have installed the current GitHub version of TCGAbiolinks and downloaded the gene expression quantification files (HTSeq - Counts). But when I ran the GDCprepare function, I got this error.
library(TCGAbiolinks)
directory=getwd()
projects <- TCGAbiolinks:::getGDCprojects()$project_id
projects <- projects[grepl('^CPTAC',projects,perl=T)]
projects
# [1] "CPTAC-3"
query <- GDCquery(project = projects, data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts")
tryCatch(GDCdownload(query, method = "api", files.per.chunk = 20, directory = directory),
         error = function(e) GDCdownload(query, method = "client", directory = directory))
data <- GDCprepare(query, directory=getwd())
|====================================================|100% Completed after 45 s
Starting to add information to samples
=> Add clinical information to samples
Error in df$submitter_id : object of type 'closure' is not subsettable
I ran GDCprepare line by line and found that the error is produced when the getBarcodeInfo() function is called, because the length of results$submitter_id differs from the length of samples$submitter_id. As a result, the assignment samples$submitter_id <- str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|")) %>% unlist %>% as.character fails. I think the problem is that the 19th element of samples$submitter_id does not match any of the submitter_id values, so str_extract_all() produces a character vector of length 0 at that position, which is then dropped when unlist is applied.
samples$submitter_id <- str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|")) %>%
+ unlist %>% as.character
Error in set(x, j = name, value = value) :
Supplied 19 items to be assigned to 20 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
submitter_id
[1] "C3L-01304" "C3L-01682" "C3N-01024" "C3L-00938" "C3N-00175" "11LU035"
[7] "C3N-00733" "C3N-01415" "C3L-01683" "C3L-01312" "C3N-02087" "C3L-00140"
[13] "C3N-01823" "C3N-00580" "C3L-00144" "C3N-01802" "C3N-00340" "C3L-00586"
[19] "C3N-01537"
samples$submitter_id
[1] "C3N-00580-06" "C3N-02087-01"
[3] "C3L-00586-03" "C3N-01823-02"
[5] "C3N-01024-06" "C3L-01304-01"
[7] "C3L-00938-01" "C3N-00733-06"
[9] "C3N-01802-01" "C3L-01683-04"
[11] "C3L-00144-06" "C3N-01537-01"
[13] "C3N-01415-01" "C3N-00340-03"
[15] "C3N-00175-02" "C3L-00140-06"
[17] "C3L-01312-01" NA
[19] "2f2e5477-42a4-4906-a943-bf7f80" "C3L-01682-07"
str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|"))
[[1]]
[1] "C3N-00580"
[[2]]
[1] "C3N-02087"
[[3]]
[1] "C3L-00586"
[[4]]
[1] "C3N-01823"
[[5]]
[1] "C3N-01024"
[[6]]
[1] "C3L-01304"
[[7]]
[1] "C3L-00938"
[[8]]
[1] "C3N-00733"
[[9]]
[1] "C3N-01802"
[[10]]
[1] "C3L-01683"
[[11]]
[1] "C3L-00144"
[[12]]
[1] "C3N-01537"
[[13]]
[1] "C3N-01415"
[[14]]
[1] "C3N-00340"
[[15]]
[1] "C3N-00175"
[[16]]
[1] "C3L-00140"
[[17]]
[1] "C3L-01312"
[[18]]
[1] NA
[[19]]
character(0)
[[20]]
[1] "C3L-01682"
str_extract_all(samples$submitter_id, paste(submitter_id,collapse = "|")) %>%
+ unlist %>% as.character
[1] "C3N-00580" "C3N-02087" "C3L-00586" "C3N-01823" "C3N-01024" "C3L-01304"
[7] "C3L-00938" "C3N-00733" "C3N-01802" "C3L-01683" "C3L-00144" "C3N-01537"
[13] "C3N-01415" "C3N-00340" "C3N-00175" "C3L-00140" "C3L-01312" NA
[19] "C3L-01682"
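The dropped element can be reproduced in isolation. The following is a minimal sketch in base R, using a hand-written list as a stand-in for the output of str_extract_all(); the values are illustrative and not taken from the real query.

```r
# Toy stand-in for the list returned by str_extract_all(): one element per
# sample, with character(0) where nothing matched (illustrative values).
matches <- list("C3N-00580", NA_character_, character(0), "C3L-01682")

# unlist() silently drops zero-length elements, so the flattened vector is
# shorter than the number of samples and no longer lines up with them.
flattened <- unlist(matches)

length(matches)    # number of samples
length(flattened)  # one fewer, because the empty match disappeared
```

Assigning a vector like flattened back into a column that still has one row per sample is what triggers the length-mismatch error shown above.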
I fixed the error by converting this zero-length character vector to NA, so that it is kept when unlist is applied, as shown here:
tryCatch({
    samples$submitter_id <- str_extract_all(samples$submitter_id, paste(submitter_id, collapse = "|"))
    ## Added so that samples$submitter_id entries that match none of the
    ## submitter_id values are kept as NA instead of being removed
    samples$submitter_id[lengths(samples$submitter_id) == 0] <- NA
    samples$submitter_id <- samples$submitter_id %>%
        unlist %>% as.character
}, error = function(e) {
    samples$submitter_id <- submitter_id
})
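An alternative sketch of the same idea, assuming only the first match per sample is needed: extract with regexpr() so that the result always has exactly one entry per input. Here extract_first is a hypothetical helper, not part of TCGAbiolinks, and base R is used instead of stringr so the snippet runs without extra packages.

```r
# Hypothetical helper (not a TCGAbiolinks function): return the first match of
# `pattern` in each element of `x`, or NA where there is no match.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > 0          # positions with an actual match
  out[hit] <- regmatches(x, m)      # regmatches() returns matched entries only
  out
}

# Illustrative subset of the IDs from the report above.
submitter_id <- c("C3L-01304", "C3L-01682", "C3N-01537")
sample_ids <- c("C3L-01304-01", NA, "2f2e5477-42a4-4906-a943-bf7f80", "C3L-01682-07")

result <- extract_first(sample_ids, paste(submitter_id, collapse = "|"))
length(result) == length(sample_ids)
```

Because the output is pre-filled with NA and only the matching positions are overwritten, unmatched and missing IDs stay in place and the vector can be assigned back to the column safely.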
Hello!
I have been exploring my results on CPTAC-3 data and I found that the rownames and the sample column of the metadata matrix do not match, as shown below.
head(metaData[,1:4])
DataFrame with 6 rows and 4 columns
                                sample submitter_id
                              <factor>  <character>
C3N-00321-01              C3N-00321-01    C3N-00321
C3L-00088-02;C3L-00088-01 C3N-01261-03    C3N-01261
C3N-01023-06              C3L-00093-06    C3L-00093
C3N-01261-03              C3L-00583-06    C3L-00583
C3L-00093-06              C3N-00646-06    C3N-00646
C3L-00583-06              C3N-00831-01    C3N-00831
I have also found the same issue in your RPubs example
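A quick way to locate such rows is to compare the row names against the sample column directly. This is a minimal sketch that uses a plain data.frame built from the first three rows shown above; the real object is a DataFrame, so treating it the same way is an assumption.

```r
# First three rows from the output above, rebuilt as a plain data.frame.
meta <- data.frame(
  sample = c("C3N-00321-01", "C3N-01261-03", "C3L-00093-06"),
  submitter_id = c("C3N-00321", "C3N-01261", "C3L-00093"),
  row.names = c("C3N-00321-01", "C3L-00088-02;C3L-00088-01", "C3N-01023-06"),
  stringsAsFactors = FALSE
)

# TRUE wherever the row name disagrees with the sample column.
mismatched <- rownames(meta) != meta$sample

which(mismatched)                         # row positions that are out of sync
meta[mismatched, "sample", drop = FALSE]  # the affected rows
```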
@gamerino It should be fixed now. Many thanks for reporting the problem.
Also, for the CPTAC-3 project, besides the pooled samples, for which the package does not add any metadata (I still need to think more about those cases), there is the sample below, which also seems to have some missing information (the sample name is not C3L-)
@tiagochst First of all, thank you for writing and maintaining such a useful and excellent package!
Like the other users above, I am eager to use TCGAbiolinks to access the new dataset 'BEATAML1.0-COHORT'. After installing the version on Github ('BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")'), I was able to successfully invoke getResults for that dataset. However, I discovered that currently it is not possible to link files to case IDs from the project. Such linkage is important in order to be able to identify matched files or link BAM files to associated clinical data. For example, the GDC Data Portal lists two BAM files for case ID 2558, which correspond to matched tumor-normal WXS data:
Please find the specific commands below.
> query = GDCquery (project = "BEATAML1.0-COHORT", data.category = "Sequencing Reads", experimental.strategy = "WXS")
> results = getResults (query) %>% as_tibble() %>% filter (data_format == "BAM")
> results %>% filter (file_name %in% c ("de2ec342-5202-4e80-aca0-8eece23e28fa_wxs_gdc_realn.bam", "866be858-0c5c-49ae-b2e2-1a2a5176030a_wxs_gdc_realn.bam")) %>% as.data.frame
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grDevices utils datasets stats graphics methods base
other attached packages:
[1] TCGAbiolinks_2.15.1 skimr_2.0.2 dplyr_0.8.3 tidyr_1.0.0 readr_1.3.1 tibble_2.1.3 magrittr_1.5
loaded via a namespace (and not attached):
[1] pkgcond_0.1.0 colorspace_1.4-1 selectr_0.4-2 ggsignif_0.6.0 hwriter_1.3.2 testextra_0.1.0
[7] XVector_0.24.0 GenomicRanges_1.36.1 base64enc_0.1-3 ggpubr_0.2.4 ggrepel_0.8.1 bit64_0.9-7
[13] AnnotationDbi_1.46.1 xml2_1.2.2 codetools_0.2-16 splines_3.6.1 R.methodsS3_1.7.1 doParallel_1.0.15
[19] DESeq_1.36.0 geneplotter_1.62.0 knitr_1.26 zeallot_0.1.0 jsonlite_1.6 Rsamtools_2.0.3
[25] km.ci_0.5-2 broom_0.5.2 annotate_1.62.0 R.oo_1.23.0 compiler_3.6.1 httr_1.4.1
[31] backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18 lazyeval_0.2.2 limma_3.40.6 htmltools_0.4.0
[37] prettyunits_1.0.2 tools_3.6.1 gtable_0.3.0 glue_1.3.1 GenomeInfoDbData_1.2.1 ggthemes_4.2.0
[43] ShortRead_1.42.0 Rcpp_1.0.3 Biobase_2.44.0 vctrs_0.2.0 Biostrings_2.52.0 nlme_3.1-142
[49] rtracklayer_1.44.4 iterators_1.0.12 xfun_0.11 stringr_1.4.0 testthat_2.3.0 rvest_0.3.5
[55] lifecycle_0.1.0 XML_3.98-1.20 edgeR_3.26.8 zoo_1.8-6 postlogic_0.1.0 zlibbioc_1.30.0
[61] scales_1.1.0 aroma.light_3.14.0 hms_0.5.2 parallel_3.6.1 SummarizedExperiment_1.14.1 RColorBrewer_1.1-2
[67] memoise_1.1.0 gridExtra_2.3 KMsurv_0.1-5 ggplot2_3.2.1 downloader_0.4 biomaRt_2.40.5
[73] latticeExtra_0.6-28 stringi_1.4.3 RSQLite_2.1.2 genefilter_1.66.0 S4Vectors_0.22.1 foreach_1.4.7
[79] GenomicFeatures_1.36.4 BiocGenerics_0.30.0 BiocParallel_1.18.1 repr_1.0.1 GenomeInfoDb_1.20.0 rlang_0.4.2
[85] pkgconfig_2.0.3 matrixStats_0.55.0 bitops_1.0-6 lattice_0.20-38 purrr_0.3.3 GenomicAlignments_1.20.1
[91] bit_1.1-14 tidyselect_0.2.5 plyr_1.8.4 R6_2.4.1 IRanges_2.18.3 generics_0.0.2
[97] DelayedArray_0.10.0 DBI_1.0.0 mgcv_1.8-31 pillar_1.4.2 survival_3.1-7 RCurl_1.95-4.12
[103] EDASeq_2.18.0 crayon_1.3.4 purrrogress_0.1.1 survMisc_0.5.5 progress_1.2.2 locfit_1.5-9.1
[109] grid_3.6.1 sva_3.32.1 data.table_1.12.6 blob_1.2.0 digest_0.6.23 xtable_1.8-4
[115] R.utils_2.9.0 stats4_3.6.1 munsell_0.5.0 survminer_0.4.6 parsetools_0.1.1
@bradleybio Thanks for the suggestion. I just added the case submitter ID to the query results: http://rpubs.com/tiagochst/TCGAbiolinks_case_submitterID
@tiago Thank you for the amazingly fast and helpful addition! This is perfect.
Hello! First of all, thank you for solving the problem. I have installed the current GitHub version of TCGAbiolinks, and I could download the gene expression quantification files (HTSeq - Counts) from NCICCR-DLBCL. But when I tried to obtain the gene expression quantification files (HTSeq - Counts) from BEATAML1.0-COHORT, I ran into trouble: when I ran the GDCprepare function, I got this error.
> TCGAbiolinks:::getProjectSummary("BEATAML1.0-COHORT")
> query <- GDCquery(project = "BEATAML1.0-COHORT",
legacy = FALSE,
experimental.strategy = "RNA-Seq",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
> GDCdownload(query)
> data_1 <- GDCprepare(query)
> GDCdownload(query)
Downloading data for project BEATAML1.0-COHORT
Of the 510 files for download 510 already exist.
All samples have been already downloaded
> data_1 <- GDCprepare(query)
|====================================================|100% Completed after 28 s
Starting to add information to samples
=> Add clinical information to samples
Error: Argument 2 must be length 2, not 1
In addition: Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 19 items to be assigned to 20 items of column 'submitter_id' (recycled leaving remainder of 1 items).
Hi there,
I am trying to download the more recent datasets: CGCI-BLGSP, CPTAC-3, BEATAML1.0-COHORT, CTSP-DLBCL1, MMRF-COMMPASS and NCICCR-DLBCL.
All of these give the error message: "Error in expandBarcodeInfo(barcodes) : object 'ret' not found"
The query was for data.category="Transcriptome Profiling", data.type="Gene Expression Quantification" and workflow.type="HTSeq - Counts".
My guess is that the newer datasets have incompatible barcodes, or none of that kind?
Hope you can help.
Kind regards!
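For readers wondering how a message like "object 'ret' not found" can come about: one common pattern is a function that assigns its result only inside project-specific branches. The sketch below is purely hypothetical (expand_sketch is made up for illustration and is not the TCGAbiolinks source); it only shows the general mechanism under the assumption that TCGA and TARGET are the only handled cases.

```r
# Hypothetical illustration, NOT the real expandBarcodeInfo(): `ret` is only
# assigned when the barcodes belong to one of the handled projects.
expand_sketch <- function(barcodes) {
  if (all(startsWith(barcodes, "TCGA"))) {
    ret <- data.frame(barcode = barcodes, project = "TCGA")
  } else if (all(startsWith(barcodes, "TARGET"))) {
    ret <- data.frame(barcode = barcodes, project = "TARGET")
  }
  ret  # never assigned when no branch matched
}

# A TCGA-style barcode is handled.
ok <- expand_sketch("TCGA-02-0001")

# A CPTAC-style ID falls through both branches, so `ret` does not exist.
msg <- tryCatch(
  expand_sketch("C3L-01304-01"),
  error = function(e) conditionMessage(e)
)
msg
```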