GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
189 stars 24 forks source link

discovery and quant issue #337

Closed seongwoohan closed 1 year ago

seongwoohan commented 1 year ago

Hello bambu developers,

I am an avid bambu user and having a hard time running discovery = FALSE, quant = FALSE on my side. I tried re-starting bambu and re-running everything from the beginning as recommended, but unable to get the outputs I want. Here are the command lines I used below and their corresponding outputs. I added the sessionInfo() output on the bottom and the commands that generated bam file from the fastq file. I have bambu 3.0.2 and R 4.2.1.

test.bam <- system.file("extdata", "ENCFF563QZR.bam", package = "bambu")
fa.file <- system.file("extdata", "hg38.fa", package = "bambu")
gtf.file <- system.file("extdata", "gencode.v42.annotation.gtf", package = "bambu")

bambuAnnotations <- prepareAnnotations(gtf.file)

se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file, discovery = FALSE, quant = FALSE)[[1]]

image

test.bam <- system.file("extdata", "ENCFF563QZR.bam", package = "bambu")
fa.file <- system.file("extdata", "hg38.fa", package = "bambu")
gtf.file <- system.file("extdata", "gencode.v42.annotation.gtf", package = "bambu")

bambuAnnotations <- prepareAnnotations(gtf.file)

se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file, discovery = FALSE, quant = FALSE,  verbose=TRUE)[[1]]

image

sessionInfo() R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base

other attached packages: [1] bambu_3.0.2 BSgenome_1.66.1 [3] rtracklayer_1.58.0 Biostrings_2.66.0 [5] XVector_0.38.0 SummarizedExperiment_1.28.0 [7] Biobase_2.58.0 GenomicRanges_1.50.1 [9] GenomeInfoDb_1.34.3 IRanges_2.32.0 [11] S4Vectors_0.36.0 BiocGenerics_0.44.0 [13] MatrixGenerics_1.10.0 matrixStats_0.63.0

loaded via a namespace (and not attached): [1] Rcpp_1.0.9 lattice_0.20-45 tidyr_1.2.1 [4] prettyunits_1.1.1 png_0.1-7 Rsamtools_2.14.0 [7] assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 [10] BiocFileCache_2.6.0 R6_2.5.1 RSQLite_2.2.19 [13] httr_1.4.4 pillar_1.8.1 zlibbioc_1.44.0 [16] rlang_1.0.6 GenomicFeatures_1.50.2 progress_1.2.2 [19] curl_4.3.3 data.table_1.14.6 blob_1.2.3 [22] Matrix_1.4-1 BiocParallel_1.32.3 stringr_1.4.1 [25] RCurl_1.98-1.9 bit_4.0.5 biomaRt_2.54.0 [28] DelayedArray_0.24.0 compiler_4.2.1 pkgconfig_2.0.3 [31] tidyselect_1.2.0 KEGGREST_1.38.0 tibble_3.1.8 [34] GenomeInfoDbData_1.2.9 codetools_0.2-18 XML_3.99-0.12 [37] fansi_1.0.3 withr_2.5.0 crayon_1.5.2 [40] dplyr_1.0.10 dbplyr_2.2.1 rappdirs_0.3.3 [43] GenomicAlignments_1.34.0 bitops_1.0-7 grid_4.2.1 [46] jsonlite_1.8.3 lifecycle_1.0.3 DBI_1.1.3 [49] magrittr_2.0.3 cli_3.4.1 stringi_1.7.8 [52] cachem_1.0.6 xml2_1.3.3 filelock_1.0.2 [55] ellipsis_0.3.2 vctrs_0.5.1 generics_0.1.3 [58] xgboost_1.6.0.1 rjson_0.2.21 restfulr_0.0.15 [61] tools_4.2.1 bit64_4.0.5 glue_1.6.2 [64] purrr_0.3.5 hms_1.1.2 parallel_4.2.1 [67] fastmap_1.1.0 yaml_2.3.6 AnnotationDbi_1.60.0 [70] BiocManager_1.30.19 memoise_2.0.1 BiocIO_1.8.0

./minimap2 -t 8 -ax splice /data/long_read/paper_ref/hg38.fa /data/long_read/lr_consoritum/pcb/ENCFF563QZR.fq -o /data/long_read/lr_consoritum/pcb/ENCFF563QZR.sam

samtools view -@ 8 -Sb -o /data/long_read/lr_consoritum/pcb/ENCFF563QZR.bam /data/long_read/lr_consoritum/pcb/ENCFF2ENCFF563QZR45IPA.sam
andredsim commented 1 year ago

Hi Seong,

Judging from the screenshot it looks like it ran. Can you share what the se object looks like? That should be the read class file.

Kind Regards, Andre Sim

seongwoohan commented 1 year ago

Hi Andre,

You are right. Let me provide the output I got from the following code.

test.bam <- system.file("extdata", "ENCFF563QZR.bam", package = "bambu")
fa.file <- system.file("extdata", "hg38.fa", package = "bambu")
gtf.file <- system.file("extdata", "gencode.v42.annotation.gtf", package = "bambu")

bambuAnnotations <- prepareAnnotations(gtf.file)

se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file, discovery = FALSE, quant = FALSE,  verbose=TRUE)

se [[1]] class: RangedSummarizedExperiment dim: 287609 1 metadata(1): warnings assays(1): counts rownames(287609): rc.1 rc.2 ... rcunsplicedNew.9998 rcunsplicedNew.9999 rowData names(23): chr.rc strand.rc ... txScore.noFit txScore colnames(1): ENCFF563QZR colData names(1): name

If I do rowData(se[[1]]), I get the information as I want! image

However, I am having a trouble saving this document to the directory. Could you help me to save this document as any file format such as tsv or csv, so I can work with python pandas?

image

Another one, if possible, I would like to extract information from constructedAnnotations = se[assays(se)$fullLengthCounts > 0], which is the gtf file that only bambu makes. Is it possible to extract the read numbers per junction file just for the gtf that bambu makes? I highlighted on the above screenshot.

Thanks a lot for helping me all the time :)

Best, Seong

andredsim commented 1 year ago

Hi Seong,

To output the read class file table you can simply just convert it to a matrix and use the R native write.table() function

write.table(as.matrix(rowData(se[[1]])), file = "test.tsv", sep = "\t")

The reason you can't extract the full length counts is you have an incomplete se since you didn't run quant and discovery. Note that under assays fullLengthCounts isn't there. However in the rowData(se) table there is a readCount column. This represents the full-length read counts for each read class.

By the way, the reason your saveRDS lines are coming up with an error, is that you should change the "path" argument to "file". An rds will not be easily readable in pandas however so it won't serve your purpose.

Hope this helps.