GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
190 stars 24 forks source link

read assignment information #445

Open sparthib opened 2 months ago

sparthib commented 2 months ago

Hi there,

I am using bambu 3.4.0 for quantification. I am curious about the read to transcript assignment, so I used the function mentioned in the document metadata(se)$readToTranscriptMaps[[1]], but I don't find the readID column while the other columns are present. I tried the multi-sample approach where first, I generated the read class with trackReads = TRUE

se_read_class <- bambu(reads = paste0(bam_dir),
                     annotations = annotation,
                     genome = fa.file,
                     rcOutDir = output_dir,
                     lowMemory = TRUE,
                     trackReads = TRUE)

and then, performed quantification from the intermediate files.

se <- bambu(reads = rds_files,
              annotations = annotation,
              genome = fa.file, 
            trackReads = TRUE)

Would be great to have more insights on this.

Thanks,

Sowmya

andredsim commented 2 months ago

Hi Sowmya,

Could you post the output of metadata(se)$readToTranscriptMaps[[1]]?

If you read in the rds_files (with readRDS) does the metadata contain the readNames and readId like in the example below that I generated from the test data? image

Does your bam file still have read names, some archives and tools remove the read names in order to reduce the size of the file?

From looking at your inputs, it looks correct to me (note in the first block, if your goal is only to produce the read classes, you can set discovery & quant = FALSE to skip these steps)

*I also want to add in case you were not aware. The readToTranscriptMaps are not informative to how quantification works. Unlike other quantification tools which assign reads to transcripts to generate counts, bambu uses an expectation maximization algorithm to assign the majority of reads, so direct read to transcript assignments for all reads will not exist. There are other uses for the read to transcript map though so I would be happy to continue to troubleshoot if that is the case.

Kind Regards, Andre Sim

sparthib commented 2 months ago

Hi Andre, thank you for super quick response,

My bam file contains the read IDs, so I don't think that's the issue.

This is what my metadata looks like for the final output

  equalMatches compatibleMatches
1                               
2                               
3                               
4                               
5                               
6 

Here's a snippet of my intermediate read class file

> intermediate
class: RangedSummarizedExperiment 
dim: 4530587 1 
metadata(3): warnings readNames readId
assays(1): counts
rownames(4530587): rc.1 rc.2 ... rc.4530586 rc.4530587
rowData names(23): chr.rc strand.rc ... txScore.noFit txScore
colnames(1): sample_name
colData names(1): name

readNames contains the list of IDs in my bam file.

If a read is partially assigned to multiple isoforms using EM, what would be the usecase for readToTranscript mapping if I may ask? Also, by other tools, do you mean tools that use the reference transcriptome instead of genome?

Best, Sowmya

andredsim commented 2 months ago

Hi Sowmya,

The intermediate file looks fine to me too, so seeing that final output is very strange. Are you able to run bambu using the test data with trackReads on to see if that works? To solve this I may need you to send me the intermediate file so that I can see if I can replicate this on my end. Would this be possible?

Regarding your final question. The reason we implemented the trackReads option in the first place was we were entering the LRGASP evaluation which required a read to transcript map as one of the submission requirements. Despite it not being useful for quantification we realized there were use cases where it was useful such as identifying the reads that are assigned to novel transcripts to then extract them and do single molecule analysis such as looking for RNA modifications. By other tools I mean tools such as (but not limited to) IsoQuant and FLAIR which also happen to be genome alignment based tools.

Kind Regards, Andre Sim

sparthib commented 1 month ago

hi Andre,

Thanks for your answer. I am re-running bambu, and I see a new warning I haven't see before and the read classes end up not getting produced.

Here's my code:

read_class <- bambu(reads = paste0(bam_dir),
                     annotations = annotation,
                     genome = fa.file,
                     lowMemory = TRUE, 
                     discovery = FALSE, 
                     quant = FALSE)
saveRDS(read_class, paste0(output_dir, "read_class.rds"))

Here's the warning

[10:02:31] WARNING: src/learner.cc:553: 
  If you are loading a serialized model (like pickle in Python, RDS in R) generated by
  older XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for more details about differences between saving model and serializing.

Here's my session info:

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 Patched (2024-02-08 r85876)
 os       Rocky Linux 9.2 (Blue Onyx)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       US/Eastern
 date     2024-06-01
─ Packages ───────────────────────────────────────────────────────────────────
 package              * version     date (UTC) lib source
 abind                  1.4-5       2016-07-21 [2] CRAN (R 4.3.2)
 AnnotationDbi          1.64.1      2023-11-03 [2] Bioconductor
 bambu                * 3.4.0       2023-10-24 [1] Bioconductor
 Biobase              * 2.62.0      2023-10-24 [2] Bioconductor
 BiocFileCache        * 2.10.1      2023-10-26 [2] Bioconductor
 BiocGenerics         * 0.48.1      2023-11-01 [2] Bioconductor
 BiocIO               * 1.12.0      2023-10-24 [2] Bioconductor
 BiocManager            1.30.22     2023-08-08 [2] CRAN (R 4.3.2)
 BiocParallel           1.36.0      2023-10-24 [2] Bioconductor
 biomaRt                2.58.2      2024-01-30 [2] Bioconductor 3.18 (R 4.3.2)
 Biostrings           * 2.70.2      2024-01-28 [2] Bioconductor 3.18 (R 4.3.2)
 bit                    4.0.5       2022-11-15 [2] CRAN (R 4.3.2)
 bit64                  4.0.5       2020-08-30 [2] CRAN (R 4.3.2)
 bitops                 1.0-7       2021-04-24 [2] CRAN (R 4.3.2)
 blob                   1.2.4       2023-03-17 [2] CRAN (R 4.3.2)
 BSgenome             * 1.70.1      2023-11-01 [2] Bioconductor
 cachem                 1.0.8       2023-05-01 [2] CRAN (R 4.3.2)
 cli                    3.6.2       2023-12-11 [2] CRAN (R 4.3.2)
 codetools              0.2-19      2023-02-01 [3] CRAN (R 4.3.2)
 crayon                 1.5.2       2022-09-29 [2] CRAN (R 4.3.2)
 curl                   5.2.0       2023-12-08 [2] CRAN (R 4.3.2)
 data.table             1.15.0      2024-01-30 [2] CRAN (R 4.3.2)
 DBI                    1.2.1       2024-01-12 [2] CRAN (R 4.3.2)
 dbplyr               * 2.4.0       2023-10-26 [2] CRAN (R 4.3.2)
 DelayedArray           0.28.0      2023-10-24 [2] Bioconductor
 digest                 0.6.34      2024-01-11 [2] CRAN (R 4.3.2)
 dplyr                * 1.1.4       2023-11-17 [2] CRAN (R 4.3.2)
 fansi                  1.0.6       2023-12-08 [2] CRAN (R 4.3.2)
 fastmap                1.1.1       2023-02-24 [2] CRAN (R 4.3.2)
 filelock               1.0.3       2023-12-11 [2] CRAN (R 4.3.2)
 generics               0.1.3       2022-07-05 [2] CRAN (R 4.3.2)
 GenomeInfoDb         * 1.38.5      2023-12-28 [2] Bioconductor 3.18 (R 4.3.2)
 GenomeInfoDbData       1.2.11      2024-02-09 [2] Bioconductor
 GenomicAlignments      1.38.2      2024-01-16 [2] Bioconductor 3.18 (R 4.3.2)
 GenomicFeatures        1.54.3      2024-01-31 [2] Bioconductor 3.18 (R 4.3.2)
 GenomicRanges        * 1.54.1      2023-10-29 [2] Bioconductor
 glue                   1.7.0       2024-01-09 [2] CRAN (R 4.3.2)
 hms                    1.1.3       2023-03-21 [2] CRAN (R 4.3.2)
 httr                   1.4.7       2023-08-15 [2] CRAN (R 4.3.2)
 IRanges              * 2.36.0      2023-10-24 [2] Bioconductor
 jsonlite               1.8.8       2023-12-04 [2] CRAN (R 4.3.2)
 KEGGREST               1.42.0      2023-10-24 [2] Bioconductor
 lattice                0.22-5      2023-10-24 [3] CRAN (R 4.3.2)
 lifecycle              1.0.4       2023-11-07 [2] CRAN (R 4.3.2)
 magrittr               2.0.3       2022-03-30 [2] CRAN (R 4.3.2)
 Matrix                 1.6-5       2024-01-11 [3] CRAN (R 4.3.2)
 MatrixGenerics       * 1.14.0      2023-10-24 [2] Bioconductor
 matrixStats          * 1.2.0       2023-12-11 [2] CRAN (R 4.3.2)
 memoise                2.0.1       2021-11-26 [2] CRAN (R 4.3.2)
 pillar                 1.9.0       2023-03-22 [2] CRAN (R 4.3.2)
 pkgconfig              2.0.3       2019-09-22 [2] CRAN (R 4.3.2)
 png                    0.1-8       2022-11-29 [2] CRAN (R 4.3.2)
 prettyunits            1.2.0       2023-09-24 [2] CRAN (R 4.3.2)
 progress               1.2.3       2023-12-06 [2] CRAN (R 4.3.2)
 purrr                  1.0.2       2023-08-10 [2] CRAN (R 4.3.2)
 R6                     2.5.1       2021-08-19 [2] CRAN (R 4.3.2)
 rappdirs               0.3.3       2021-01-31 [2] CRAN (R 4.3.2)
 Rcpp                   1.0.12      2024-01-09 [2] CRAN (R 4.3.2)
 RCurl                  1.98-1.14   2024-01-09 [2] CRAN (R 4.3.2)
 readr                * 2.1.5       2024-01-10 [2] CRAN (R 4.3.2)
 restfulr               0.0.15      2022-06-16 [2] CRAN (R 4.3.2)
 rjson                  0.2.21      2022-01-09 [2] CRAN (R 4.3.2)
 rlang                  1.1.3       2024-01-10 [2] CRAN (R 4.3.2)
 Rsamtools              2.18.0      2023-10-24 [2] Bioconductor
 RSQLite                2.3.5       2024-01-21 [2] CRAN (R 4.3.2)
 rtracklayer          * 1.62.0      2023-10-24 [2] Bioconductor
 S4Arrays               1.2.0       2023-10-24 [2] Bioconductor
 S4Vectors            * 0.40.2      2023-11-23 [2] Bioconductor 3.18 (R 4.3.2)
 sessioninfo          * 1.2.2       2021-12-06 [2] CRAN (R 4.3.2)
 SparseArray            1.2.3       2023-12-25 [2] Bioconductor 3.18 (R 4.3.2)
 stringi                1.8.3       2023-12-11 [2] CRAN (R 4.3.2)
 stringr                1.5.1       2023-11-14 [2] CRAN (R 4.3.2)
 SummarizedExperiment * 1.32.0      2023-10-24 [2] Bioconductor
 tibble                 3.2.1       2023-03-20 [2] CRAN (R 4.3.2)
 tidyr                  1.3.1       2024-01-24 [2] CRAN (R 4.3.2)
 tidyselect             1.2.0       2022-10-10 [2] CRAN (R 4.3.2)
 tzdb                   0.4.0       2023-05-12 [2] CRAN (R 4.3.2)
 utf8                   1.2.4       2023-10-22 [2] CRAN (R 4.3.2)
 vctrs                  0.6.5       2023-12-01 [2] CRAN (R 4.3.2)
 withr                  3.0.0       2024-01-16 [2] CRAN (R 4.3.2)
 xgboost                1.7.7.1     2024-01-25 [2] CRAN (R 4.3.2)
 XML                    3.99-0.16.1 2024-01-22 [2] CRAN (R 4.3.2)
 xml2                   1.3.6       2023-12-04 [2] CRAN (R 4.3.2)
 XVector              * 0.42.0      2023-10-24 [2] Bioconductor
 yaml                   2.3.8       2023-12-11 [2] CRAN (R 4.3.2)
 zlibbioc               1.48.0      2023-10-24 [2] Bioconductor

Let me know if you have any suggestions!

Thanks! Sowmya