GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

Large number of BAM files leads to Error in `vec_interleave_indices()`: #450

Open NikoLichi opened 1 week ago

NikoLichi commented 1 week ago

Dear Bambu team,

I am running a massive project with 480 BAM files (~4.8 TB of data in total). Following the previous suggestion for Bambu, I am first running transcript discovery to get the extended annotations (quant = FALSE), with the idea of running quantification later in batches.

However, there is a major issue when starting the extended annotation step:

--- Start extending annotations ---
Error in `vec_interleave_indices()`:
! Long vectors are not yet supported in `vec_interleave()`. Result from interleaving would have size 8857886400, which is larger than the maximum supported size of 2^31 - 1.
Backtrace:
     ▆
  1. ├─bambu::bambu(...)
  2. │ └─bambu:::bambu.extendAnnotations(...)
  3. │   └─bambu:::isore.combineTranscriptCandidates(...)
  4. │     ├─... %>% data.table()
  5. │     └─bambu:::combineSplicedTranscriptModels(...)
  6. │       └─bambu:::updateStartEndReadCount(combinedFeatureTibble)
  7. │         └─... %>% mutate(sumReadCount = sum(readCount, na.rm = TRUE))
  8. ├─data.table::data.table(.)
  9. ├─dplyr::mutate(., sumReadCount = sum(readCount, na.rm = TRUE))
 10. ├─dplyr::group_by(., rowID)
 11. ├─tidyr::pivot_longer(...)
 12. ├─tidyr:::pivot_longer.data.frame(...)
 13. │ └─tidyr::pivot_longer_spec(...)
 14. │   └─vctrs::vec_interleave(!!!val_cols, .ptype = val_type)
 15. │     └─vctrs:::vec_interleave_indices(n, size)
 16. └─rlang::abort(message = message)
Execution halted
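
For scale, the requested result is more than four times the maximum vector size; a quick check with the numbers from the error message:

8857886400 / (2^31 - 1)  # ~4.12, i.e. over 4x the 2147483647-element cap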

Is there anything I could do to get Bambu running?

My code looks like:

library(bambu)

BAMlist <- BAMs_one_per_Line   # character vector of the 480 BAM file paths, one per line
fa.file <- "/refData/release46/GRCh38.primary_assembly.genome.fa"
gtf.file <- "/refData/release46/gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations <- bambu(reads = BAMlist, annotations = bambuAnnotations,
                             genome = fa.file, quant = FALSE, lowMemory = TRUE,
                             ncore = 14, rcOutDir = "MY_PATH/bambu_20241015_all")
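
For context, the quantification I plan to run afterwards would reuse the read class files that bambu writes to rcOutDir, roughly like this (just a sketch; the batch size of 60 is arbitrary):

# read class .rds files written during the discovery run
rcFiles <- list.files("MY_PATH/bambu_20241015_all",
                      pattern = "\\.rds$", full.names = TRUE)
# quantify one batch against the (eventual) extended annotations,
# skipping re-discovery with discovery = FALSE
seBatch <- bambu(reads = rcFiles[1:60], annotations = extendedAnnotations,
                 genome = fa.file, discovery = FALSE, quant = TRUE, ncore = 14)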

As an additional note, I also get the same warning message that others have reported in issue #407.

This is with R 4.3.2, Bioconductor 3.18, and bambu 3.4.1. Platform: x86_64-conda-linux-gnu (64-bit)

All the best, Niko

NikoLichi commented 10 hours ago

Hello there,

I made some progress with a different approach, but bambu still fails with the same error.

I divided the data into 3 batches to obtain the extended annotations as .rds files. I set it up like this for each of the 3 batches:

extendedAnnotations <- bambu(reads = BAMs_one_per_Line, annotations = bambuAnnotations,
                             genome = fa.file, NDR = 0.1, quant = FALSE, lowMemory = TRUE,
                             ncore = 14, rcOutDir = "MY_DIR")

As a small note, I had first tested with the default NDR, and each batch came out with an NDR > 0.6; since I am working with human samples, I decided to fix NDR = 0.1.

After this, I put the .rds files from the three bambu runs together, listing the .rds files for each run in a vector and merging them like this:

B01 <- Myfiles01   # vector of .rds read class files from batch 1
B02 <- Myfiles02   # vector of .rds read class files from batch 2
B03 <- Myfiles03   # vector of .rds read class files from batch 3

B_allRDs <- c(B01, B02, B03)

mergedAnno <- bambu(reads = B_allRDs, genome = fa.file,
                    annotations = bambuAnnotations, quant = FALSE)

But then I got the same error as above; the first lines are:

--- Start extending annotations ---
Error in `vec_interleave_indices()`:
! Long vectors are not yet supported in `vec_interleave()`. Result from interleaving would have size 17715772800, which is larger than the maximum supported size of 2^31 - 1.
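
As a side observation, this size is exactly twice the 8857886400 reported for the 480-BAM run, so the merged run seems to interleave twice as many elements in that one call:

17715772800 / 8857886400  # exactly 2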

I would appreciate any help with running this massive data set.

Best, Niko