Roleren / ORFik

Out of memory error in shiftFootprints function #179

Open josefinaperalba opened 3 months ago

josefinaperalba commented 3 months ago

Hi! I'm running the following code with some big BAM files:

bam_file <- fimport(
    opt$bam
)

offsets <- ORFik::detectRibosomeShifts(
    footprints = bam_file,
    txdb = opt$ref_db
    #minCDS = 20,
    #top_tx = 100
)

bam_file_shifted <- ORFik::shiftFootprints(
    footprints = bam_file,
    shifts = offsets
)

export.ofst(
    bam_file_shifted,
    file = opt$out_ofst
)

export.ofst(
    convertToOneBasedRanges(
        bam_file_shifted,
        addScoreColumn = TRUE,
        addSizeColumn = TRUE
    ),
    file = opt$out_ofst_ww
)

But I'm getting the following error: 'R_Realloc' could not re-allocate memory (18446744065128005632 bytes). Is there a way to optimize this so it doesn't consume this much memory? I'm running this in order to use the shifted files with the ORFik features function.

Thank you so much!

Roleren commented 3 months ago

18446744065128005632 bytes = 18,446,744 terabytes (18 million terabytes!). Notably, that is just below 2^64 bytes, the classic signature of a negative allocation size wrapping around as an unsigned 64-bit integer.

What are you loading? :D And at what step does this error happen?

josefinaperalba commented 3 months ago

Hi! My BAM file is 39 GB and, as you mentioned, the error says it failed to reallocate 18446744065128005632 bytes. This is the step that is failing:

offsets <- ORFik::detectRibosomeShifts(
    footprints = bam_file,
    txdb = ref_db,
    minCDS = 20
    #top_tx = 100
)

It fails at the fimport step. I have another BAM file of 24 GB that worked. Is there a way to optimize this?

Roleren commented 3 months ago

First thing to try: don't load the BAM file before you pass it into ORFik::detectRibosomeShifts. So change to this:

# Instead of loading the file into memory first:
bam_file <- fimport(
    opt$bam
)
# just keep the path and let ORFik read it internally:
bam_file <- opt$bam

Since R does not pass arguments by reference into functions, it can end up holding 3 copies of the BAM file in memory if you load it first, but only 1 if you give the path.
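
For reference, a minimal sketch of the resulting path-based call, reusing opt$bam and opt$ref_db from the script above:

# Pass the file path straight to detectRibosomeShifts, so the BAM
# is only read once, inside the function:
offsets <- ORFik::detectRibosomeShifts(
    footprints = opt$bam,   # file path instead of a preloaded object
    txdb = opt$ref_db
)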

josefinaperalba commented 3 months ago

Okay, but in that case I need to load it before the shiftFootprints function, right? That function doesn't allow a path as input. Or is there a more memory-efficient way to do it? Would doing this before shiftFootprints work?

txdb <- loadTxdb(ref_db)
cds <- loadRegion(txdb, part = "cds", names.keep = txNames)
bam_file <- fimport(opt$bam, cds)

Roleren commented 3 months ago

True, I will update the function to support file path input. I'll do some tests on how much memory this saves :)

josefinaperalba commented 2 months ago

Awesome! Thank you so much!! Sorry for the multiple questions, but I'm also working on aligning Ribo-seq reads with STAR, and I was wondering what this package's strategy is for dealing with multimappers. I saw in the package's STAR.align functions that the default value for the max.multimap parameter is 10. Is there a logic behind this value? What strategy is recommended?

Roleren commented 2 months ago

10 is a good default for human; it depends on the species, but don't touch it unless you have a good reason :)

Roleren commented 2 months ago

Translated pseudogenes are a very common source of multimappers, as are microRNAs mapping to 3' trailers.
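
For reference, a hedged sketch of keeping that default in ORFik's STAR wrapper; STAR.align.folder and max.multimap are named in this thread, while the directory paths here are illustrative placeholders:

library(ORFik)
# Align a folder of Ribo-seq fastq files; max.multimap = 10 is the
# default discussed above, written out explicitly:
STAR.align.folder(
    input.dir    = "trimmed_fastq/", # placeholder path
    output.dir   = "aligned/",       # placeholder path
    index.dir    = "STAR_index/",    # placeholder path
    max.multimap = 10                # reads mapping to more loci are filtered out
)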

Roleren commented 2 months ago

BTW, could you do one thing first? I need to see if it fails here or not:

ORFik:::convert_bam_to_ofst(df)

Then run using the OFST files you get as output. The point is that these are collapsed and compressed to fst. I always make a collapsed OFST of the BAM (BEFORE) I input it into the detectRibosomeShifts function.

Since BAM is a very old format, the file should only be loaded into R once, that is, when calling convert_bam_to_ofst. After that, only use the BAM to load custom flags (like the multimapper flag, etc.).

Especially when you have a 39 GB BAM file, which is unusually large.
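
For reference, a minimal sketch of that workflow, assuming df is an ORFik experiment object and that filepath(df, "ofst") points at the converted files; the experiment and its paths are placeholders:

library(ORFik)
# One-time conversion: BAM -> collapsed, compressed OFST
ORFik:::convert_bam_to_ofst(df)

# From here on, work from the small OFST file instead of the BAM:
ofst_path <- filepath(df, "ofst")[1]
footprints <- fimport(ofst_path)
offsets <- detectRibosomeShifts(footprints, txdb = loadTxdb(df))
shifted <- shiftFootprints(footprints, offsets)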

josefinaperalba commented 2 months ago

Hi! Thanks for all your responses! The strategy of converting to OFST before shifting worked for all files smaller than 30 GB, so I'll just work with that limit. Thank you so much again!

Roleren commented 2 months ago

So to complete the biggest file, you can use this trick (if the error is loading in the full 39 GB BAM):

# Goal: read in 2 halves, collapse duplicates, save 2 files, read both
# back in, collapse the merge of these two, and save out again.
library(ORFik)
df <- ORFik.template.experiment.zf() # Use your experiment here
bam_paths <- filepath(df, "default")
index_of_biggest_file <- which.max(file.size(bam_paths))
big_bam_path <- bam_paths[index_of_biggest_file]
number_of_reads_in_bam <- length(unlist(scanBam(big_bam_path, param = ScanBamParam(what = "qname")),
                      use.names = FALSE))

bamfile <- BamFile(big_bam_path, yieldSize = ceiling(number_of_reads_in_bam / 2))
all_temp_files <- c()
open(bamfile)
while (length(chunk0 <- readGAlignments(bamfile))) {
  ofst_big_path_temp <- tempfile(fileext = ".ofst")
  message(ofst_big_path_temp)
  load_half_bam <- collapseDuplicatedReads(chunk0, # Read half file
                                           addScoreColumn = TRUE)
  export.ofst(load_half_bam, ofst_big_path_temp)
  rm(load_half_bam) # remove subset
  all_temp_files <- c(all_temp_files, ofst_big_path_temp)
}
close(bamfile)

# Merge and export
# Sadly, ofst does not support appending to files on disk, so we need to read both halves back in and save again
subsets <- lapply(all_temp_files, function(path) {
  return(fimport(path))
})
lengths(subsets) # New lengths

ofst_basename <- paste0(ORFik:::remove.file_ext(big_bam_path, TRUE), ".ofst")
ofst_big_path <- file.path(dirname(big_bam_path), "ofst", ofst_basename)
combined <- collapseDuplicatedReads(sort(unlist(GAlignmentsList(subsets))),
                                    addScoreColumn = TRUE)
number_of_reads_in_bam - sum(lengths(subsets)) # reads collapsed within chunks
sum(lengths(subsets)) - length(combined)       # reads collapsed in final merge vs chunks

dir.create(dirname(ofst_big_path), showWarnings = FALSE) # ensure the ofst/ folder exists
export.ofst(combined, ofst_big_path) # Now you have the correct OFST

josefinaperalba commented 2 months ago

Hi! Sorry to bother, but I saw that you committed an update fixing the bug in shiftFootprints, and I wanted to check whether that already makes it possible for the function to accept a path as input. I ask because I've tried running it again with a path, and the error I get comes from the readWidths function: Error in readWidths(footprints, along.reference = TRUE) : reads must be either GRanges, GAlignments, GAlignmentPairs or covRleList. I'm not sure if the problem is how I'm installing the package (maybe I'm not using the latest version) or if the commit I saw is not the definitive solution.

I'm installing the package in a Docker image whose Dockerfile is the following:

# Use the latest Rocker R image
FROM r-base:latest

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev \
    libssl-dev \
    libxml2-dev \
    libz-dev \
    liblzma-dev \
    libbz2-dev \
    libpng-dev \
    libgit2-dev \
    procps

# Install CRAN packages (parallel ships with base R, so it is not installed here)
RUN install2.r -e \
    BiocManager \
    optparse \
    tidyr \
    dplyr \
    remotes

# Install Bioconductor packages
RUN Rscript -e 'BiocManager::install(c("GenomicFeatures", "GenomicAlignments"), ask=F);'

# Install the development version of ORFik from GitHub
RUN Rscript -e 'remotes::install_github("Roleren/ORFik")'

Thank you so much!
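
One quick way to check (a sketch, run inside the container) whether the GitHub build is actually the version installed:

# Print the installed ORFik version and, for remotes::install_github
# installs, the exact commit hash it was built from:
packageVersion("ORFik")
packageDescription("ORFik")$RemoteSha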

Roleren commented 2 months ago

Hey, try the newest commit. I documented it, and it should now work as you intended :)

josefinaperalba commented 1 month ago

Hi! Everything is working as expected, but I just wanted to ask a question about how the code works. If I downsize my BAM files, say I only keep certain regions of interest when filtering the BAM file, would the code still work? Does this make sense? What if I only do it for the bigger BAM files where the function crashes (bigger than 25 GB)?
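
For reference, a minimal sketch of what such region-based downsizing could look like before any ORFik steps. This reuses ref_db and opt$bam from earlier in the thread; ScanBamParam's which argument restricts loading to reads overlapping the given ranges:

library(ORFik) # also attaches GenomicAlignments and Rsamtools

txdb <- loadTxdb(ref_db)
cds <- loadRegion(txdb, part = "cds")

# reduce() merges overlapping ranges so no read is loaded twice:
region_param <- ScanBamParam(which = reduce(unlist(cds, use.names = FALSE)))
bam_subset <- readGAlignments(opt$bam, param = region_param)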