GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

question on multisampling #421

Open sparthib opened 7 months ago

sparthib commented 7 months ago

I am trying the multisample mode and running into out-of-memory issues, which I expected, but I can only assign a certain amount of RAM on my cluster. Is there a work-around for this?

For example, I am able to run bambu with quant = TRUE on each of my samples separately, which still outputs some novel isoforms per sample. How would I go about comparing which novel transcripts are common between samples? Thanks!

andredsim commented 7 months ago

Hi,

Could you share the script you are using to run Bambu? Then I can suggest changes based on what you have already tried.

sparthib commented 7 months ago
library(bambu)
library(BiocFileCache)
library(readr)
library(sessioninfo)
library(dplyr)

# gtf.file <- "path_to_gtf/gencode_annotation.gtf"
# bambuAnnotations <- prepareAnnotations(gtf.file)
# saveRDS(bambuAnnotations, "path_to_annotation/annotations.rds")

task_id <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))

annotation <- readRDS("path_to_annotation/annotations.rds")

samples <- vector_of_sample_names

se_output_dir <- paste0("output_path", samples[task_id], "/")

se_quantOnly <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"),
                      annotations = annotation,
                      genome = fa.file,
                      quant = TRUE)

writeBambuOutput(se_quantOnly, 
                 path = se_output_dir,
                 prefix = samples[task_id])

sessioninfo::session_info()

This is what I have so far!

Thanks, Sowmya

andredsim commented 7 months ago

Hi Sowmya,

Here are a few options to do multiple sample analysis when memory is limiting.

  1. You can try running all samples together but set lowMemory = TRUE
  2. You can produce the read classes separately first, then generate the annotations together, then do quantification separately (or together):

     se_readClass1 <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"),
                            annotations = annotation,
                            genome = fa.file,
                            discovery = FALSE,
                            quant = FALSE,
                            rcOutDir = "output_path")

     This will produce .rds files in the output directory (rcOutDir) which can be used in future bambu runs to skip the first (memory-intensive) step. Alternatively, you can save the se_readClass variable using saveRDS.

     extendedAnnotations <- bambu(reads = c(se_readClass1, se_readClass2, se_readClass3), # this can also be a vector of paths to the .rds files
                                  annotations = annotation,
                                  genome = fa.file,
                                  discovery = TRUE,
                                  quant = FALSE)

     The step below you can run one file at a time if you want to distribute it, instead of providing multiple files at once as in this example. As long as you use extendedAnnotations, the novel transcripts will be comparable between the outputs.

     se <- bambu(reads = c(se_readClass1, se_readClass2, se_readClass3), # this can also be a vector of paths to the .rds files
                 annotations = extendedAnnotations,
                 genome = fa.file,
                 discovery = FALSE,
                 quant = TRUE)
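For option 1, a minimal sketch of what a lowMemory run could look like, reusing the placeholder names from the script above (bam_dir, samples, annotation, fa.file); lowMemory is intended to trade speed for a lower peak memory footprint:

```r
library(bambu)

# run all samples together in one bambu call;
# lowMemory = TRUE reduces peak RAM usage at the cost of runtime
bam_files <- paste0(bam_dir, samples, "_sorted.bam")

se <- bambu(reads = bam_files,
            annotations = annotation,
            genome = fa.file,
            lowMemory = TRUE)
```

Whether this fits within your cluster allocation still depends on the number and size of the BAM files, so it is worth testing on a subset first.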

Just a side point: I noticed you named your variable se_quantOnly. If you only want to do quantification, with no novel isoform discovery, then you need to set discovery = FALSE.

I hope this helps, Kind Regards, Andre Sim

sparthib commented 7 months ago

thank you, Andre!

sparthib commented 7 months ago

Suppose I split the samples by chromosome and generated read classes for these separately. Can I put these chromosome-wise read class files together before generating the extended annotation?

andredsim commented 7 months ago

Yes, this would be possible. This is how lowMemory mode is meant to work, although it still needs to load the whole bam into memory. There are two caveats to splitting the bam file in advance by chromosome.

  1. There will be far fewer read classes for bambu to train on in each run, which will impact the performance of transcript discovery and, to a lesser extent, the junction correction. You could minimize this impact by turning off fitting and using the default model (this could either improve or decrease performance depending on read depth, so you would need to test it).
  2. If you fuse the read class files together manually in R (rather than via extended annotations), you will need to take care to rename all the read classes, as they will by default have overlapping names, which may cause problems.
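A hypothetical sketch of such a rename, assuming each read-class file is a SummarizedExperiment whose rownames identify the read classes (check your own objects before relying on this; the file paths and prefixing scheme here are purely illustrative):

```r
library(SummarizedExperiment)

# load two per-chromosome read-class files (paths are illustrative)
rc_chr1 <- readRDS("rc_chr1.rds")
rc_chr2 <- readRDS("rc_chr2.rds")

# prefix the rownames so read-class IDs no longer collide after fusing
rownames(rc_chr1) <- paste0("chr1_", rownames(rc_chr1))
rownames(rc_chr2) <- paste0("chr2_", rownames(rc_chr2))
```

If bambu keys read-class identity on something other than rownames in your version, the same idea applies: make every identifier unique before combining the objects.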

Kind Regards, Andre Sim

sparthib commented 7 months ago

I was able to save my extended annotation to a GTF with quant = FALSE and discovery = TRUE. This is what it looks like when I load it.

extendedAnnotations <- bambu(reads = rds_files,
                             annotations = annotation, 
                             genome = fa.file, 
                             discovery = TRUE, 
                             quant = FALSE)
writeToGTF(extendedAnnotations, "../multisample_output.gtf")

I loaded this GTF file using the readGFF function from rtracklayer.

This is what the first row looks like.

       seqid source       type start   end score strand phase           gene_id     transcript_id exon_number
1 GL000009.2  Bambu transcript 56140 58376    NA      -    NA ENSG00000278704.1 ENST00000618686.1        <NA>

Does this result make sense? How do I move on to visualization from here?

Thanks, Sowmya

andredsim commented 7 months ago

Hi Sowmya,

Yes, that looks fine. With regards to visualization, it depends on what you want to see; there are lots of packages available for plotting the GRanges object stored in extendedAnnotations. Bambu's inbuilt plotting functions require you to also perform quantification, as we plot the expression of each model as well.

Kind Regards, Andre Sim
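For reference, bambu's built-in plotting goes through plotBambu() on the quantified SummarizedExperiment. A minimal sketch, assuming a quantification run against the extended annotation; the gene ID is taken from the GTF row above purely as an illustration:

```r
library(bambu)

# quantify against the extended annotation (rds_files as in the earlier script)
se <- bambu(reads = rds_files,
            annotations = extendedAnnotations,
            genome = fa.file,
            discovery = FALSE,
            quant = TRUE)

# plot the transcript models and their expression for one gene
plotBambu(se, type = "annotation", gene_id = "ENSG00000278704.1")
```

For annotation-only plots without expression, general-purpose GRanges plotting packages such as Gviz or ggbio are an alternative.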