Open sparthib opened 7 months ago
Hi,
I am trying the multisample mode and running into out-of-memory issues, which I expected, but I can only request a certain amount of RAM on my cluster. Is there a work-around for this?
So, for example, I am able to run bambu with quant = TRUE, which still outputs some novel isoforms for each of my samples separately. How would I go about comparing which novel transcripts are common between samples? Thanks!
Hi Sowmya,
Could you share with me the script you are using to run Bambu? Then I can suggest changes based on what you have already tried.
library(bambu)
library(BiocFileCache)
library(readr)
library(sessioninfo)
library(dplyr)
# gtf.file <- "path_to_gtf/gencode_annotation.gtf"
# bambuAnnotations <- prepareAnnotations(gtf.file)
# saveRDS(bambuAnnotations, "path_to_annotation/annotations.rds")
task_id <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
annotation <- readRDS("path_to_annotation/annotations.rds")
fa.file <- "path_to_genome/genome.fa"  # placeholder path to the genome fasta
bam_dir <- "path_to_bams/"             # placeholder path to the sorted bam files
samples <- vector_of_sample_names      # character vector of sample names
se_output_dir <- paste0("output_path", samples[task_id], "/")
se_quantOnly <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"),
annotations = annotation,
genome = fa.file,
quant = TRUE)
writeBambuOutput(se_quantOnly,
path = se_output_dir,
prefix = samples[task_id])
sessioninfo::session_info()
This is what I have so far!
Thanks, Sowmya
Hi Sowmya,
Here are a few options for doing multi-sample analysis when memory is limiting.
Option 1: run bambu with lowMemory = TRUE.
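For example, reusing the call from your script with just the extra argument:
se_quantOnly <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"),
                      annotations = annotation,
                      genome = fa.file,
                      quant = TRUE,
                      lowMemory = TRUE)  # lower-memory processing mode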
Option 2: generate the read classes for each sample separately:
se_readClass1 <- bambu(reads = paste0(bam_dir, samples[task_id], "_sorted.bam"),
                       annotations = annotation,
                       genome = fa.file,
                       discovery = FALSE,
                       quant = FALSE,
                       rcOutDir = "output_path")
This will produce .rds files in the output directory (rcOutDir) which can be used in future bambu runs to skip the first (memory-intensive) step. Alternatively, you can save the se_readClass variable using saveRDS.
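For example (the output file name here is just a placeholder):
saveRDS(se_readClass1, paste0("output_path", samples[task_id], "_readClass.rds"))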
Then run discovery across all samples together to generate the extended annotations:
extendedAnnotations <- bambu(reads = c(se_readClass1, se_readClass2, se_readClass3), # this can also be a vector of paths to the .rds files
                             annotations = annotation,
                             genome = fa.file,
                             discovery = TRUE,
                             quant = FALSE)
The step below you can do one file at a time if you want to distribute it, instead of the example where I have provided multiple files at once. As long as you use extendedAnnotations, the novel transcripts will be comparable between the outputs.
se_quant <- bambu(reads = c(se_readClass1, se_readClass2, se_readClass3), # this can also be a vector of paths to the .rds files
                  annotations = extendedAnnotations,
                  genome = fa.file,
                  discovery = FALSE,
                  quant = TRUE)
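For the one-file-at-a-time version, a per-sample call could look roughly like this (reusing the SLURM array pattern from your script; the read class file name is a placeholder):
se_quant <- bambu(reads = paste0("output_path", samples[task_id], "_readClass.rds"), # read class file for this sample
                  annotations = extendedAnnotations, # the shared extended annotations
                  genome = fa.file,
                  discovery = FALSE,
                  quant = TRUE)
writeBambuOutput(se_quant, path = se_output_dir, prefix = samples[task_id])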
Just a side point: I noticed you named your variable se_quantOnly. If you only want to do quantification and do not want any novel isoforms, then you need to set discovery = FALSE.
I hope this helps, Kind Regards, Andre Sim
Thank you, Andre!
Suppose I split the samples by chromosome and generate read classes for these separately; can I put these chromosome-wise read class files together before generating the extended annotation?
Yes, this would be possible. This is how lowMemory mode is meant to work, although it still needs to load the whole BAM into memory. There are two caveats to splitting the BAM file in advance by chromosome.
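A rough sketch, assuming the chromosome-wise read class files were all written as .rds (via rcOutDir or saveRDS) into one directory; the directory and file pattern are placeholders:
rc_files <- list.files("output_path", pattern = "\\.rds$", full.names = TRUE) # all chromosome-wise read class files
extendedAnnotations <- bambu(reads = rc_files,
                             annotations = annotation,
                             genome = fa.file,
                             discovery = TRUE,
                             quant = FALSE)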
Kind Regards, Andre Sim
I was able to save my extended annotation to a GTF with quant = FALSE and discovery = TRUE. This is what it looks like when I load it.
extendedAnnotations <- bambu(reads = rds_files,
annotations = annotation,
genome = fa.file,
discovery = TRUE,
quant = FALSE)
writeToGTF(extendedAnnotations, "../multisample_output.gtf")
I loaded this GTF file using the readGFF function from rtracklayer.
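Roughly:
library(rtracklayer)
gtf <- readGFF("../multisample_output.gtf")
head(gtf, 1)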
This is what the first row looks like.
seqid source type start end score strand phase gene_id transcript_id exon_number
1 GL000009.2 Bambu transcript 56140 58376 NA - NA ENSG00000278704.1 ENST00000618686.1 <NA>
Does this result make sense? How do you move on to visualization from here?
Thanks, Sowmya
Hi Sowmya,
Yes, that looks fine. With regards to visualization, it depends on what you are wanting to see, and there are lots of packages available for plotting the GRanges object stored in extendedAnnotations. Bambu's inbuilt plotting functions require you to also perform quantification, as we plot the expression of each model as well.
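For example (a minimal sketch, assuming se holds the output of a bambu run with quant = TRUE; the gene ID is just the one from the row you pasted above):
plotBambu(se, type = "annotation", gene_id = "ENSG00000278704.1")
Kind Regards, Andre Sim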