BiocParallel errors when running bambu for quantification

zzare-umd commented 1 year ago

Hi,

I am trying to run bambu with the following code:

se.quantOnly <- bambu(reads = "ERR3218366.bam", annotations = "../../data/human-ref-trans/gencode.v42.chr_patch_hapl_scaff.annotation.gtf", genome = "../../data/human-ref-trans/GRCh38.p13.genome.fa", discovery = FALSE)

But unfortunately, I got biocparallel error which shows below. Could you please help me that how I can resolve the issue?

Screenshot 2022-12-10 192108 Screenshot 2022-12-10 192138

Regards, Zahra

andredsim commented 1 year ago

Hi Zahra,

This error seems to be occurring very early in the running process and we have seen this happen when a different fasta file is used in bambu than was used for the alignment. Could you please double check that the chromosome names in the bam file match those in the fasta file?

You can check this in R with the following:

genomeSequence = FaFile("../../data/human-ref-trans/GRCh38.p13.genome.fa")
seqlevels(genomeSequence)
bamFile = prepareDataFromBam("ERR3218366.bam", verbose = FALSE, use.names = FALSE)
seqlevels(bamFile)

If they are the same, could you also try running bambu in verbose mode and let me know what is posted, as that may give me a better idea what may be going wrong.

Kind Regards, Andre

zzare-umd commented 1 year ago

Hi Andre,

Thanks for your quick response. I tried the checking code that you wrote above, and the results showed me that the chromosome names in the bam file match those in the fasta file. Then, I tried to run bambu in verbose mode and it did not give much more information than before. The output of bambu in verbose mode is as follows:

pac2

andredsim commented 1 year ago

Hi Zahra,

Hrmm this is a bit tricky to debug from my side since Bambu isn't producing anything meaningful in its output. Is there anything specific with your sample that might make it different from a standard long-read sample? Are you perhaps able to send me both the .bam file, .fa and .gtf file and I can try debug it from my end and hopefully get it running for you?

Unfortunately I will be unavailable over the next few days but I can look into this when I return next Tuesday.

Kind Regards, Andre Sim

andredsim commented 1 year ago

One other thing I realized I should is if the bam file aligned to the genome or the transcriptome? Could you also please include a screenshot with the chromosome names from the bam and genome file to see if there is perhaps a syntax reason in the naming that might cause this?

zzare-umd commented 1 year ago

Hi Andre,

I'm so sorry for late response. I checked the chromosome names from the bam and genome file and they are the same as shown below since the bam file is aligned to the genome.

The genome chromosome names are:

The bam file chromosome names are:

What is the best way that I can send the .bam file,. fa and .gtf files to you?

Regards, Zahra

andredsim commented 1 year ago

Hi Zahra,

Depending on the size of the files you could either email them to me directly (andre_sim[at]gis.a-star.edu.sg) or host it on a platform such as google drive, onedrive or dropbox and send me the link.

Kind Regards, Andre Sim

andredsim commented 1 year ago

Hi Zahra,

Thank you for sending the files. I will keep the discussion on here in case any other users encounter a similar issue in the future. I was able to investigate your .bam file more closely and I noticed that there are no spliced reads. Is it expected for your data to contain transcripts without introns? Bambu is designed to use spliced reads for transcript discovery and quantification and I cannot predict the quality of results without any splice junctions, however thanks to this report I will be adding a warning message and fixing the crash that occurs when only single exon reads are provided.

If it isn't expected, perhaps double check that the aligner you used allows for splice junctions. For example in minimap2 one needs to use -ax splice as an argument to enable splicing.

Hope this helps a bit and let me know!

Kind Regards, Andre Sim

andredsim commented 1 year ago

Hi Zahra,

If you were still wanting to run Bambu using your unspliced bam file I have identified and fixed the crash which occurs when running bambu without any spliced reads. You can find this fix in the BambuDev branch which you can download and install using the GitHub instructions in the documentation.

I do want to iterate and stress that Bambu wasn't designed to run without spliced reads and it was not evaluated in this way so I can't guarantee how well it will perform. Furthermore by default bambu excludes all single-exon (unspliced) reads from transcript discovery, if you do want to include these please see this section in the documentation (https://github.com/GoekeLab/bambu#Including-single-exons).

I tested it on the data and annotations you provided and because none of the unspliced reads match any annotations Bambu cannot train a discovery model or calculate and NDR. Therefore if you do want to go ahead running bambu in this unsupported way I would suggest manually setting the NDR threshold to your desired level of sensitivity. An NDR of 0.6 (quite sensitive already) gives 97 novel transcripts and this number increases dramatically with higher thresholds.

Hope this helps and have a great new year. Andre Sim

GoekeLab / bambu

BiocParallel errors when running bambu for quantification #340