GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

de novo mode error #405

Closed haydenji0731 closed 11 months ago

haydenji0731 commented 11 months ago

When I try to run the following command:

se <- bambu(reads = bam_fn, annotations = NULL, genome = genome_fn, NDR = 0.5, quant = FALSE)

I get an error like this:

--- Start extending annotations ---
NDR will be approximated as: (1 - Transcript Model Prediction Score)
Error in filterTranscriptsByAnnotation(rowDataCombined, annotationGrangesList, :
  WARNING - No annotations were provided. Please increase NDR threshold to use novel transcripts
Calls: bambu ... isore.extendAnnotations -> filterTranscriptsByAnnotation
Execution halted

Is there something that I'm missing? I'd like to run Bambu in its de novo mode.

andredsim commented 11 months ago

Hi Hayden,

This error occurs because, after filtering during the extend annotations step, no candidate transcripts were left, as indicated by "WARNING - No annotations were provided. Please increase NDR threshold to use novel transcripts". There could be a few reasons for this, including all candidate transcripts being scored lower than 0.5 (which is a concern in itself), or an issue with the bam file, such as it not being aligned with splice mode on, resulting in no spliced transcripts.

One quick test would be to rerun the line with NDR set to 1. This should include all possible transcripts regardless of quality. If this solves the issue, it likely means the pre-trained model provided by bambu is not well suited to your particular data, and you might need to train a new one.

If you continue to get the same issue, I would double-check how you generated the bam file, making sure reads were aligned to the genome (not the transcriptome), that splicing was turned on, etc. You could then try rerunning bambu with verbose on, which might provide a clue as to what is happening.
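Putting both suggestions together, the test run might look something like this (a sketch, reusing the `bam_fn` and `genome_fn` objects from the original command; `NDR` and `verbose` are standard `bambu()` arguments):

```r
library(bambu)

# Rerun de novo discovery with the most permissive NDR threshold and
# verbose logging, to see whether any candidate transcripts survive
# filtering and where the pipeline spends its time.
se <- bambu(reads = bam_fn, annotations = NULL, genome = genome_fn,
            NDR = 1, quant = FALSE, verbose = TRUE)
```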

Kind Regards, Andre Sim

haydenji0731 commented 11 months ago

Thank you for the reply.

The BAM file isn't the issue here, since the same BAM file was used for a guided run with the human RefSeq annotation and Bambu threw no error. Splicing was indeed turned on, and reads were aligned to the genome.

After setting NDR to 1, I get a different error from the writeBambuOutput function. It seems like a dependency issue, but it's odd that the error only occurs in de novo mode. Does this mean the output se object is empty?

--- Start extending annotations ---
combing spliced feature tibble objects across all samples in 0.3 mins.
extract new unspliced ranges object for all samples in 0 mins.
reduce new unspliced ranges object across all samples in 0 mins.
combine new unspliced tibble object across all samples in 0 mins.
combining transcripts in 0.3 mins.
extended annotations for spliced reads in 0 mins.
extended annotations for unspliced reads in 0 mins.
NDR will be approximated as: (1 - Transcript Model Prediction Score)
transcript filtering in 0 mins.
extend annotations in 0.1 mins.
Error in MatrixGenerics:::.load_next_suggested_package_to_search(x) :
  Failed to find a rowRanges() method for CompressedGRangesList objects.
  However, the following packages are likely to contain the missing method
  but are not installed: sparseMatrixStats, DelayedMatrixStats.
  Please install them (with 'BiocManager::install(...)') and try again.
  Alternatively, if you know where the missing method is defined, install
  only that package.
Calls: writeBambuOutput -> rowRanges -> rowRanges -> <Anonymous>
Execution halted

Also, why would the Bambu pre-trained model not be well suited when it was trained on human data and I'm also analyzing human data?

andredsim commented 11 months ago

Hi Hayden,

Good to know that the bam file runs through normally. It's the most common issue we encounter, so I always have to check.

writeBambuOutput will not work with your output because, with quant set to FALSE, you didn't run the full bambu pipeline. The output function you need is writeToGTF(). You can find the syntax for that here: https://github.com/GoekeLab/bambu?tab=readme-ov-file#output. I am not sure how writeBambuOutput worked when you ran in guided mode; that is quite odd.
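For reference, exporting the discovery-only result might look like this (a sketch: with quant = FALSE, `bambu()` returns the extended annotations as a GRangesList, and the output path is illustrative):

```r
# `se` here is the GRangesList returned by the quant = FALSE run;
# writeToGTF() writes the extended annotations to a GTF file.
writeToGTF(se, file = "./extended_annotations.gtf")
```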

The species is not the biggest issue when it comes to the pre-trained model (with some exceptions); rather, it is the 'shape' of the transcriptome. Large differences in sequencing depth, degradation rate, and the number of expressed transcripts can all impact the pre-trained model, which is why we always recommend training on preexisting annotations when possible. It could also be that the overall ranking of the output is fine with NDR = 1 but the TPS values are low. You could check whether the top-scored transcripts (those with the lowest NDR) match transcripts you expect to be in the sample, and then choose a desired cutoff threshold manually.

May I ask what your research question is that requires de novo transcript discovery in human, perhaps I can provide some guidance on how or if bambu can effectively be used for that?

Kind Regards, Andre Sim

haydenji0731 commented 11 months ago

Ah, that's a good point. I had set quant = FALSE for debugging. In guided runs, I set it to TRUE.

Thank you for your advice.