GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
170 stars 22 forks source link

Interpreting warning messages in discovery only mode #415

Open sparthib opened 4 months ago

sparthib commented 4 months ago

Hello, I have a bam file where my reads are aligned to the GENCODE GRCh38.p14.genome.fa file. The corresponding gtf file is gencode.v44.chr_patch_hapl_scaff.annotation.gtf

I tried running BAMBU in discovery only mode

bambuAnnotations <- prepareAnnotations(gtf.file)
se.discoveryOnly <- bambu(reads = test.bam,
                          annotations = bambuAnnotations,
                          genome = fa.file,
                          quant = FALSE,
                          rcOutDir = se_output_dir,
                          NDR = 1)

I reloaded the output rcFile as

se.discoveryOnly <- bambu(reads = c("test_rc.rds"),
                                 annotations = bambuAnnotations, genome = fa.file)

In my resulting se file, I see these 3 warnings in the metadata:

# [1] "not all chromosomes present in reference annotations, annotations might be incomplete. Please compare objects on the same reference"
# [2] "25701 reads are mapped outside the provided genomic regions. These reads will be dropped. Check you are using the same genome used for the alignment"
# [3] "No aligned spliced reads detected!Bambu expects spliced reads. If this is intended, see Documentation on how to handle single-exon transcripts"

Also, for each transcript in my bambuAnnotations object, I see the following message as well. 505 sequences from an unspecified genome; no seqlengths

I seem to be using matching reference genome fastas and gtfs. Any pointers as to what could be wrong would be appreciated.

Thanks,

Sowmya

andredsim commented 4 months ago

Hi Sowmya,

The first two warnings are more informational, than detrimental to your analysis. The first can occur when there are reads to scaffolds in the fasta file, that have no annotations in the gtf file, for example on accessory chromosomes, which is quite common. The second is a consequence of some aligners which can align part of a read near the end of scaffolds, but the entire read extends beyond those boundaries. I have observed this myself in my own data and we havn't come up with a clean solution, which is why we only report how many reads we remove from the analysis, so users can know how much of their data is impacted.

The third warning however is critical. By default bambu expects spliced reads, and in most use-cases they should be present in the data. The most typical cause for a lack of spliced reads is either the user mapped the reads to the transcriptome (which doesn't seem to be the case in this instance), or they ran an aligner such as minimap2 without setting the splice aware parameter.

Let me know how you aligned your data, or if you don't expect spliced reads (prokaryote sample, targeted amplification sequencing etc), and we can go from there.

Kind Regards, Andre Sim

sparthib commented 4 months ago

Hi Andre,

Thank you for your quick and detailed response. I believe this is because I ran minimap2 without setting the splice aware parameter. Will update this thread after rerunning alignment.

Thanks! Sowmya