GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
190 stars 24 forks source link

Transcripts identified by Bambu have a CPM or count value of 0 in all samples #441

Open alexyfyf opened 4 months ago

alexyfyf commented 4 months ago

similar to this issue #422 , but I still have that after upgrading to 3.6.0.

> assays(oldse1)$count[grepl('BambuTx', rownames(assays(oldse1)$count)),] %>% rowSums() %>% sort %>% head
BambuTx580 BambuTx639 BambuTx201 BambuTx283 BambuTx643 BambuTx703 
         0          2          3          3          3          3 
> assays(se1)$count[grepl('BambuTx', rownames(assays(se1)$count)),] %>% rowSums() %>% sort %>% head
BambuTx583 BambuTx736 BambuTx662 BambuTx671 BambuTx739 BambuTx665 
         0          2          3          3          3          4 

The oldse1 is generated using bambu 3.3.4, the se1 is generated using bambu 3.6.0 The total number of transcripts in each version is similar: 253736 in old and 253738 in new. The command I used is exactly the same (including gtf and fa files).

bambu(reads = bam.files, 
             annotations = bambuAnnotations, 
             stranded = T, 
             genome = fa.file, ncore = 32)

In some other data, I have seen more than 100 BambuTx with 0 counts across all samples. Is there any thing I should check?

Thank you so much. Alex

cying111 commented 4 months ago

Hi @alexyfyf

Thanks for using Bambu and spotting this issue. We did observe 0 count estimates for novel transcripts.

There are two scenarios that this might happen: 1) the novel transcripts located at regions where multiple genes overlapping by some degree, and there are novel transcripts identified for each of these genes with very small differences, during read assignment, reads will be multi-assigned to these reads, but during quantification, only assignment to one gene will be kept, in this case, the novel transcript from the other gene will get no counts 2) the novel transcript is a subset of another novel transcript which could differ or not differ in shared junctions, in such case, the longer novel transcript will get all the read counts, while the subset one will not get counts

Hope this clarified your question and let us know if you spot cases not belonging to any of these scenarios, we will then further investigate on it

Thank you Warm regards, Ying

ritatam commented 3 months ago

Hello Ying @cying111

I had the same issue as Alex. Thanks for your explanations, I'm glad I found them as they match what I have observed. I'm working on a fungus that has overlapping gene annotations, some of which might be spurious predictions. It did seem that the transcripts are counted towards the longer ones, like you described here.

Here's my example - based on this alignment I suspect the gene 011336 might be a spurious annotation, but bambu counted the transcripts towards the longer 011336, while 011337 got close to zero.

image

A quick dirty workaround for me was to switch from transcriptome to CDS coordinates as "exons" (highlighted in yellow), so the reads are not captured by 011336 "non-coding exons".

Before Pst104E137_011336 32 31 24 34 Pst104E137_011337 3 4 2 0

After Pst104E137_011336 1 0 0 0 Pst104E137_011337 34 35 26 34

My command was bambu(reads=bam, annotations=anno, genome=fa, discovery=FALSE, opt.discovery = list(min.txScore.singleExon = 0))

I don't think this is an error, but more of a behavioural issue. It would be great if bambu could have an in-built option for this to allow some flexibility. Hope this will help with the investigation, if any!

Much appreciated! Rita