GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

Non integer transcript counts #358

Closed rotem-a closed 1 year ago

rotem-a commented 1 year ago

Hi, I’m currently using bambu to obtain gene and transcript counts. I think it is a great package, and I have a few questions:

Running bambu(reads = samples_vector, annotations = annotation, genome = genome_file, …) I obtain gene counts that are integers, but transcript counts are not.

There could be many different ways to obtain non-integer counts. Could you please explain a bit about the method bambu uses to generate transcript counts? (For example, maybe if a read only covers half the transcript it is counted as 0.5?)

Further, and maybe related to this, I find many very small floats in the transcript count matrix (count_transcripts.txt). Have you come across this, and can you explain it?

Thanks a lot :) Rotem

N.B. Some downstream analysis tools (edgeR, for example) require raw counts rather than CPM, so the CPM_transcript.txt file can’t be used instead. Analyses that need to consider low transcript counts (or even a simple plotSmear) can give strange results unless these floats are handled (i.e., assuming these are just numeric artefacts, simply replace them: `tx_counts$counts[tx_counts$counts < 0.001] <- 0`).

gringer commented 1 year ago

Some downstream analysis tools (like edgeR for example) require raw counts rather than cpm, so the CPM_transcript.txt file can’t be used instead.

Downstream tools impose this requirement to discourage people from feeding them sample-normalised, CPM-like values, because the raw counts are needed to properly model shot noise when estimating p-values for differential expression statistics. If you're sure the data represents a non-normalised read-level statistic, then it should be okay to round the values to integers first.
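If that applies here, rounding before building the edgeR object is a one-liner. This is a minimal R sketch; the filename follows this thread, and the column layout (a non-numeric ID column followed by per-sample count columns) is an assumption:

```r
# Sketch: round bambu's fractional transcript counts to integers before
# using count-based tools such as edgeR. File/column layout is assumed.
tx_counts <- read.delim("counts_transcripts.txt", row.names = 1)
num_cols <- sapply(tx_counts, is.numeric)           # skip ID/annotation columns
tx_counts[num_cols] <- round(tx_counts[num_cols])   # integers for edgeR
```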

cying111 commented 1 year ago

Hi @rotem-a ,

We estimate transcript expression through an EM algorithm: 1) first, reads are collapsed into equivalence read classes according to the set of transcripts they are compatible with; 2) then, read classes that are compatible with multiple transcripts are probabilistically assigned to those transcripts through an iterative expectation-maximization likelihood estimation. Depending on the convergence threshold/criteria, the estimates can carry many decimal digits and are usually not integers. For more details, you can read the methodology section of our preprint here.
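The two steps described above can be sketched generically. The toy implementation below (hypothetical names `em_quantify`, `eq_classes`; not bambu's actual code) shows why EM output is fractional, and why tiny residual counts can appear on transcripts that receive almost none of the ambiguous reads:

```python
# Generic sketch of EM quantification over read equivalence classes.
# Illustrative only; this is NOT bambu's implementation.

def em_quantify(eq_classes, n_transcripts, max_iter=1000, tol=1e-8):
    """eq_classes: list of (tuple_of_compatible_transcript_ids, read_count)."""
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting abundances
    counts = [0.0] * n_transcripts
    for _ in range(max_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each class's reads among its compatible transcripts
        # in proportion to the current abundance estimates.
        for tx_ids, n_reads in eq_classes:
            denom = sum(theta[t] for t in tx_ids)
            for t in tx_ids:
                counts[t] += n_reads * theta[t] / denom
        # M-step: renormalise the fractional counts into abundances.
        total = sum(counts)
        new_theta = [c / total for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:  # convergence criterion controls residual digits
            break
    return counts  # fractional, hence generally non-integer

# 10 reads unique to transcript 0, 10 reads ambiguous between 0 and 1:
# EM pushes nearly all ambiguous reads to transcript 0, leaving a tiny
# residual float on transcript 1 (like the small values in this issue).
est = em_quantify([((0,), 10), ((0, 1), 10)], n_transcripts=2)
```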

We have seen these small floating-point numbers in the results as well, and they seem to be strongly affected by the dataset and system. In the most recent release of bambu (v3.0.8), we have implemented a fix for this by pre-setting the rounding to 5 significant digits, customizable through the argument opt.em = list(sig.digit = desired_number_of_digits). Please feel free to customise it according to your needs.
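For example, a call using that argument might look like the following R sketch (argument names as described above; the `sig.digit` value and the input objects from the original question are illustrative):

```r
# Round EM estimates to 3 significant digits (bambu >= 3.0.8, per this thread).
se <- bambu(reads = samples_vector, annotations = annotation,
            genome = genome_file,
            opt.em = list(sig.digit = 3))
```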

P.S. Thanks @gringer for the answer above. For downstream analysis, we also output the counts_transcripts.txt file, which can be used directly when raw counts are needed, for example in differential expression analysis (rounding to integers is needed, of course). The cpm_transcripts.txt file, however, is more useful for benchmarking or cross-sample analysis.

Hope I have addressed your questions!

rotem-a commented 1 year ago

This is very helpful, thank you!