guidance on memory and multithreading

GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data

GNU General Public License v3.0

guidance on memory and multithreading #208

Closed jackhump closed 3 years ago

jackhump commented 3 years ago

Hi, I'm excited to try out bambu. I'm currently working on a dataset consisting of 30 PacBio Sequel II runs, which come to 88 million FLNC reads in total. I've been struggling to find a tool that can generate a single merged transcript reference and a count matrix for all samples together, on which I could then perform downstream filtering and analysis with SQANTI and SWAN.

I'm currently getting to grips with bambu by using a couple of the smaller files, but I would like to eventually scale up to the full dataset. Could you advise on the kinds of resources I would need to request to process the dataset with bambu within a reasonable time frame (< 7 days)? For comparison, TALON and IsoQuant ended up using up to 135 GB of memory with 8-16 cores, taking between 4 and 6 days.

Thanks in advance!

cying111 commented 3 years ago

Hi @jackhump, glad to hear that you are trying out our package. For the number of samples you describe, I would suggest running bambu with 124GB of memory and 10 cores; that should allow you to process the dataset within the requested time frame.

jackhump commented 3 years ago

In the end, I managed to run bambu on all the samples by specifying 8 cores with 24GB of memory each (192GB total). The run had a peak memory usage of 184GB and took < 3.5 hours to complete. I'm very impressed by how quickly bambu managed to process my data.

I know you're currently preparing a manuscript describing the bambu method but would you be able to send me anything I can read now on how the method works?

best wishes,

Jack

jackhump commented 3 years ago

I have a few specific questions now that I'm looking through the results more closely:

1) The values in the transcript count matrix, how are these calculated? As they are decimals rather than integers, can I assume these are ML estimates?
2) I ran the GTF through SQANTI and it classifies almost no transcript as ISM (incomplete splice match) - why is that?
3) Looking at specific genes I'm noticing very few intron retention transcripts present, compared to earlier GTFs I assembled with Cupcake and TALON. Does bambu treat intron retention transcripts in a particular way?

cying111 commented 3 years ago

Hi @jackhump, it's great to hear that you have managed to run bambu on your samples. For your specific questions, here are some quick answers:

1) The count estimates are not raw read counts: we use an EM algorithm to estimate them, so the count values reported are EM estimates.
2) I am glad you brought this point up. In the bambu parameter opt.discovery there is an option, remove.subsetTx, which controls whether subset transcripts (i.e., incomplete splice match cases) are removed. By default this parameter is set to TRUE; if needed, you can always set it to FALSE.
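
To illustrate why EM estimates come out as decimals rather than integers, here is a minimal toy sketch in Python. It is not bambu's implementation (bambu is an R package and its actual model may differ); the function name em_counts, the transcript names, and the read assignments are all hypothetical. Each read is described only by the set of transcripts it is compatible with, and ambiguous reads are split fractionally according to the current abundance estimates.

```python
def em_counts(read_compat, transcripts, n_iter=200):
    """Toy EM: split each read across its compatible transcripts.

    read_compat: list of sets, one per read, naming compatible transcripts.
    Returns a dict of expected (fractional) read counts per transcript.
    """
    # Start from uniform relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    n_reads = len(read_compat)
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: assign each read fractionally, in proportion to theta.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = {t: counts[t] / n_reads for t in transcripts}
    return counts


# Hypothetical data: 5 reads unique to tx_A, 2 unique to tx_B,
# and 3 reads compatible with both.
reads = [{"tx_A"}] * 5 + [{"tx_B"}] * 2 + [{"tx_A", "tx_B"}] * 3
estimates = em_counts(reads, ["tx_A", "tx_B"])
print(estimates)
```

In this toy case the three ambiguous reads are shared between the two transcripts, so the estimates converge to roughly 7.14 for tx_A and 2.86 for tx_B: non-integer values that still sum to the 10 input reads.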

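As a rough illustration of what removing "subset transcripts" means, the following Python sketch drops any transcript whose intron chain is a contiguous, shorter sub-chain of another transcript's intron chain. This is a simplified assumption for illustration only, not bambu's actual logic: it ignores chromosome, strand, and transcript start/end positions, and the names is_subset_chain, remove_subset_transcripts, and the example coordinates are hypothetical.

```python
def is_subset_chain(candidate, reference):
    """True if candidate's introns form a contiguous, shorter run of reference's."""
    n, m = len(candidate), len(reference)
    if n == 0 or n >= m:
        return False
    return any(reference[i:i + n] == candidate for i in range(m - n + 1))


def remove_subset_transcripts(transcripts):
    """Keep only transcripts whose intron chain is not a subset of another's."""
    return {
        name: chain
        for name, chain in transcripts.items()
        if not any(
            is_subset_chain(chain, other_chain)
            for other_name, other_chain in transcripts.items()
            if other_name != name
        )
    }


# Hypothetical intron chains, given as tuples of (start, end) coordinates.
example = {
    "full": ((100, 200), (300, 400), (500, 600)),
    "ism_like": ((300, 400), (500, 600)),   # contiguous subset of "full"
    "alt_splice": ((100, 200), (300, 450)),  # different second intron
}
kept = remove_subset_transcripts(example)
print(sorted(kept))
```

Under this simplification the "ism_like" entry is discarded while "full" and "alt_splice" are kept, which mirrors why a GTF produced with subset removal enabled would contain few transcripts that SQANTI labels as ISM.
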
We are preparing a manuscript on this; if you share your email address with us, we can let you know when our preprint is out. Do let us know if you have further questions, and feel free to reach out to us by email.