GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0

Clarification on `reads` argument #425

Open cjw85 opened 2 months ago

cjw85 commented 2 months ago

The README states multiple BAM files can be provided to bambu(). Reading the entire documentation suggests the intention here is that each BAM corresponds to a separate sample to be analysed jointly ("Running multiple samples").

Is it possible to provide multiple BAMs and have them interpreted as belonging to the same sample? Or, even better, to provide a nested list-of-lists of BAMs for multiple samples?

If this is not possible directly, is there a workaround? For example, is it possible to combine the intermediate rcFiles from multiple BAMs of the same sample?

My use case here is that alignment was scaled horizontally across machines, and I would rather not aggregate the multiple BAMs from a single sample into one large BAM before giving it to bambu.

andredsim commented 2 months ago

Hi,

At this stage the `reads` argument expects each individual file to be its own sample, so your results would not be the same as if all the reads were in one BAM file. I am currently working on some changes to the arguments that would allow files to be merged within a bambu run, but that release is still a while away, so I would not wait for it.

Regarding workarounds, depending on how your BAMs are split it may be possible to combine the intermediate rcFiles without a significant change to your results. However, this is quite advanced, and I can't promise to solve any issues that may arise, as it's an unsupported use case.

With that caveat in mind, if you split the BAM file by chromosome then combining the rcFiles should work with a simple rbind(); you will need to rename each of the read classes so they all have a unique rowname. Note that there will be slight differences in transcript discovery, because bambu would train on each BAM file separately. If this is important to you, you can run the trainBambu() function on your combined intermediate rcFile to get a model trained on the full dataset, and then apply it either by rerunning bambu or by using the internal scoreReadClasses() function, providing your trained model as the defaultModels argument and setting fit = FALSE. If your BAM files are split randomly, the above method will not work without significant issues.
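As a rough illustration of the renaming step only, here is a minimal base-R sketch. The matrices are toy stand-ins and not real rcFiles (which I am assuming you would load with readRDS() and which are richer objects), and the chunk labels `chr1`/`chr2` and the `rc.N` names are hypothetical. The same pattern of prefixing row names before rbind() is what is meant above.

```r
# Toy stand-ins for per-chromosome read class tables (NOT real rcFiles).
rc_chr1 <- matrix(c(10, 4), ncol = 1,
                  dimnames = list(c("rc.1", "rc.2"), "readCount"))
rc_chr2 <- matrix(c(7, 3, 1), ncol = 1,
                  dimnames = list(c("rc.1", "rc.2", "rc.3"), "readCount"))
chunks <- list(chr1 = rc_chr1, chr2 = rc_chr2)

# Prefix each row name with its chunk label so that names such as "rc.1"
# no longer collide once the chunks are stacked.
renamed <- Map(function(rc, label) {
  rownames(rc) <- paste(label, rownames(rc), sep = ".")
  rc
}, chunks, names(chunks))

# Stack the chunks; unname() keeps the list names out of the call.
combined <- do.call(rbind, unname(renamed))

# Every read class now has a unique row name.
stopifnot(anyDuplicated(rownames(combined)) == 0)
```

Without the renaming step, rbind() on these matrices would happily keep two rows called "rc.1", which is the collision to avoid.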

Depending on whether your use case is mainly transcript discovery or quantification, and on the size of each BAM file, it may not be too detrimental to still run them as separate files in bambu and then combine the counts data at the end. The larger the individual BAM files, the better in this case.
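Combining the counts at the end could look something like the sketch below. It uses a toy matrix with made-up transcript and file names; with a real run I am assuming you would start from the counts assay of the returned object (e.g. `assays(se)$counts`) and supply your own mapping from BAM files to samples. Only raw counts should be summed this way, not normalised values such as CPM.

```r
# Toy counts: one row per transcript, one column per BAM file (made-up names).
counts <- matrix(
  c(5, 0, 2,   # sampleA_part1
    3, 1, 0,   # sampleA_part2
    8, 2, 4),  # sampleB_part1
  nrow = 3,
  dimnames = list(c("tx1", "tx2", "tx3"),
                  c("sampleA_part1", "sampleA_part2", "sampleB_part1"))
)

# Hypothetical mapping from each column (BAM file) to the sample it belongs to.
sample_of <- c("sampleA", "sampleA", "sampleB")

# rowsum() sums rows by group, so transpose, sum per sample, transpose back.
per_sample <- t(rowsum(t(counts), group = sample_of))
```

The result has one column per sample, with the counts from all of that sample's BAM files added together.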

Sorry that I can't give you a more concrete solution, but I hope this helps. If I need to clarify any of the above, please let me know.

Kind Regards, Andre Sim

cjw85 commented 2 months ago

Hi Andre,

Thank you for your detailed response, it's all very helpful. I will try some of the approaches you have suggested and, failing that, take the hit on merging BAMs upfront.