Is it good to run Ratatosk on multiple long read files separately?

lileiting commented 4 years ago

Hi,

I have four long read files. I am running Ratatosk (not in reference-guided mode) on all four files together, but it is taking too long to finish. If I run Ratatosk on the four long read files separately on four computers (four nodes of a cluster), would the results be the same as running them on one machine? Furthermore, can I split the long reads into N parts, run Ratatosk on each part separately on N computers, and merge the results afterwards?

Leiting

GuillaumeHolley commented 4 years ago

Hi @lileiting,

Unfortunately no, the results won't be the same, but there are two solutions to this problem (well, one and a half). Let me explain the issue. Ratatosk has two correction passes, and in the second pass, the long reads corrected by the first pass are "mapped" to the graph built from the input short reads. If you do N separate corrections, one for each of your input long read files, each correction instance has only 1/N of the long read coverage available for the second correction pass. If that 1/N coverage is too low, the second-pass correction will be impacted. Now, I have never tried this, but if you can ensure that 1/N of your coverage is still decent (say a bare minimum of 15x, give or take), it might work just fine.
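To make the coverage arithmetic above concrete, here is a minimal sketch. The function names, the example numbers (60 Gb of long reads, a 1 Gb genome) and the assumption that reads are split into equally sized parts are all illustrative, not part of Ratatosk; the 15x default is only the rough minimum mentioned above.

```python
def per_part_coverage(total_long_read_bases, genome_size, n_parts):
    """Approximate long read coverage available to each of n_parts equal splits."""
    if genome_size <= 0 or n_parts <= 0:
        raise ValueError("genome_size and n_parts must be positive")
    return total_long_read_bases / genome_size / n_parts


def max_parts(total_long_read_bases, genome_size, min_coverage=15.0):
    """Largest number of equal splits that keeps each split at or above min_coverage."""
    total_coverage = total_long_read_bases / genome_size
    return int(total_coverage // min_coverage)


# Hypothetical data set: 60 Gb of long reads over a 1 Gb genome (60x in total).
total_bases = 60_000_000_000
genome = 1_000_000_000

print(per_part_coverage(total_bases, genome, 4))  # coverage of each of 4 parts
print(max_parts(total_bases, genome))             # splits that stay at >= 15x
```

With these made-up numbers, four parts each keep about 15x, so four separate runs would sit right at the rough minimum; splitting further would drop below it.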

The other solution is to use the reference-guided preprocessing, which was designed mostly to solve this problem. The idea is to use different machines to correct the reads that have a good mapping quality. If they have a good mapq, we're pretty sure where they anchor on the reference genome, so parallel correction is possible by processing segments of the reference genome on different machines. Then, one last correction on a single machine takes care of all the long reads which don't map well or don't map at all. It comes with a few drawbacks: it obviously works only with a reference genome of good enough quality, and it might introduce a little reference bias in your correction. But if you look at the paper, we designed it to minimize the latter point and the results are really good.
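The binning idea can be sketched roughly as follows. This is only an illustration of the principle, not Ratatosk's actual preprocessing code: the tuple layout `(read_name, contig, position, mapq)`, the `bin_reads` function, the 5 Mb segment size and the mapq threshold of 30 are all assumptions made for the example.

```python
from collections import defaultdict


def bin_reads(alignments, segment_size=5_000_000, min_mapq=30):
    """Group confidently mapped reads by reference segment.

    alignments: iterable of (read_name, contig, position, mapq) tuples,
    with contig set to None for unmapped reads (hypothetical layout).
    Returns (bins, leftover): bins maps (contig, segment_index) to read names
    that could be corrected on separate machines; leftover holds reads that
    map poorly or not at all and need one final single-machine correction.
    """
    bins = defaultdict(list)
    leftover = []
    for name, contig, position, mapq in alignments:
        if contig is None or mapq < min_mapq:
            leftover.append(name)
        else:
            bins[(contig, position // segment_size)].append(name)
    return dict(bins), leftover


# Toy example with made-up reads.
reads = [
    ("read1", "chr1", 1_200_000, 60),
    ("read2", "chr1", 7_500_000, 60),
    ("read3", "chr1", 2_000_000, 5),   # low mapq: goes to the final pass
    ("read4", None, 0, 0),             # unmapped: goes to the final pass
]
bins, leftover = bin_reads(reads)
print(bins)
print(leftover)
```

Each bin could then be handed to a different node, with the leftover reads corrected together at the end.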

Guillaume

lileiting commented 4 years ago

Hi Guillaume,

Thank you for your quick reply. I am assembling the genome de novo, so I do not have a high-quality reference right now. I will keep waiting and hope it finishes in a few days.

Leiting

DecodeGenetics / Ratatosk

Is it good to run Ratatosk on multiple long read files separately? #8