dcgerard / updog

Flexible Genotyping of Polyploids using Next Generation Sequencing Data
https://dcgerard.github.io/updog/

Implemented a different version of the multidog parallel loop #17

Open alethere opened 3 years ago

alethere commented 3 years ago

Hi David,

I contacted you some time ago (November last year) suggesting a different approach to multidog: writing results to files instead of returning a data.frame. I see that since then you've switched to the future package to manage the parallelization. I have not implemented the algorithm using future, but I imagine it will work the same way.

My test on 100K SNPs shows the following time usage:

Function athe_small took 1.66 h
Function athe_all took 1.94 h
Function multidog took 3.15 h

Here athe_small is multidog writing only the SNP parameters (things like prop_mis that have one estimate per marker) and the genotypes; athe_all writes all possible outputs to separate tables; and multidog is the original implementation.
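To make the write-to-file idea concrete, here is a minimal base-R sketch rather than the code in this pull request. The names fit_marker(), write_fits() and the mean_ratio column are hypothetical stand-ins: fit_marker() only mimics the shape of a per-marker fit (one SNP-level value plus one genotype per individual) and does no real genotyping.

```r
# Hypothetical stand-in for the per-marker fit. It is NOT updog's model; it
# only returns output of the same shape: one per-marker summary and one
# genotype per individual.
fit_marker <- function(ref, size, ploidy) {
  ratio <- ref / size
  list(
    mean_ratio = mean(ratio, na.rm = TRUE),  # placeholder SNP-level value
    geno       = round(ploidy * ratio)       # placeholder genotype calls
  )
}

# Fit markers one at a time and append each result to disk, so the full
# result never has to be held in memory as one large data.frame.
write_fits <- function(refmat, sizemat, ploidy, snp_file, geno_file) {
  # Write the header lines once.
  writeLines(paste("snp", "mean_ratio", sep = "\t"), snp_file)
  writeLines(paste(c("snp", colnames(refmat)), collapse = "\t"), geno_file)
  for (snp in rownames(refmat)) {
    fit <- fit_marker(refmat[snp, ], sizemat[snp, ], ploidy)
    # One line per marker in the SNP-level table.
    cat(snp, "\t", fit$mean_ratio, "\n", sep = "",
        file = snp_file, append = TRUE)
    # One line per marker in the genotype table.
    cat(paste(c(snp, fit$geno), collapse = "\t"), "\n", sep = "",
        file = geno_file, append = TRUE)
  }
  invisible(c(snp = snp_file, geno = geno_file))
}

# Toy input: 2 markers x 2 individuals, read depth 4 everywhere.
refmat  <- matrix(c(1, 4, 2, 0), nrow = 2,
                  dimnames = list(c("snp1", "snp2"), c("ind1", "ind2")))
sizemat <- matrix(4, nrow = 2, ncol = 2, dimnames = dimnames(refmat))
files   <- write_fits(refmat, sizemat, ploidy = 4,
                      snp_file = tempfile(fileext = ".tsv"),
                      geno_file = tempfile(fileext = ".tsv"))
```

The loop here is sequential on purpose; how the real functions distribute markers over workers is not described in this thread.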

As you can see, the improvement in run time is relatively small. I suspect memory usage should be better, as that's what I found when running it on my own computer, although I couldn't confirm it on the computer cluster where I performed the test above (reading memory usage turned out to be more complicated than I anticipated).

Small overview of the function changes:

Let's see what you think.

Cheers, Alejandro

PS: Sorry for the delay with submitting, some other research got in the way.

dcgerard commented 3 years ago

Hey @Alethere, thanks so much for doing all of this!

I just want to pop in real quick and say that I've been really busy, so haven't had a chance to check things out. I'll get around to looking at the changes.

One quick comment: Maybe it would be better to create a new function, rather than replace the multidog() function. I love that your method takes less time, but some use-cases would work better without any risk of corrupted lines. So we could have a new function, say parwdog() (for parallel writing updog), that a user could choose for the speed improvement, at the cost of possible line corruption? Let me know what you think.
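One possible way to keep file output while avoiding corrupted lines is sketched below. It assumes the corruption comes from several workers appending to the same file at once, which this thread does not state explicitly. The helpers write_chunk() and merge_parts() are hypothetical and are not part of updog: each chunk of markers writes to its own file, and the parts are merged only after every chunk has finished.

```r
# Each chunk of markers gets its own output file, so no two workers ever
# write to the same file.
write_chunk <- function(chunk_id, snps, out_dir) {
  part <- file.path(out_dir, sprintf("part_%03d.tsv", chunk_id))
  # Placeholder payload: a real worker would write fitted values here.
  writeLines(paste(snps, chunk_id, sep = "\t"), part)
  part
}

# Concatenate the part files, in order, under a single header line.
merge_parts <- function(parts, out_file) {
  lines <- unlist(lapply(parts, readLines))
  writeLines(c("snp\tchunk", lines), out_file)
  invisible(out_file)
}

snps    <- paste0("snp", 1:6)
chunks  <- split(snps, rep(1:3, each = 2))  # 3 chunks of 2 markers each
out_dir <- tempfile("parts_")
dir.create(out_dir)

# lapply() keeps the sketch dependency-free; because the chunks share no
# files, a parallel apply could be substituted for it.
parts  <- lapply(seq_along(chunks),
                 function(i) write_chunk(i, chunks[[i]], out_dir))
merged <- merge_parts(parts, file.path(out_dir, "all.tsv"))
```

The trade-off is an extra merge pass and temporary files on disk, so whether this keeps the reported time savings would need to be measured.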