Open alethere opened 3 years ago
Hey @Alethere, thanks so much for doing all of this!
I just want to pop in real quick and say that I've been really busy, so haven't had a chance to check things out. I'll get around to looking at the changes.
One quick comment: Maybe it would be better to create a new function, rather than replace the multidog() function. I love that your method takes less time, but some use-cases would work better without having any corrupted lines. So we could have a new function, say parwdog() (for parallel writing updog), that a user could use for speed improvements, at the cost of possible line corruption. Let me know what you think.
Hi David,
I contacted you some time ago (November last year) suggesting a different approach to multidog: writing to files instead of outputting a data.frame. I see that since then you've changed to the future package for the parallelization management. I have not implemented the algorithm using future, but I imagine it will work the same.
My test on 100K SNPs shows the following time usage:
Function athe_small took 1.66 h
Function athe_all took 1.94 h
Function multidog took 3.15 h
Here athe_small is multidog writing only the SNP parameters (things like prop_mis that have one estimate per marker) and the genotypes; athe_all writes all possible outputs to different tables; and multidog is the original implementation.
As you can see, the improvement in run time is relatively small. I suspect memory usage should be better, as that's what I found when doing it on my own computer, although I couldn't confirm it on the computer cluster where I performed the test above (reading memory usage turns out to be more complicated than I anticipated).
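To make the write-to-file idea concrete, here is a minimal sketch of one way to do it without risking corrupted lines: each worker writes its own chunk file, and the chunks are merged afterwards. It is not the code from this PR and it does not call updog. fit_one_snp() and write_chunk() are hypothetical names, the per-SNP "fit" is a dummy value, and it uses the base parallel package rather than future.

```r
library(parallel)

# Hypothetical stand-in for a per-SNP fit (NOT the updog API).
# Returns one row with a dummy per-marker estimate.
fit_one_snp <- function(i) {
  data.frame(snp = sprintf("snp_%04d", i), prop_mis = i / 1000)
}

# Fit every SNP in one chunk and write the rows to that chunk's own file.
# No two workers ever write to the same file, so lines cannot interleave.
write_chunk <- function(chunk_id, snp_chunks, out_dir, fit_fun) {
  rows <- lapply(snp_chunks[[chunk_id]], fit_fun)
  out_file <- file.path(out_dir, sprintf("chunk_%03d.tsv", chunk_id))
  write.table(do.call(rbind, rows), out_file, sep = "\t",
              quote = FALSE, row.names = FALSE)
  out_file
}

snp_ids <- 1:20
n_workers <- 2

# Split the SNP indices into one contiguous chunk per worker.
snp_chunks <- split(snp_ids, cut(seq_along(snp_ids), n_workers, labels = FALSE))

out_dir <- tempfile("snp_chunks_")
dir.create(out_dir)

cl <- makeCluster(n_workers)
files <- parLapply(cl, seq_along(snp_chunks), write_chunk,
                   snp_chunks = snp_chunks, out_dir = out_dir,
                   fit_fun = fit_one_snp)
stopCluster(cl)

# Merge the per-chunk tables back into a single data.frame.
merged <- do.call(rbind, lapply(unlist(files), read.delim))
```

The trade-off is one extra merge step at the end, but only one chunk's results are held in a worker's memory at a time, and the output files stay intact even when workers finish simultaneously.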
Small overview of the function changes:
Let's see what you think.
Cheers, Alejandro
PS: Sorry for the delay with submitting, some other research got in the way.