dcgerard / updog

Flexible Genotyping of Polyploids using Next Generation Sequencing Data
https://dcgerard.github.io/updog/

Improvement of multidog #12

Open alethere opened 3 years ago

alethere commented 3 years ago

Hi, I've been using updog to genotype GBS data for a whole genome as part of my PhD thesis, and thus I have very big datasets (normally on the order of 300K variants, though one of them happened to be 1.8M variants). At the time I started genotyping, multidog was not implemented, so I wrote my own parallel implementation of flexdog.

Initially I had major issues with memory and time efficiency on my computer cluster, which I managed to solve with some nifty tricks. Now I saw that there's the new multidog, and I've been doing some tests (not done yet) in which it seems that my approach is almost 1000x more memory-efficient and 40x faster than the current multidog implementation. According to profvis readings, running 5K markers on 25 cores, multidog took 11332 MB and 28520 ms where my implementation took 11.6 MB and 730 ms. I attach the profvis object for you to check (to view it, unzip the file, load it into R using result <- readRDS(), then call print(result) after loading the profvis library, and select the "Data" tab, not the "Flame Graph" tab). updog_test_profiling.zip
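In code, those viewing steps look roughly like this (the .rds filename is just a placeholder for whatever file is inside the zip):

library(profvis)
result <- readRDS("updog_test_profiling.rds")  # placeholder name; use the file extracted from the zip
print(result)  # renders the profvis viewer; select the "Data" tab, not "Flame Graph"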

I'd be happy to collaborate on the code, but it would imply a major rewrite of how multidog works. Is that okay? Should I just open a pull request? (I haven't used GitHub a lot, so a bit of guidance would be helpful.)

Cheers, Alejandro Thérèse Navarro

dcgerard commented 3 years ago

This looks really cool!

If I am reading the profiling correctly, it seems most of the memory allocation is happening in the following lines?

inddf <- do.call("rbind", lapply(outlist, function(x) x[[1]]))
snpdf <- do.call("rbind", lapply(outlist, function(x) x[[2]]))

Which occur here:

https://github.com/dcgerard/updog/blob/6a23e6e613cd23a0858ff781210c0e141e2494a6/R/multidog.R#L364

In that case, maybe data.table::rbindlist() would be a quick way to correct the memory issues? I'm not sure how much it would help with speed, though.
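As a rough sketch, assuming outlist keeps its current structure (a list of per-SNP results whose first element is the individual data frame and whose second is the SNP data frame), the change would be:

inddf <- data.table::rbindlist(lapply(outlist, function(x) x[[1]]))
snpdf <- data.table::rbindlist(lapply(outlist, function(x) x[[2]]))
# rbindlist() is much more memory-efficient than base rbind() for long lists of
# data frames; it returns a data.table, so wrapping in as.data.frame() may be
# needed if downstream code expects plain data frames.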

I would love to see your solution! If you want to give GitHub a try, then you could fork the updog repo, modify the repo, and submit a pull request: https://guides.github.com/activities/forking/

I would suggest just adding a file with your functions, rather than modifying multidog(), so that we can add a unit test checking that the two produce the same output. We can later rename the current version of multidog() to something like multidog_old(). A lot of functions depend on the output being formatted the way multidog() returns it (a list with two data frames, inddf and snpdf), so I would like to keep the output format the same.
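As a rough sketch of such a unit test, assuming your implementation lands as a function called multidog_new() (a hypothetical name) and using the uitdewilligen example data that ships with updog:

library(testthat)
library(updog)
data("uitdewilligen", package = "updog")
fit_old <- multidog(refmat = t(uitdewilligen$refmat),
                    sizemat = t(uitdewilligen$sizemat),
                    ploidy = uitdewilligen$ploidy,
                    model = "norm")
fit_new <- multidog_new(refmat = t(uitdewilligen$refmat),  # hypothetical new implementation
                        sizemat = t(uitdewilligen$sizemat),
                        ploidy = uitdewilligen$ploidy,
                        model = "norm")
test_that("new implementation matches multidog()", {
  expect_equal(fit_new$snpdf, fit_old$snpdf)
  expect_equal(fit_new$inddf, fit_old$inddf)
})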

If you prefer collaborating by email, that's fine too!

Thanks! David

alethere commented 3 years ago

Nice, I'll prepare a pull request then. My code is adapted to my own use at the moment, so I need to generalize it before adding it, but that shouldn't take too long.

Indeed, it seems that it's the list modification, rather than the parallel loop itself, that's taking up most of the memory and time. I have some ideas on how to handle these issues, at least for very large datasets. I'll send you an e-mail with the ideas; let's see what you think.

Cheers, Alejandro