explore parallel options to speed up

xhdong-umd commented 7 years ago

There are some computational intensive tasks in modeling stage can benefit from parallel computing.

parallel is a complex problem involved multiple layers, components, problems
we can get some speed up at least for embarrassingly parallel tasks
linux and Mac may see better performance, while windows may have more overhead and less gain.
we will try to create an uniformed approach that work across all platforms, utilize the performance gain for each platform if possible.

xhdong-umd commented 7 years ago

Parallel computing is a complex problem, and I found many articles try to make it look simple by hiding details. The examples may be very simple but I started to note lots of details once dig a little deeper. I spent several days to read and experiment extensively. The recent useR 2017 talk is a quite good high level summary.

From my experiment so far, mclapply is most simple to use and works well in Mac and shinyapps.io hosted app. However we need to have an approach that works on all platforms if possible. I'll create a windows VM to test it.

xhdong-umd commented 7 years ago

There are several methods of parallel:

mclapply, simplest to use, low overhead in theory. Only works with linux, Mac.
fork cluster, only work in linux, Mac, with more overhead but the code is almost identical with other cluster options.
sock cluster, work in linux, Mac, windows or more. This supposed to have more overhead.

Contrary to what I read, I found sock cluster is not slower compare to other 2.

I also heard that you should set the cores to physical cores - 1 because the main process is using one core already. However my test shows that I can use all the cores with better performance. I guess that's because the main process is not doing much so it doesn't hurt.

xhdong-umd commented 7 years ago

I have tested all the options in Mac and windows VM extensively.

My observation is that parallel in windows have some performance gains. In a VM with "4 cores", the speed up factor is 2.02. I believe a physical windows machine should have better results.

In Mac all options work similarly well. The differences are small and can vary a lot in different runs.

I put a RMarkdown document here. @chfleming Can you download it, knit the document and post the result in your machine? I want to compare it with my result.

UPDATE: I found knitr will not finish the report because of error of some options not available on windows. I've updated the RMarkdown to skip the non-working parts. In my home pc with 8 cores, 16 threads, 10 tasks that need 45s in serial took 12, 11, 9 secs for different cluster configurations.

Here is the result in my Mac. In summary, 10 tasks that will take 40 secs in serial took 12, 16, 12 secs in three parallel modes.

xhdong-umd commented 7 years ago

I received @chfleming 's test result in email.

In summary, the speed up from parallel in various platforms are:

Mac with 4 cores, 40/12 = 3.33
Windows with 8 cores, 45/9 = 5
Windows with 2 cores, 71/37 = 1.91

I think the speedup is quite good for these simple parallel tasks. Even the sock mode in windows also have good gains in performance. We definitely should use it.

It's tricky to find the optimal cluster size. With our individual tasks of running modeling functions on different individuals, it's best if the cluster size is equal or can divide the individual count. From my test, using cluster size of 10 with 10 individuals in a 4 core laptop still faster than using cluster size 4 or 5, contrary to common suggestions of setting cluster size to physical core count.

Of course there will be a limit of cluster size, and sometimes the individual count can be a prime number so it's impossible to find a smaller cluster size that can divide it.

I'll write a general parallel processing function that work across various platforms and implement in the app. Then we can play with the parallel settings in the app directly if needed.

From my preliminary test, the shinyapps.io server have 8 cores so we may get quite some gains in hosted mode.

xhdong-umd commented 7 years ago

Previously I thought we can use the cluster method (fork for linux/Mac, socket for windows) as the generic method, but it turned out shinyapps.io meet error with the cluster method, even the same code can run without problem in local mode.

My guess is Shiny server used forked R processes to handle user requests, so maybe the fork clusters interfered with the Shiny server and failed. I can use the mclapply method but that make the code more complex since we are maintaining 3 methods according to different platforms.

There is also no obvious reliable way to tell if the app is running locally or in shinyapps.io from within app itself. I think I have to rely on the fact that the shinyapps.io server used shiny as system user name.

The speed up in shinyapps.io is good though, from 54 s to 7.9 s.

chfleming commented 7 years ago

I've been writing some parallelized optimization code and this is what I do in generic.R to have mclapply for UNIX and lapply for Windows (ignore the fftw stuff):

`# parallel functions detectCores <- parallel::detectCores mclapply <- parallel::mclapply

.onLoad <- function(...) { if(is.installed("fftw")) { utils::assignInMyNamespace("FFT", FFTW) }

if(.Platform$OS.type=="windows") { utils::assignInMyNamespace("detectCores", function(...) { 1 }) utils::assignInMyNamespace("mclapply", function(X,FUN,mc.cores=1,...) { lapply(X,FUN,...) }) } } .onAttach <- .onLoad `

xhdong-umd commented 7 years ago

I think mclapply will become lapply with mc.cores=1 , so you can just set cores to 1 in windows, right?

if(.Platform$OS.type=="windows")
{
  detectCores <- 1
}

I'm working on a general function that use mclapply in shinyapps.io, use fork in linux and Mac, and use socket in windows. It seemed to work now but these parallel stuff are quite tricky.

xhdong-umd commented 7 years ago

I just updated the app with a testing feature.

select Buffalo data in ctmm
move to model fit page, adjust two parameters
- population is the number of individuals, click the button will fit all individuals
- subset size is the data size of each individual, bigger size need more time to fit. I'm actually just taking subset from first buffalo, so the population * size must < 3528.
the page will print the time needed for fitting them all in serial mode and parallel mode. For example with 7 population of 300, the shinyapps.io app gained 5x speed up.

You can run the app locally and test it, it should take different parallel method based on your platform. I have tested it in Mac, Windows and Shinyapps.io.

I also updated the app in another shinyapps.io account because I don't want to change the main account for testing purposes.

dracodoc commented 7 years ago

Running app in windows still has some bug. I need to sort it out tomorrow.

xhdong-umd commented 7 years ago

OK, the bug with windows is fixed now.

xhdong-umd commented 7 years ago

The performance gain of parallel in different platforms are as follows:

platform	cores	serial	parallel	performance gain
win 10 for 10 x 300 task, script	8	49	11	4.45
win 10 for 10 x 300 task, app	8	50	23	2.17
win 10 VM in Macbook 8 x 300 task, script	4	42	16	2.63
win 10 VM in Macbook 8 x 300 task, app	4	52	24	2.17
Macbook for 8 x 300 task, script	4	37	10	3.70
Macbook for 8 x 300 task, app	4	37	12	3.08
shinyapps.io hosted app, 8 x 300 task	8	54	8	6.75

There are significant overhead in the app, but the performance is still much better than before.

The generic function I wrote now work in windows, Linux/Mac, Shinyappos.io hosted mode, so I'm closing the issue.

ctmm-initiative / ctmmweb

explore parallel options to speed up #26