speed up overlap, integrate parallel to ctmm

xhdong-umd commented 6 years ago

@chfleming I tried to add a parallel option for overlap, since I thought it needs to calculate many combinations of animals and these should be independent of each other.

I added a branch parallel_overlap and made the calculation parallel over animal combinations. However, I'm not seeing any speed improvement with bigger data sets and more combinations.

Only after profiling the function did I realize that the most time-consuming part is overlap calculating the akde of the telemetry objects.

I'm wondering if these lines can be added to the example in the overlap help, since users may just follow the example and not realize that they should reuse existing home range objects when available.

AKDE <- akde(buffalo[1:2], CTMM = FITS)

# AKDE overlap between these two buffalo
overlap(AKDE,FITS)

# or you can use the telemetry objects directly,
# and overlap will calculate the AKDE first automatically,
# which can take some time, so reusing an existing AKDE is a good idea.
overlap(buffalo[1:2],FITS)

So the actual overlap calculation never takes much time, and there is no need to parallelize it, right?

Though I'm wondering if you want to integrate the generic parallel functions into ctmm. At least the often used FITS <- lapply(1:2, function(i) ctmm.fit(buffalo[[i]],GUESS[[i]]) ) can be parallelized.

I noticed the previously exported par_fit_tele actually uses ctmm.select, not ctmm.fit. I renamed it to par_try_models, and wrote a new par_fit_models as the parallel version of ctmm.fit. Thus the line above can be written as par_fit_models(buffalo[1:2]).

If you feel it's useful to use parallel in more places inside ctmm, we can move the parallel functions into ctmm so it doesn't depend on ctmmweb, since the parallel functions only require the parallel package.

chfleming commented 6 years ago

With overlap, if you calculate the AKDE UDs beforehand, then that part takes all of the time. I can update the help file example as you suggest.

As far as parallelizing akde() goes, it shouldn't be parallelized by default because of RAM issues. People sometimes run out of RAM with akde and occurrence even without parallelization (and have to lower their resolutions). But it could be parallelized optionally with an argument mc.cores=1 or something. I have been parallelizing some functions with mclapply, as it has very low overhead (though it requires UNIX). I should look at your parallel wrapper again and we can discuss at the next ABI meeting.

xhdong-umd commented 6 years ago

I didn't use parallel mode on akde in the app, because I want to calculate all the animals in one akde call so they will be on the same grid. I did use parallel mode on occurrence.

The parallel wrapper uses mclapply on Linux/Mac and parLapplyLB (a socket cluster mode with more overhead) on Windows. It also generates some default parameters so users don't need to know about them.
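
As a rough illustration of the platform switch described above (a minimal sketch, not the actual ctmmweb wrapper; the name `par_lapply_sketch` and the default of 2 cores are assumptions made for the example):

```r
library(parallel)

# Hypothetical wrapper: forked workers on Unix-alikes, socket cluster on Windows.
par_lapply_sketch <- function(x, fun, cores = 2L) {
  if (.Platform$OS.type == "windows") {
    # Socket cluster: each worker is a fresh R session, so start-up costs more.
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl), add = TRUE)
    parLapplyLB(cl, x, fun)
  } else {
    # Forked workers share the parent's memory, so overhead is low.
    mclapply(x, fun, mc.cores = cores)
  }
}

squares <- par_lapply_sketch(1:4, function(i) i^2)
```

Either branch returns a plain list in the same order as the input, so the caller does not need to know which backend ran.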

chfleming commented 6 years ago

Looking at par_lapply (if I am reading it right), there needs to be an extra mc.cores-like argument for a scientific computing context where jobs might run on a cluster where you don't know what nodes you are going to get (and how many cores they have), but you do know how many processing units you are allotted. The reserved_cores argument could then kick in if the mc.cores argument is left undefined/blank.

xhdong-umd commented 6 years ago

The actual cluster size/mc.cores parameter is calculated with some heuristics based on my own experiments. I wanted it to be as simple as possible for users, so they don't have to worry about the details.

If the input list length is n and the number of available cores is m, the parallel functions will create a cluster with cluster_size threads, and run the n jobs in that cluster.

A CPU can have 2 logical cores per physical core (hyperthreading), but that doesn't make much difference; the really meaningful number is the physical core count.
For example, if the CPU has 4 cores, I still found that a cluster with 8 threads works better than one with 4 threads, because more threads allow better load balancing, and that benefit outweighs the overhead they create.
In practice, I just set the upper limit of cluster_size to be
- the logical core count on Windows, which can be 2 * physical cores
- 4 * the physical core count on Linux/Mac
- or the actual length of the input list. We don't need a cluster bigger than the input list.
In other words, I just set the cluster size equal to the input list size, if that is not too big for the available CPU cores.
All of the above may use up all the cores, so reserved_cores will override them if provided.
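
The upper-limit rule above could be sketched roughly as follows. This is an illustration under stated assumptions, not the ctmmweb code: the function name `cluster_size_sketch` and its argument names are hypothetical, and the handling of reserved_cores is left out because its exact behavior is not spelled out here.

```r
library(parallel)

# Hypothetical helper: cluster size = input length, capped by a platform limit.
cluster_size_sketch <- function(n_jobs,
                                physical_cores = detectCores(logical = FALSE),
                                logical_cores = detectCores(logical = TRUE),
                                windows = .Platform$OS.type == "windows") {
  # detectCores() may return NA when the count is unknown; fall back to 1.
  if (is.na(physical_cores)) physical_cores <- 1L
  if (is.na(logical_cores)) logical_cores <- physical_cores
  # Windows: logical core count; Linux/Mac: 4 * physical core count.
  upper <- if (windows) logical_cores else 4L * physical_cores
  # Never build a cluster bigger than the input list.
  min(n_jobs, upper)
}

# 5 animals on a 4-core, 8-thread Linux machine: one worker per animal.
size_small <- cluster_size_sketch(5, physical_cores = 4, logical_cores = 8,
                                  windows = FALSE)
# 100 animals on the same hardware under Windows: capped at the logical count.
size_capped <- cluster_size_sketch(100, physical_cores = 4, logical_cores = 8,
                                   windows = TRUE)
```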

xhdong-umd commented 6 years ago

The reason for setting cluster_size to the list length:

If we have 5 objects in the list and 4 physical cores, creating a 4-core cluster means the 5th task needs to wait for a 2nd round, and then only 1 core is used in that round.
If we create a 5-core cluster, even though there are only 4 physical cores, the jobs are distributed across the cores evenly, so it still takes less time than the configuration above.
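
The arithmetic behind this can be checked with a toy model. It assumes equal-length tasks and ideal time-slicing by the operating system, and it ignores context-switch and memory overhead, so it is only a back-of-the-envelope sketch; `makespan_sketch` is a hypothetical name.

```r
# Toy model: total wall time for n equal tasks, in units of one task's run time.
makespan_sketch <- function(n_tasks, workers, cores, task_time = 1) {
  rounds <- ceiling(n_tasks / workers)   # batches needed to get through the list
  busy <- min(workers, n_tasks)          # workers running at once in a full round
  slowdown <- max(1, busy / cores)       # > 1 when workers outnumber cores
  rounds * task_time * slowdown
}

four_workers <- makespan_sketch(n_tasks = 5, workers = 4, cores = 4)  # 2 rounds
five_workers <- makespan_sketch(n_tasks = 5, workers = 5, cores = 4)  # 1 slower round
```

Under these assumptions the 4-worker cluster needs 2 task-lengths of wall time, while the 5-worker cluster needs 1.25, which matches the observation that the oversubscribed cluster finishes sooner.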

xhdong-umd commented 6 years ago

@chfleming, reading your comment again, now I understand what you mean. Sometimes the code runs on a cluster where only a subset of all cores is available to the user, while detect_cores only reports the total cores in the machine. We will definitely need a cores parameter for this.

chfleming commented 6 years ago

Following up on our discussion, I think separating plapply into two functions, one to parallelize given a fixed core/threads argument and another to detect the hyperthread count, would be ideal for me. I would then want to merge this with my own mclapply and detectCores wrappers that I use for safe mclapply without Windows complaining. I will put in an argument to switch between cases where overhead is undesirable and vanilla lapply is used on Windows, and cases where overhead is insignificant and your socket code is used on Windows. That will cover all of my use cases internally.

xhdong-umd commented 6 years ago

I found it's really difficult to extract the core detection part as a separate function, because it involves the platform, input list size, reserved core value, etc. Abstracting them out of the function would mean passing all those parameters in and out, and some if/else branches still couldn't be avoided.

So I want to use cores = NULL, and call my core detection code for the default NULL value.

This should not interfere with your detectCores wrappers, because you can just assign the cores parameter from your own function; as long as it generates a positive integer, it will be used directly.

I'm also thinking of allowing negative values, where -m means reserving m cores for the user, so there is no need for an additional parameter.

And I already have a parallel = TRUE parameter, which makes the function use lapply when given parallel = FALSE.

We could also make the function use lapply when cores = 1, but I think it's better to have an explicit control parameter.

xhdong-umd commented 6 years ago

I have updated the function to implement the cores parameter.

cores: the core count to be used for the cluster. Could be a positive integer, or:

- The default NULL value, which indicates a heuristic value based on the detected cores, roughly min(input_size, physical_cores_count * n), with n being 2 on Windows and 4 on Mac/Linux. See ?parallel::detectCores for more information on physical/logical cores on different platforms.
- A negative value like -2, which will use all available cores - 2, so that 2 cores are reserved for the user's other tasks.

So you can call the function with cores = 4, cores = -1 or cores = your own function(), parallel = FALSE etc.
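
The three cases for cores could be resolved along these lines. This is a sketch under assumptions rather than the actual function: `resolve_cores_sketch` and its argument names are hypothetical, "all available cores" is taken to mean the count reported by parallel::detectCores(), and a floor of 1 worker is added for safety.

```r
library(parallel)

# Hypothetical resolver for the cores argument: NULL, positive, or negative.
resolve_cores_sketch <- function(n_jobs, cores = NULL,
                                 physical_cores = detectCores(logical = FALSE),
                                 available_cores = detectCores(),
                                 windows = .Platform$OS.type == "windows") {
  # detectCores() may return NA when the count is unknown; fall back to 1.
  if (is.na(physical_cores)) physical_cores <- 1L
  if (is.na(available_cores)) available_cores <- physical_cores
  if (is.null(cores)) {
    # Default heuristic: min(input_size, physical_cores_count * n).
    n <- if (windows) 2L else 4L
    return(min(n_jobs, physical_cores * n))
  }
  if (cores == 0) stop("cores must be a non-zero integer or NULL")
  if (cores < 0) {
    # -m reserves m cores for the user; keep at least one worker (assumption).
    return(max(1L, available_cores + cores))
  }
  # A positive value is used directly.
  cores
}

default_size  <- resolve_cores_sketch(5, physical_cores = 4,
                                      available_cores = 8, windows = FALSE)
reserved_size <- resolve_cores_sketch(10, cores = -2, physical_cores = 4,
                                      available_cores = 8)
fixed_size    <- resolve_cores_sketch(10, cores = 4, physical_cores = 4,
                                      available_cores = 8)
```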

One special requirement for the function is that you need to use align_list for functions with more than one parameter.

Will this serve your needs?

chfleming commented 6 years ago

That sounds good. I will work this into ctmm and have you critique how I mangle it.

xhdong-umd commented 6 years ago

I forgot to mention that the function uses the crayon package for colorful console messages. You can either replace it with regular message calls, or import crayon if you like it.

chfleming commented 6 years ago

I don't want to have messages on the command line here without a trace option passed, but I do need to go through the package and use crayon to differentiate various messages and warnings.

xhdong-umd commented 6 years ago

OK, we can add a parameter to control the messages, like msg=TRUE.

chfleming commented 6 years ago

Don't worry about it, I have completely different needs for the command line package and am restructuring for that anyhow. The webapp needs to print out basically everything and I understand that.

xhdong-umd commented 6 years ago

OK, since we are making individual copies in each package, maybe it's actually easier for us to just maintain different versions, and only sync some core parts if needed.

chfleming commented 6 years ago

Alright, I incorporated the basic code here: https://github.com/ctmm-initiative/ctmm/blob/master/R/parallel.R and tested in Windows. I will test in Linux tomorrow.

@NoonanM The mc.cores arguments are now all changed to just cores to be more general.

xhdong-umd commented 6 years ago

Looks good to me. The core code is just about the cluster and environment setup, and we have different needs on cores count or default mode.

I think the current setup is ideal: we can keep different versions, and just share the core code like the cluster/environment part.

xhdong-umd commented 6 years ago

@chfleming, do you need to integrate other functions into ctmm? We talked about the group plot function for variograms, though I think you said users can just use the ctmmweb version?

If there are no more changes needed, I'll update the package website to reflect recent updates.

chfleming commented 6 years ago

Yeah, if you're going to be updating those functions, then I think it's best that users have the latest versions. Parallelization was the only thing that I needed internally.

xhdong-umd commented 6 years ago

OK. I'll update the website, also I'm looking at possible points that can be included in the paper.
