Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

Unsatisfactory Performance Due To Overhead? #217

Closed (unitroot closed this issue 6 years ago)

unitroot commented 6 years ago

We are in the midst of evaluating Azure for our company to perform resource-intensive ML tasks. Thus far, the performance achieved has fallen short of our expectations. The benchmarking example below takes 18 minutes to complete on 51 F_16 machines (816 cores = 3264 workers!), while on my local 7-core laptop it takes 15 minutes to run. Needless to say, this is very disappointing.

Azure Test Environment

require(doAzureParallel)

yyy <- rpois(1000, 2)
xxx <- data.frame("N" = rnorm(1000), "T" = rt(1000, 5, 2), "B" = rbinom(1000, 1, 0.3), "W" = rweibull(1000, pi))

cluster <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cluster)
foreach::getDoParWorkers()

modCtrl <- caret::trainControl(method = "repeatedcv", number = 15, repeats = 15, search = "random")

ela <- Sys.time()
modTune <- caret::train(x = xxx, y = yyy, method = "svmRadial", tuneLength = 15, trControl = modCtrl)
ela <- Sys.time() - ela
ela

doAzureParallel::stopCluster(cluster)

The code creates 3375 tasks, which take about 15 minutes to submit, 3 seconds to calculate, and 3 minutes to merge. With task submission clearly being the bottleneck, I tried different chunk sizes. The best performance, achieved with setChunkSize(100), was approx. 8 minutes. So while I increased computing power by a factor of roughly 117, I could barely reduce the run time by a factor of 2.
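
For reference, the chunk size in these tests was set right after registering the backend; a minimal sketch (the value 100 corresponds to the best run mentioned above):

# continuing the script above, after registerDoAzureParallel(cluster)
doAzureParallel::setChunkSize(100)  # bundle 100 foreach iterations into each submitted task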

Is there something I am missing? Is this a result of excessive overhead?

paselem commented 6 years ago

Hi @unitroot,

Thanks for bringing this up. You've identified some good points of feedback and we can go through some of this now.

  1. Submitting tasks takes 15 minutes: We have had a few other customers report the same issue, and we have addressed it in the latest master branch of doAzureParallel. Please make sure you install from the master branch to get these changes, as they have not been merged into an official release yet:

    devtools::install_github('azure/doazureparallel')

    The reason behind this was that we used to upload a lot of duplicate data as part of the loop. We have now updated the package to upload the shared data only once, and then have each iteration upload only the bit of data it needs for its own execution. This has shown massive performance improvements in submitting jobs. In the example you posted above, with a chunk size of 100, I was able to submit a job in just a few seconds.

  2. Fast calculation times: I tried running your code snippet above with a chunk size of 100 and found that the average execution time was roughly 1 minute per chunk. That makes sense to me, since you're actually doing 100 calculations per chunk (so each calculation is about 0.6 seconds).

  3. Several minutes to merge: This is similar to the first issue I pointed out and something we are working on. Today, we have a single merging thread that handles merging all the output. We are working on a more distributed merge, which should help improve performance; it is expected to ship roughly by the end of the month. That said, when running your example above with a chunk size of 100, the merge seemed to take only a few seconds. A general rule of thumb is that if you have very many tasks (low chunk size), the merge task often takes a lot longer, since it has to aggregate many small results rather than fewer big ones.

  4. Excessive overhead: I also noticed one other huge time sink when submitting tasks, which was the installation of caret on each node. The output in R looks like this:

    Job Preparation Status: Package(s) being installed............

    In my case it took 5 minutes for caret to get installed on each of the workers before work could start there, and this is a significant contributor to the total wall clock time. One important change you can make to reduce this to essentially zero is to use a docker image that has caret pre-installed, so you don't need to wait for each node to install it. We have some documentation on how to do this, but I would be more than happy to help out here, since this is one of our core samples and we would benefit from having an optimized docker container for it. A quick search online also showed that someone has already done this here. You could simply try updating your cluster configuration file (cluster.json) and set the containerImage field accordingly:

    ...
    "containerImage": "jrowen/dcaret",
    ...

    (NOTE: I have not tested the above, it is just a point of reference to follow or try out yourself)

Overall, I was able to run your same loop in about 13 minutes on 5 F2's with chunk size = 100 and maxTasksPerNode = 2 (i.e. a total of 10 workers).

I think this would be trivial to reduce to 8 minutes with a simple docker image change, and probably improve further by adding more workers and tweaking the chunk size. As a rule of thumb, a good starting point to optimize your chunk size is

chunkSize = totalTasks / workers

# so in your large cluster
chunkSize = 3375 / 3264 = ~1

# Since the result is ~1 I would actually recommend a smaller cluster 
# to get the chunk size close to 2 - 5.
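
Applied to your numbers, a minimal sketch of that rule in R (the 3375 task count is from your example, and setChunkSize() is how you would register the result before calling caret::train):

# Sketch: derive a starting chunk size from the rule above and register it.
totalTasks <- 3375                        # tasks produced by your caret example
workers    <- foreach::getDoParWorkers()  # workers of the currently registered cluster
chunkSize  <- max(1, ceiling(totalTasks / workers))
doAzureParallel::setChunkSize(chunkSize)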

Hopefully there are some changes there that can immediately improve your wall clock time (and use a smaller cluster). Please feel free to reach out with any other questions - we're happy to help out.

paselem commented 6 years ago

@unitroot - quick update:

I tried using the docker image I mentioned above and got this to run much more quickly, in ~4.4 minutes. See the configuration and outputs below (note the chunk size is 100 for simplicity). As expected, skipping the caret install step on each node significantly improves the overall run time.

Cluster config (5 F2's with 2 workers each)

{
  "name": "caret2",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 2,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 5,
      "max": 5
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "jrowen/dcaret",
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": []
  },
  "commandLine": []
}

Output

=====================================================================================================
Id: job20180208172240
chunkSize: 100
enableCloudCombine: TRUE
packages: 
    caret; 
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
=====================================================================================================
Submitting tasks (34/34)
Submitting merge task. . .Job Preparation Status: Package(s) being installed.

Waiting for tasks to complete. . .
| Progress: 100.00% (34/34) | Running: 0 | Queued: 0 | Completed: 34 | Failed: 0 ||
Tasks have completed. Merging results..
> ela <- Sys.time() - ela
> ela
Time difference of 4.385262 mins

My expectation is that having a total of 34 workers in this cluster would probably provide a near optimal run time.

Edit: Scaling the cluster to 17 F2's with 2 workers each improved the time again, down to ~1.4 minutes:

=====================================================================================================
Id: job20180208182839
chunkSize: 100
enableCloudCombine: TRUE
packages: 
    caret; 
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
=====================================================================================================
Submitting tasks (34/34)
Submitting merge task. . .Job Preparation Status: Package(s) being installed.

Waiting for tasks to complete. . .
| Progress: 100.00% (34/34) | Running: 0 | Queued: 0 | Completed: 34 | Failed: 0 |
Tasks have completed. Merging results.
> ela <- Sys.time() - ela
> ela
Time difference of 1.406119 mins

unitroot commented 6 years ago

@paselem Thank you very much. I was able to improve the time considerably. However, even with more computing power than you, I was still clocking in at 3 minutes.

I assume tweaking the setup is the way to go from here:

{ "name": "xxx", "vmSize": "Standard_F16", "maxTasksPerNode": 64, "poolSize": { "dedicatedNodes": { "min": 1, "max": 1 }, "lowPriorityNodes": { "min": 15, "max": 15 }, "autoscaleFormula": "QUEUE" }, "containerImage": "jrowen/dcaret", "rPackages": { "cran": [], "github": [], "bioconductor": [] }, "commandLine": [] }

I am trying to come up with an equation that optimizes the setup. I started with something like this:

time = (n_tasks / n_chunks) / (n_vm * cpu_speed * (1 / n_workers) * X)

What I am not yet clear about is the relationship between chunkSize, number of virtual machines, and number of workers.

In your documentation, it is stated that the maximum number of workers per machine is 4 times the number of cores. In terms of optimization, do you think it is better to have more workers doing more tasks simultaneously, or fewer workers with bigger chunk sizes? Is this relationship linear? Would 10 VMs with 2 cores and 1 worker each and chunkSize 160 take the same time to complete 3200 tasks as 10 VMs with 8 cores and 4 workers each and chunkSize 10? Or are there benefits to using larger CPUs, more workers, etc.?

paselem commented 6 years ago

Hi @unitroot,

We are usually hesitant to provide too much guidance on the VM size to chunk size to maxTasksPerNode ratios, since the algorithms that run on our platform vary significantly. The amount of CPU vs. RAM vs. the size of the output are the main things that influence the total run time.

If you have a CPU-heavy algorithm, you want to set maxTasksPerNode to the number of CPUs on the node so that each process can fully utilize the core it is running on. If you set it higher than that, multiple R processes will compete for CPU time and probably end up context switching and slowing down the overall run time (which is what I suspect is happening in your case).

For RAM-heavy processes, we recommend setting maxTasksPerNode to 1 (or to totalRam / expectedRamRequirements) so that each worker can get as much of the expected RAM as possible; in this case, the CPU count really doesn't matter. We've seen cases where people simply leave several cores on the machine idle because their process requires lots of RAM.

Finally, the size of the outputs should also be a consideration. I mentioned this before, but on average, fewer larger outputs tend to be more efficient than many smaller ones due to our current merging logic. Since the merge is currently limited to a single thread, downloading fewer, larger files far outperforms downloading lots of small ones.
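
To make that concrete, here is an illustrative sketch of that sizing logic; the RAM figures are placeholders, not measurements from your workload:

# Illustrative only: pick maxTasksPerNode from the node's resources and the job profile.
coresPerNode <- 16   # e.g. Standard_F16
nodeRamGb    <- 32   # RAM on a Standard_F16
taskRamGb    <- 4    # placeholder: estimated peak RAM used by one R worker

# CPU-bound work: one worker per core, so R processes do not compete for CPU time.
maxTasksCpuBound <- coresPerNode

# RAM-bound work: only as many workers as fit in memory (possibly leaving cores idle).
maxTasksRamBound <- max(1, floor(nodeRamGb / taskRamGb))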

Given that, we simply cannot recommend a single algorithm that solves all scenarios. If your algorithm is CPU bound, I would try to match the number of tasks to the number of cores as closely as possible, or to a multiple of it, i.e. chunkSize = tasks / cores. So, if the maximum number of cores you want to spin up is 100, try to set your chunk size so that the number of tasks comes out to 100 or 200 or 300 (and so on). What this will do is have most tasks start and finish at roughly the same time, so no single set of tasks is "holding up" the job.
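
As a rough illustration, using the 100-core figure above and the 3375 iterations from your example:

# Sketch: pick a chunk size so the resulting task count is close to a whole
# multiple of the core count, so cores finish at roughly the same time.
totalIterations <- 3375
cores           <- 100
waves           <- 3                                          # aim for ~3 chunks per core
chunkSize       <- ceiling(totalIterations / (cores * waves)) # 12
numTasks        <- ceiling(totalIterations / chunkSize)       # 282, i.e. roughly 3 per core
doAzureParallel::setChunkSize(chunkSize)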

Edit: Another rule of thumb is that fewer, larger machines are typically better than many smaller ones. Once you have found your 'optimal' settings, having 1 F16 is typically better than 2 F8s or 4 F4s (from a raw performance perspective), because the network bandwidth on larger machines tends to scale up better and the OS overhead is amortized over more cores. That said, we do usually think that having more than 1 or 2 machines in the cluster is a good idea in case one of them goes down for some reason.

unitroot commented 6 years ago

Thank you for your help, I will play around with the settings a bit!

paselem commented 6 years ago

@unitroot closing this as it seems to have been addressed. Please feel free to reopen if you find any other performance related issues.