Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

doazureparallel failing to load on certain nodes #295

Open ctlamb opened 6 years ago

ctlamb commented 6 years ago

I'm in the middle of running a big job: 200 VMs, 800 tasks. So far 500 tasks have completed but 120 have failed. I looked into the failures and can see that the stderr.txt files for the failed tasks indicate that doAzureParallel failed to load.

stderr for failed job:

running

  '/usr/local/lib/R/bin/R --slave --no-restore --no-save --no-environ --no-restore --no-site-file --file=/mnt/batch/tasks/workitems/occpred09082018/job-1/jobpreparation/wd/worker.R --args 291 291 0 pass'

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
here() starts at /mnt/batch/tasks/workitems/occpred09082018/job-1/291/wd
Loading required package: raster
Loading required package: sp
Loading required package: survival
Loading required package: lattice
Loading required package: splines
Loaded gbm 2.1.3

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

Error in library(packageName, character.only = TRUE) : 
  there is no package called ‘doAzureParallel’
Execution halted

But then hundreds of the jobs worked, and produced the following with no errors.

running
  '/usr/local/lib/R/bin/R --slave --no-restore --no-save --no-environ --no-restore --no-site-file --file=/mnt/batch/tasks/workitems/occpred09082018/job-1/jobpreparation/wd/worker.R --args 275 275 0 pass'

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
here() starts at /mnt/batch/tasks/workitems/occpred09082018/job-1/275/wd
Loading required package: raster
Loading required package: sp
Loading required package: survival
Loading required package: lattice
Loading required package: splines
Loaded gbm 2.1.3

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

Attaching package: ‘doAzureParallel’

The following objects are masked from ‘package:snow’:

    makeCluster, stopCluster

The following object is masked from ‘package:raster’:

    getCluster

The following objects are masked from ‘package:parallel’:

    makeCluster, stopCluster

brnleehng commented 6 years ago

Hi @ctlamb

Are you installing doAzureParallel through the cluster configuration or in the foreach call?

Thanks, Brian

ctlamb commented 6 years ago

In the foreach

  rast.results <- foreach(i = 1:nrow(bp),.packages = c("doParallel", "here", "dismo", "gbm", "snow"),
                        github = c("Azure/doAzureParallel"), .errorhandling="pass",
                        .options.azure = list(enableCloudCombine=FALSE,
                                              job = job_name)) %dopar% {

This is ClusterConfig


clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_D12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = list(),
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

brnleehng commented 6 years ago

I would recommend installing the R packages at the cluster configuration level so you don't need to install them for every single job. Also, the job will not start if the start tasks of the cluster have failed.

clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_D12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = list("doParallel", "here", "dismo", "gbm", "snow"),
    "github" = list("Azure/doAzureParallel"),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

Move the doAzureParallel package name into the regular .packages vector.

  rast.results <- foreach(i = 1:nrow(bp),.packages = c("doParallel", "here", "dismo", "gbm", "snow", "doAzureParallel"), .errorhandling="pass",
                        .options.azure = list(enableCloudCombine=FALSE,
                                              job = job_name)) %dopar% {

I'll need to see the logs from the job preparation tasks of the batch node. However, getClusterFile does not work for job preparation tasks. I've created a separate issue for this.

If you have access to the Azure Batch portal, you can go to: Batch Pools > (Name of your pool) > Nodes > click on the node, then enter "/workitems//job-1/jobpreparation/stderr.txt" in the search bar.

Thanks, Brian

ctlamb commented 6 years ago

Thanks, @brnleehng this makes better sense.

I used the clusterConfig you made above (plus some debugging of my own afterwards), but it produces an error that I can confirm does not occur when I leave the packages out of the clusterConfig:

=======================================================================================================================================================================================
Name: LambRaster
Configuration:
    Docker Image: rocker/geospatial:latest
    MaxTasksPerNode: 1
    Node Size: Standard_D12_v2
cranPackages: 
    Error in cat(list(...), file, sep, fill, labels, append) : 
  argument 1 (type 'list') cannot be handled by 'cat'

brnleehng commented 6 years ago

Hi @ctlamb

It appears that when the cluster config is created programmatically, the R packages parameter takes a character vector instead of a list. I'll update the docs for clarification.

clusterConfig <- list(
  "name" = "LambRaster",
  "vmSize" = "Standard_D12_v2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 1,
      "max" = 200
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/geospatial:latest",
  "rPackages" = list(
    "cran" = c("doParallel", "here", "dismo", "gbm", "snow"),
    "github" = c("Azure/doAzureParallel"),
    "bioconductor" = c()
  ),
  "commandLine" = list()
)
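
As a rough sketch of the list versus character vector point, a small helper can flatten package lists before they go into `rPackages`. `as_pkg_vector` is a hypothetical name for illustration and is not part of doAzureParallel. Returning `NULL` for an empty input mirrors the `c()` used for `bioconductor` in the config; whether the package treats an empty entry differently is an assumption that has not been checked here.

```r
# Hypothetical helper (not part of doAzureParallel): flatten a package
# specification given as a list or vector into a plain character vector.
# Empty input yields NULL, which is what c() evaluates to.
as_pkg_vector <- function(pkgs) {
  if (is.null(pkgs) || length(pkgs) == 0) {
    return(NULL)
  }
  as.character(unlist(pkgs, use.names = FALSE))
}

# Build the rPackages entry from list-style input.
r_packages <- list(
  "cran" = as_pkg_vector(list("doParallel", "here", "dismo", "gbm", "snow")),
  "github" = as_pkg_vector(list("Azure/doAzureParallel")),
  "bioconductor" = as_pkg_vector(list())
)

# Inspect the structure: cran and github are character vectors,
# bioconductor is NULL.
str(r_packages)
```

The resulting `r_packages` could then be used as the `"rPackages"` element of the clusterConfig, so a list-style spec written by habit does not reach the config unchanged.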

Thanks, Brian

ctlamb commented 5 years ago

Awesome, this is solved, thanks!