Azure / doAzureParallel

An R package that lets users submit parallel workloads to Azure
MIT License

Access terminal via cluster.json? #292

Closed ctlamb closed 6 years ago

ctlamb commented 6 years ago

One of the R packages I use requires some dependencies to be installed from source. On my local machine I did this long ago and now just load the R package as normal with library(). However, I will need to install these dependencies on the VMs before I can load this package. I am thinking I will need to do this via the cluster.json file (specifically the "commandLine" property). Does this seem correct?

The package I need is 'rgdal', which requires a short command to be run in the terminal to install GDAL (see here for details). The command is: sudo apt-get update && sudo apt-get install libgdal-dev libproj-dev

Would I change the cluster.json as follows?


{
  "name": "LambCluster",
  "vmSize": "Standard_F1",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": {
      "min": 1,
      "max": 1
    },
    "lowPriorityNodes": {
      "min": 0,
      "max": 0
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/tidyverse:latest",
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": []
  },
  "commandLine": [
    "sudo apt-get update && sudo apt-get install libgdal-dev libproj-dev"
  ],
  "subnetId": ""
}

ctlamb commented 6 years ago

I can confirm that the above cluster.json does not get rgdal installed.

Result with "rgdal" loaded, appears to fail:

> rast.results <- foreach(i = 1,.packages = c("doParallel", "raster", "here", "rgdal"), github = c("Azure/doAzureParallel"), .errorhandling="pass") %dopar% {
+   
+   #doAzureParallel::setCredentials(credentials)
+   
+   return("rgdal" %in% installed.packages())
+   #return("raster" %in% installed.packages())
+ }
=============================================================================================================================================================================
Id: job20180731210257
chunkSize: 1
enableCloudCombine: TRUE
packages: 
    doParallel; raster; here; rgdal; 
githubPackages: 
    Azure/doAzureParallel; 
errorHandling: pass
wait: TRUE
autoDeleteJob: FALSE
=============================================================================================================================================================================
Submitting tasks (1/1)
Submitting merge task. . .
Job Preparation Status: Package(s) being installed..............
Waiting for tasks to complete. . .
| Progress: 100.00% (1/1) | Running: 0 | Queued: 0 | Completed: 1 | Failed: 1 |
Tasks have completed. Merging results....An error has occurred in the merge task of the job 'job20180731210257'. Error handling is set to 'stop' and has proceeded to terminate the job. The user will have to handle deleting the job. If this is not the correct behavior, change the errorhandling property to 'pass'  or 'remove' in the foreach object. Use the 'getJobFile' function to obtain the logs. For more information about getting job logs, follow this link: https://github.com/Azure/doAzureParallel/blob/master/docs/40-troubleshooting.md#viewing-files-directly-from-compute-node
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : 
  object 'results' not found

Result without rgdal, all seems to work fine:

> rast.results <- foreach(i = 1,.packages = c("doParallel", "raster", "here"), github = c("Azure/doAzureParallel"), .errorhandling="pass") %dopar% {
+   
+   #doAzureParallel::setCredentials(credentials)
+   
+   #return("rgdal" %in% installed.packages())
+   return("raster" %in% installed.packages())
+ }
=============================================================================================================================================================================
Id: job20180731211011
chunkSize: 1
enableCloudCombine: TRUE
packages: 
    doParallel; raster; here; 
githubPackages: 
    Azure/doAzureParallel; 
errorHandling: pass
wait: TRUE
autoDeleteJob: FALSE
=============================================================================================================================================================================
Submitting tasks (1/1)
Submitting merge task. . .
Job Preparation Status: Package(s) being installed............
Waiting for tasks to complete. . .
| Progress: 100.00% (1/1) | Running: 0 | Queued: 0 | Completed: 1 | Failed: 0 |
Tasks have completed. Merging results Completed.
> rast.results
$`1`
[1] TRUE

brnleehng commented 6 years ago

Hi ctlamb,

The commandLine property in the configuration runs on the host, not inside the Docker container where your R code runs, so packages installed that way are not visible to R. There are a couple of ways to fix this.

1) Find a Docker image on Docker Hub that suits your needs, such as rocker/geospatial. This image already has the rgdal package installed. For more information, see https://hub.docker.com/r/rocker/geospatial/

Replace "containerImage": "rocker/tidyverse:latest" with "containerImage": "rocker/geospatial:latest".

2) Recommended route: create your own Docker image and push it to Docker Hub. This lets you customize the image and include data and R packages at the correct versions. The trade-off is that you need to learn Docker; the benefit is a reproducible environment.
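
For option 1, the resulting cluster.json might look roughly like the sketch below. It assumes every other field from the original post stays unchanged and that commandLine is left as an empty list; the thread does not show the final file, so treat those details as assumptions rather than a confirmed configuration.

```json
{
  "name": "LambCluster",
  "vmSize": "Standard_F1",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": {
      "min": 1,
      "max": 1
    },
    "lowPriorityNodes": {
      "min": 0,
      "max": 0
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/geospatial:latest",
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": []
  },
  "commandLine": [],
  "subnetId": ""
}
```

Only the containerImage value differs from the original post's intent; the system libraries come from the image itself, so no apt-get step is needed on the host.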

Thanks, Brian

ctlamb commented 6 years ago

Amazing, thank you very much, Brian. These Docker images are great. Your "rocker/geospatial:latest" suggestion works perfectly, and it will do in the meantime while I look into the containerit package. Thanks again, much appreciated.