Hi @simon-tarr, it looks like you have hit the recent breaking change head on. My guess is that when you're using doAzureParallel you're not specifying a version, so you're pulling the latest. The easiest and quickest fix is to pin to a known-good version (the last version we had for CentOS was v0.5.0), which should get you up and running as before.
library(devtools)
devtools::install_github('azure/doAzureParallel', force = TRUE, ref = 'v0.5.0')
library(doAzureParallel)
Now, the latest change is actually quite significant and I suggest taking a look into it and trying it out if you have the time. What we have done is package up the R runtime in Docker containers. The main benefit is that we can quickly and seamlessly update the version of R we use for doAzureParallel, but more importantly you can do it as well: by pointing doAzureParallel at a Docker image, we simply run on top of that image. The major usability improvement is that you can do a lot of testing on your local machine using that container, and it will behave the same in the cloud environment as it does locally. There is a slight downside in that you'll need to learn a bit about containers, but we do ship a pretty complete default container that is packed full of packages. It unfortunately does not include netcdf, but we can definitely look into packaging that up as part of the default image. I'll do some additional testing today and see about posting some more detailed information for you.
Do you have a small snippet of code I could use to validate?
Hello, thank you for the reply.
I have already tried running my 'old' cluster.json and R script using doAzureParallel v0.5.0 and rAzureBatch v0.5.1 (point 1 above) but I get the error "unknown input format". As I say, I have changed absolutely nothing on the machine I ran this from (I've been away for the past 5 days so haven't had access to it).
I have installed the last CentOS version as per your instructions above and get the following back:
Job Summary:
Id: job2017xxxxxx
Job Preparation Status: Package(s) being installed
Waiting for tasks to complete. . .
|===============| 100%
unknown input format
With regards to containers: this does sound like a welcome improvement. I don't know much about them apart from the very basics but I will do some research this evening to figure it out in more detail. If it is possible to package some libraries such as netcdf, that would be fantastic. It's a fairly standard data format and I'm sure many users would benefit from it as Azure and the package grow in popularity. Full information on the libraries can be found here: http://www.unidata.ucar.edu/software/netcdf/docs/index.html
With regards to code: I don't have any that I can easily share, I'm afraid (this really goes against all R reproducibility!). The sticking point at the moment really seems to be these netcdf libraries failing to install... if there's anything you can do on your end to install them, that would be great. I believe these are the relevant libraries for Ubuntu: https://packages.ubuntu.com/search?suite=xenial&searchon=names&keywords=netcdf I've tried many different combinations of these libraries and they all fail to install (using the new Debian/Ubuntu setup). Perhaps you can try running the cluster.json file from my original post? I have copied it verbatim and it used to work.
That is very interesting... Do you know which version of doAzureParallel you were using previously? I'm a bit surprised that you're getting that error since those branches were stable for about a month. Might be worth trying v0.4.3 but that is just a shot in the dark from my end. I will try testing this out on a container though and see how far I can get.
I do. I wrote them down when I realised things stopped working so that I could roll back and forth to try and get things working. I was using doAzureParallel 0.5.0 and rAzureBatch 0.5.1. It's so bizarre that it has stopped working. When I saw the change log I assumed that even though I was using an old version of the packages that it would still boot up the new Ubuntu images. From what you've said above, it sounds like that's not the case?
For extra information: I have run the script locally and it works fine (I don't get the error). The error I get from doAzureParallel makes it sound like it can't read the .nc files that go into the model for whatever reason.
This change introduces two breaking changes that are affecting you.
I did change yum to apt-get but the start task always fails when I do this.
I did read some documentation here on commandline behaviour within containers but it was a little thin on the ground and my technical knowledge started failing me at that point! Would I need to somehow tell each node in the cluster to install the prerequisite libraries?
That is the whole purpose of the container actually. You would pre-package everything once, test it locally, and when you're happy with it, you would tell doAzureParallel to use that container, entirely removing the need to use the start task.
That makes sense - I guess it will greatly improve cluster boot times as well?
I'm currently reading this: https://github.com/Azure/doAzureParallel/blob/a6e51c964ec12bcf8c488ef94dee34b6d8f8be58/docs/30-customize-cluster.md
I require my loop to access some files within a blob - could the "unknown input format" error be an access issue, with the nodes not being able to reach the necessary files?
Boot times will vary. Downloading the container takes a little bit of time (maybe a minute or two depending on the node size you're using - bigger nodes have better network bandwidth and can download much faster) but overall if you're installing multiple packages it should be a net win.
The blob access should happen within the container (I'm guessing you're using some of the doAzureParallel and rAzureBatch methods to access the data, right?).
Yes, I'm using the doAzureParallel/rAzureBatch methods. Specifically, I create some resource files ahead of making my cluster and then tell doAzureParallel::makeCluster where my files are with the 'resourceFiles' argument.
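A minimal sketch of what I'm doing (the storage URL and file name below are placeholders, not my real data):
library(doAzureParallel)

# placeholder storage container URL and file name, for illustration only
azureStorageUrl <- "https://<myaccount>.blob.core.windows.net/<mycontainer>"

resource_files <- list(
  rAzureBatch::createResourceFile(
    url = paste0(azureStorageUrl, "/input_climate.nc"),
    fileName = "input_climate.nc"
  )
)

cluster <- doAzureParallel::makeCluster("cluster.json", resourceFiles = resource_files)
doAzureParallel::registerDoAzureParallel(cluster)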
As my understanding of docker containers is quite limited at the moment, I was hoping you'd be able to help me with a quick question...
I found a geospatial docker image here: https://hub.docker.com/r/rocker/geospatial/~/dockerfile/ Could I simply set "containerImage": "rocker/geospatial:latest" in my cluster.json file to start this container up? I really hope so because it contains all the geospatial libraries that I need!
:). Yes. That is the point. As long as it has R installed, you should be good to go. I'm taking a quick look and it is built on top of rocker/verse:devel (which is one of the base images in our default) so this should work just fine from our end.
Well fingers crossed...I'm booting a pool now with this!
Oh, and a quick note - just like with our versioning, you may want to lock yourself to a version to ensure compatibility. In Docker the ':latest' tag will always pull whatever the newest stable version is, so you may want to use rocker/geospatial:3.4.1 or rocker/geospatial:3.4.2 instead. Note, you can find the available versions in the Tags tab.
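In cluster.json that would look something like this (assuming you go with the 3.4.1 tag):
"containerImage": "rocker/geospatial:3.4.1"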
Excellent, thank you. And yes, if I can get it all working then I'm absolutely locking in a version. Presumably I can specify additional packages not contained within the docker image via the usual method of
"rPackages": {
"cran": ["raster","dismo","ncdf4"],
"github": []
},
I have managed to boot up a pool using rocker/geospatial. However, I am still getting the error. I have been thinking about how data is distributed, with regards to the instructions held here: https://github.com/Azure/doAzureParallel/blob/master/docs/21-distributing-data.md
Is pre-loaded data handled in the same fashion when using docker containers? For example, the 'old' method (pre-version 0.6.0) meant that files were kept in "$AZ_BATCH_NODE_STARTUP_DIR/wd". However, the instructions here (https://github.com/Azure/doAzureParallel/blob/master/docs/30-customize-cluster.md) specify that files are kept in "AZ_BATCH_NODE_ROOT_DIR".
If files are now kept in the new location, how do I tell my resource files to be put in ROOT_DIR as opposed to STARTUP_DIR? I hope that makes sense!
EDIT - I have checked out the merge-result.rds file which is generated after the failed run (error: "unknown input format"). This file says that "The specified resource does not exist". I'm now fairly confident that the new docker image is working (the ncdf4 package installed successfully according to stderr.txt) but that the container doesn't know how to handle pre-uploaded files when using 'createResourceFile'. I suspect that it's something to do with the two environment variables I specified above but I cannot be sure.
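For reference, I inspected it with something like this (assuming merge-result.rds has been downloaded locally from the job's output container):
# read the merged job result file generated after the failed run
merged <- readRDS("merge-result.rds")
str(merged, max.level = 2)  # the "The specified resource does not exist" message appears in here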
I have also tried updating the commandLine in cluster.json with the instructions at the bottom of this page (https://github.com/Azure/doAzureParallel/blob/master/docs/30-customize-cluster.md) but I cannot get this to work either.
Any insight would be great.
The behavior didn't change. I think there may be some confusing docs in there, though, which I will update, and I definitely missed adding a note about the startup dir, so I can add that soon as well. I ran the code below, which is a subset of one of our samples, and it worked - it successfully listed all the files in the startup/wd directory.
azureStorageUrl <- "http://playdatastore.blob.core.windows.net/nyc-taxi-dataset"
resource_files <- list(
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-1.csv"), fileName = "yellow_tripdata_2016-1.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-2.csv"), fileName = "yellow_tripdata_2016-2.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-3.csv"), fileName = "yellow_tripdata_2016-3.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-4.csv"), fileName = "yellow_tripdata_2016-4.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-5.csv"), fileName = "yellow_tripdata_2016-5.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-6.csv"), fileName = "yellow_tripdata_2016-6.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-7.csv"), fileName = "yellow_tripdata_2016-7.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-8.csv"), fileName = "yellow_tripdata_2016-8.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-9.csv"), fileName = "yellow_tripdata_2016-9.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-10.csv"), fileName = "yellow_tripdata_2016-10.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-11.csv"), fileName = "yellow_tripdata_2016-11.csv"),
rAzureBatch::createResourceFile(url = paste0(azureStorageUrl, "/yellow_tripdata_2016-12.csv"), fileName = "yellow_tripdata_2016-12.csv"))
# set your credentials (credentialsFileName and clusterFileName are paths to your own JSON files)
library(doAzureParallel)   # attaches foreach so %dopar% is available
doAzureParallel::setCredentials(credentialsFileName)
cluster <- doAzureParallel::makeCluster(clusterFileName, resourceFiles = resource_files)
doAzureParallel::registerDoAzureParallel(cluster)
res <-
foreach::foreach(i = 1:2) %dopar% {
# List the resources files on the node in the startup directory.
list.files(paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd"))
}
Result
> res
[[1]]
[1] "cluster_setup.sh" "install_bioconductor.R" "yellow_tripdata_2016-1.csv" "yellow_tripdata_2016-10.csv" "yellow_tripdata_2016-11.csv"
[6] "yellow_tripdata_2016-12.csv" "yellow_tripdata_2016-2.csv" "yellow_tripdata_2016-3.csv" "yellow_tripdata_2016-4.csv" "yellow_tripdata_2016-5.csv"
[11] "yellow_tripdata_2016-6.csv" "yellow_tripdata_2016-7.csv" "yellow_tripdata_2016-8.csv" "yellow_tripdata_2016-9.csv"
[[2]]
[1] "cluster_setup.sh" "install_bioconductor.R" "yellow_tripdata_2016-1.csv" "yellow_tripdata_2016-10.csv" "yellow_tripdata_2016-11.csv"
[6] "yellow_tripdata_2016-12.csv" "yellow_tripdata_2016-2.csv" "yellow_tripdata_2016-3.csv" "yellow_tripdata_2016-4.csv" "yellow_tripdata_2016-5.csv"
[11] "yellow_tripdata_2016-6.csv" "yellow_tripdata_2016-7.csv" "yellow_tripdata_2016-8.csv" "yellow_tripdata_2016-9.csv"
Can you try something similar to verify that your files are where you expect them to be?
I can confirm that the necessary files are being found when I run the above script.
Given that they are in the correct location and none of my other code has changed, can you think of any other possible reasons why my loop is no longer running? I have combed through the code I'm using from GitHub (installed via cluster.json) and there's no error trap which returns the error "unknown input format". I can only assume therefore that doAzureParallel is generating this message.
I am happy to try anything at the moment; without this working I basically can't finish my PhD!
I've carried out some more troubleshooting this morning. Specifically, I told the foreach loop to install from GitHub via github='simontarr/NicheMapR' as per the instructions at https://github.com/Azure/doAzureParallel/blob/master/docs/20-package-management.md
The good news when doing this is that the "unknown input format" error disappears. The bad news is that it's replaced with a new one, "argument is of length zero". I'm currently in the process of trying to determine which argument is supposed to be of length zero but not having too much luck at present.
EDIT - So I have managed to find the error logs from the failed loop. The message is always: "unable to find function micro_global". This is one of the key functions within the package. Why it cannot find it anymore, I do not know. The issue is present whether I install the package via startup (cluster.json) or when I install via the foreach loop. Here is a copy of the first failed task:
[1] "argsList"           "bioconductor"       "cloudCombine"
[4] "enableCloudCombine" "exportenv" "expr"
[7] "github" "packages" "pkgName"
NULL
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base

other attached packages:
[1] ncdf4_1.16   dismo_1.1-4  raster_2.5-8 sp_1.2-4

loaded via a namespace (and not attached):
[1] Rcpp_0.12.10    grid_3.3.3      lattice_0.20-35

<simpleError in eval(expr, envir, enclos): could not find function "micro_global"> No traceback available
<simpleError in eval(expr, envir, enclos): could not find function "micro_global"> No traceback available
<simpleError in eval(expr, envir, enclos): could not find function "micro_global"> No traceback available
<simpleError in eval(expr, envir, enclos): could not find function "micro_global"> No traceback available
Error Code: 1
I also installed three packages via .packages (ncdf4, raster and dismo). These appear under 'other attached packages', yet the GitHub-installed 'NicheMapR' does not appear. Here is the code I used for this loop:
testloop <- foreach(i = 1:nrow(obs.points), .options.azure = opt,
                    .packages = c('raster', 'dismo', 'ncdf4'),
                    github = 'mrke/NicheMapR',
                    .errorhandling = c('pass')) %dopar% {
  micro <- micro_global(loc = obs.points[i, ])
}
UPDATE - EVERYTHING IS NOW WORKING. Here's the situation...
Prior to doAzureParallel v0.6.0 it was necessary to install the netcdf libraries via the commandLine in cluster.json, and I also had to install all of my CRAN and GitHub packages that way. When running my loop, I did not have to specify any CRAN packages with the .packages() argument in foreach, but the loop would not run unless I specified .packages("NicheMapR"). In other words, under the old system .packages() had to contain the name of any package installed from GitHub.
Since v0.6.0 and the transition to docker containers I have had to do the following:
1. Find a docker image which contains all the necessary geospatial libraries. This was found at https://hub.docker.com/r/rocker/geospatial/~/dockerfile/
2. Boot up a pool without specifying any packages to install (CRAN or GitHub).
3. Install all necessary packages via the .packages() and 'github=' arguments in foreach.
4. Include library(NicheMapR) within my foreach loop (see the sketch after this list). Without this line of code, the loop cannot locate any of the functions from the package installed via GitHub. Previously this was not necessary for packages installed within cluster.json on start up (if the package was specified in .packages(), the loop knew how to locate the functions).
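A minimal sketch of the pattern that ended up working for me (obs.points and opt come from my own script):
library(doAzureParallel)  # attaches foreach

testloop <- foreach(i = 1:nrow(obs.points),
                    .options.azure = opt,
                    .packages = c('raster', 'dismo', 'ncdf4'),
                    github = 'mrke/NicheMapR',
                    .errorhandling = 'pass') %dopar% {
  library(NicheMapR)  # needed so the loop can find functions from the GitHub-installed package
  micro_global(loc = obs.points[i, ])
}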
If I install my required packages within cluster.json, the loop won't work (various error messages including "unknown input format" and "argument is of length zero"). Even if I specify which packages are required within the loop, it still fails. I obviously don't know how you guys have coded this but my guess would be that GitHub packages are acting up.
@simon-tarr that is great news. Thanks for the detailed updates. Everything you've said there is accurate and I will spend time today updating the documentation. Unfortunately our breaking change to the command line behaviour did affect you and required a code change...
The way you are using .packages() and github on the foreach is 100% valid. If those requirements do not change between runs, I would recommend packaging them into your own docker container for performance reasons. You would benefit from the container having everything you need, and not having to wait for the dependencies to install each time you run the foreach loop, which could add up over time. I would be interested in creating one for you - it would be an excellent example for us to add to our documentation. Is the following still a valid snippet of your foreach loop?
testloop<-foreach(i = 1:nrow(obs.points), .options.azure = opt,
.packages=c('raster','dismo','ncdf4'),
github='mrke/NicheMapR',
.errorhandling=c('pass')) %dopar% {
micro<-micro_global(loc=obs.points[i,])
}
EDIT The packages not installing correctly at the cluster level looks like a bug to me. I will also look into that.
Hey @paselem thanks for the reply. I thought the packages not installing at cluster level was potentially a bug as it was working as expected a few weeks ago.
The above code snippet is valid, yes. I provisioned that cluster with rocker/geospatial:3.3.3. The package 'raster' is already contained within geospatial but the other packages aren't, so a container containing 'dismo' and 'ncdf4' would certainly be useful. Do you know whether packages on GitHub can be handled in a similar fashion to those on CRAN? Is it possible to create a docker container with non-CRAN packages?
Hey @simon-tarr - yes, you can put anything you want into your docker image. Below I have copy-pasted a dockerfile I created for your purposes. I would recommend taking some time to read up on how to publish your docker image to Docker Hub after you've built it; then you can reference it in doAzureParallel and your runtimes should improve.
Here is the docker file. I created it in a directory called 'geospatial'
FROM rocker/geospatial:3.3.3
RUN Rscript -e 'install.packages(c("raster", "dismo", "ncdf4"))'
RUN Rscript -e 'devtools::install_github("mrke/NicheMapR")'
Once you have that (and docker installed) you can simply run
docker build geospatial -t <username>/geospatial:3.3.3
# (typically people prefix the image name with their Docker Hub username - similar to GitHub projects)
Then you can view the images you've built by running
docker images
Finally, you can test it out locally by running the command below. It will drop you into an R command prompt where you can try out some basic commands. This is the EXACT same environment that doAzureParallel would use if you pointed at this container in your cluster.json.
docker run --rm -it <username>/geospatial:3.3.3 R
Here I simply tested out loading the libraries you had in your foreach loop.
# The following would be inside the foreach loop
library(raster)
library(dismo)
library(ncdf4)
library(NicheMapR)
sessionInfo()
# do something else interesting
And this is the output. Note all libraries loaded successfully.
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NicheMapR_1.1.3 ncdf4_1.16 dismo_1.1-4 raster_2.5-8
[5] sp_1.2-4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 grid_3.3.3 lattice_0.20-35
Once you're satisfied with your container, you can push this to docker hub using
docker push <username>/geospatial:3.3.3
Finally, you can reference that container in your cluster.json file
{
  "name": "workstation",
  "vmSize": "Standard_F16",
  "maxTasksPerNode": 16,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 8,
      "max": 8
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "<username>/geospatial:3.3.3",
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": [],
    "githubAuthenticationToken": ""
  },
  "commandLine": []
}
Hello @paselem, thanks very much for the detailed instructions. I appreciate the time you've taken to help me out here, beyond the support for doAzureParallel and I'm very grateful. I will work on creating my own docker container today for use in my analyses; I'll let you know how I get on.
@simon-tarr, I was not able to reproduce the error where packages are not available if they are referenced in the cluster file. Here is my setup:
Cluster.json
{
  "name": "package_management",
  "vmSize": "Standard_A2_v2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 1,
      "max": 1
    },
    "autoscaleFormula": "QUEUE"
  },
  "rPackages": {
    "cran": ["xml2"],
    "github": ["azure/rAzureBatch"],
    "bioconductor": ["GenomeInfoDb", "IRanges"],
    "githubAuthenticationToken": ""
  },
  "commandLine": []
}
Code
# setup (credentials are assumed to already be set via doAzureParallel::setCredentials())
library(doAzureParallel)  # attaches foreach so %dopar% is available
cluster <- doAzureParallel::makeCluster("Cluster.json")
registerDoAzureParallel(cluster)

# run simple foreach loop
output <- foreach(i = 1:1) %dopar% {
  library(IRanges)
  library(xml2)
  x <- sessionInfo()
  return(x)
}
# print output
output
Output
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel methods stats graphics grDevices utils
[8] datasets base
other attached packages:
[1] xml2_1.1.1 IRanges_2.12.0 S4Vectors_0.16.0
[4] BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] compiler_3.4.2 Rcpp_0.12.13
Note xml2 and IRanges are successfully loaded into the environment.
Hello,
Could someone please provide me with some information on the changes which have happened with the switch to Debian from CentOS? All of my previously working config files and scripts are now broken (again!). I am getting one of two problems, apparently depending on which Azure package versions I'm running:
1. All nodes boot without error and I am left with all nodes at 'idle', as expected (see the cluster.json file below). However, after running a task which previously ran with the exact same settings/code/package versions, I get an object returned with the error "unknown input format" (doAzureParallel v0.5.0, rAzureBatch v0.5.1).
2. I get 'start task failed' on every node in the pool since these backend OS changes. However, I receive no start task failure within R - unless I log into portal.azure.com, I'm none the wiser that all start tasks failed. This is using doAzureParallel v0.6.0 and rAzureBatch v0.5.3 (installed today [7th Nov 2017]).
This was my previous (working!) cluster.json file. It has been used in both points 1 & 2 above with no changes. There have also been no changes to my R script.
{
  "pool": {
    "name": "workstation",
    "vmSize": "Standard_F16",
    "maxTasksPerNode": 16,
    "poolSize": {
      "dedicatedNodes": {
        "min": 0,
        "max": 0
      },
      "lowPriorityNodes": {
        "min": 8,
        "max": 8
      },
      "autoscaleFormula": "QUEUE"
    }
  },
  "rPackages": {
    "cran": ["raster","dismo","ncdf4"],
    "github": ["simon-tarr/NicheMapR"]
  },
  "commandLine": [
    "yum --assumeyes install -y netcdf",
    "yum --assumeyes install -y netcdf-devel"
  ]
}
I have no idea what's causing this new error. It seems that every point release brings me quite substantial problems... I'm really struggling to get my work done as every time I boot up R to start an analysis, something has been changed and breaks. It's also costing me rather a lot of money continually booting up pools to carry out tests.
I suspect it's my commandLine which is causing the issue but I cannot be sure. I have changed the commands from "yum --assumeyes install -y netcdf" to "apt-get install netcdf" but the install fails with an error in stderr.txt.
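Something along these lines might be closer to the Debian/Ubuntu equivalent (I'm assuming libnetcdf-dev and netcdf-bin are the right package names there, and that apt-get needs -y to run non-interactively), though I haven't been able to confirm it yet:
"commandLine": [
    "apt-get update",
    "apt-get install -y libnetcdf-dev netcdf-bin"
]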
Any help would be appreciated. Thank you.