Closed: roumail closed this issue 6 years ago
@roumail thanks for bringing this up. Unfortunately I was not able to reproduce the issue. Here is my code
# Prepare cluster
workerClusterConfig <- list(
  "name" = "worker",
  "vmSize" = "Standard_F2",
  "maxTasksPerNode" = 2,
  "poolSize" = list(
    "dedicatedNodes" = list("min" = 0, "max" = 0),
    "lowPriorityNodes" = list("min" = 1, "max" = 3),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/tidyverse:latest",
  "rPackages" = list(
    "cran" = list(),
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)
workerCluster <- doAzureParallel::makeCluster(workerClusterConfig, wait = FALSE)
doAzureParallel::registerDoAzureParallel(workerCluster)
# Run job
result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
  1
}
And here is my output:
> result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
+ 1
+ }
==================================================================================================================================================================
Id: job20180124182248
chunkSize: 1
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
==================================================================================================================================================================
Submitting tasks (10/10)
Submitting merge task. . .
Waiting for tasks to complete. . .
| Progress: 100.00% (10/10) | Running: 0 | Queued: 0 | Completed: 10 | Failed: 0 |
Tasks have completed. Merging results..
Do you happen to have any other settings on your cluster? Also, can you provide more details on the error, for example when it happened? Did the foreach fail right away?
Thanks!
Hi @roumail
Can you also share your sessionInfo()?
Thanks, Brian
Hello @brnleehng, @paselem. Thank you for your responses. I have changed the title of my issue, as I've tried a few things already and now understand the problem better. The trivial job below runs fine:
# Run job
result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
1
}
However, when I try my actual workflow, where my call to foreach %dopar% looks more like this:
# Actual job
result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
  # call function that is defined in .GlobalEnv
  out <- some_func_in_global_env()
  out
}
Each of my tasks (workers) fails with the error that some_func_in_global_env was not found. I found this quite odd, since I expected that, due to forking, I wouldn't need to export variables from my global environment to the workers. However, looking at the workspace on one of my failing workers, I can clearly see that none of the functions defined in my global environment are available there. The only things defined are a few Azure-related variables (see screen shot):
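For anyone hitting the same thing, a quick way to confirm what actually reaches the workers is to have each task list the objects visible in its global environment (a diagnostic sketch, not part of the original repro):

```r
library(foreach)

# Diagnostic sketch: each task returns the names visible in its
# global environment, so you can compare against your local session.
visible <- foreach(i = 1:2, .errorhandling = 'pass') %dopar% {
  ls(envir = globalenv())
}
print(visible)
```

With doAzureParallel registered as the backend, the listing comes back with only the Azure bookkeeping variables, matching the screenshot described above.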
I saved the session info and workspace image before the call to foreach as well in case that might help. Looking forward to your response and thanks in advance for any tips and suggestions!
Rohail
Please find attached a minimal example that recreates the error. The following files are available in this zip:
The errors I receive are that the 'changepoint' package is not found during the parallel execution, or that the load_data() function from my global environment is not found.
Normally I wouldn't expect this problem, since we are 'forking', so I shouldn't need to pass my global environment via .export to foreach. These problems really occur in the 'job preparation' step.
@roumail thanks for the detailed explanation and repro steps. On the surface I'm not exactly sure what the issue is, but I have a hunch that we are not passing the environment in the way the internals of the Rmd command expect us to. My initial guess is that you are installing the changepoint package on the machine running LAUNCHER.R, but it is not available in the cluster. Is changepoint included in your DOCKERFILE?
As a side question (not related to this issue), can you give me a bit of context around why you're calling registerDoSEQ() at the end of your script?
@roumail we have identified the issues and have local fixes for them.
There were basically two issues. The first is that functions defined in your local global environment are not automatically available on the workers, so they need to be passed explicitly via the .export argument:
result <- foreach::foreach(i = 1:10,
                           .export = c('load_data')) %dopar% {
  # algorithm
}
For a complete sample of what I did, please take a look at the following
## Set up cluster
devtools::install_github('azure/doAzureParallel', ref = 'master')
library(doAzureParallel)
# set your credentials
credentials <- list(
...
)
doAzureParallel::setCredentials(credentials)
# set cluster config
workerClusterConfig <- list(
  "name" = "rou",
  "vmSize" = "Standard_F4",
  "maxTasksPerNode" = 4,
  "poolSize" = list(
    "dedicatedNodes" = list("min" = 0, "max" = 0),
    "lowPriorityNodes" = list("min" = 1, "max" = 1),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "roumail/g2n-env", # "rocker/tidyverse:latest",
  "rPackages" = list(
    "cran" = list(),
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)
workerCluster <- doAzureParallel::makeCluster(workerClusterConfig, wait = FALSE)
doAzureParallel::registerDoAzureParallel(workerCluster)
## Define Functions
load_data <- function() {
  out <- iris
  out
}
level_0 <- function() {
  result <- foreach::foreach(i = 1:10,
                             .errorhandling = 'stop',
                             .export = c('load_data')
  ) %dopar% {
    out <- load_data()
    set.seed(1)
    x <- c(rnorm(100, 0, 1), rnorm(100, 0, 10))
    ansvar <- changepoint::cpt.var(x)
    # length(ansvar)
  }
  c(result)
}
Finally, printing out the result I see the following (trimmed) output:
[[1]]
Class 'cpt' : Changepoint Object
~~ : S4 class containing 12 slots with names
cpttype date version data.set method test.stat pen.type pen.value minseglen cpts ncpts.max param.est
Created on : Wed Dec 13 15:28:54 2017
summary(.) :
----------
Created Using changepoint version 2.2.2
Changepoint type : Change in variance
Method of analysis : AMOC
Test Statistic : Normal
Type of penalty : MBIC with value, 15.89495
Minimum Segment Length : 2
Maximum no. of cpts : 1
Changepoint Locations : 100
[[2]]
...
Please keep an eye out for the #209 PR which should make it into master today.
Thanks!
Hi @paselem ,
Thanks for your responses! For the first problem, I was doing something like .export = ls(globalenv()) to get around the variable export problem. That feels a bit hacky, since if my global environment is big, it adds unnecessary overhead. Of course, in this example it's easy to see that load_data is the only variable that needs to be exported, but in general this could be difficult to figure out. I didn't expect that I needed to use the .export argument, since, if I'm not wrong, this is normally needed for PSOCK clusters and not for forks. Are there plans in the future to implement something similar to the doFuture backend, which only exports what's needed?
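For reference, the two export styles discussed here look like this (a sketch, using the load_data example function from this thread):

```r
library(foreach)

# Blanket export: ships every object in the global environment to
# each worker. Simple, but adds overhead when the environment is large.
result <- foreach(i = 1:10, .export = ls(globalenv())) %dopar% {
  load_data()
}

# Targeted export: ship only what the loop body actually uses.
result <- foreach(i = 1:10, .export = c("load_data")) %dopar% {
  load_data()
}
```

The targeted form keeps task payloads small, at the cost of having to maintain the export list by hand.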
As for the second answer, that really helps. I kept thinking there was something wrong with my image and had verified multiple times that all the packages are installed and loaded.
Thank you very much for the help. Looking forward to the latest update
Btw, devtools::install_github('azure/doAzureParallel', ref = 'master') throws an error:
Downloading GitHub repo azure/doAzureParallel@master
from URL https://api.github.com/repos/azure/doAzureParallel/zipball/master
Installing doAzureParallel
Error in data.frame(package = package, installed = installed, available = available, :
row names contain missing values
Not sure if you faced the same. I will wait! Thanks again!
@roumail are you on Windows or Linux? We have seen intermittent issues on Windows machines where you need to install 'azure/rAzureBatch' prior to installing 'azure/doAzureParallel'. Can you give that a try?
Hi, I'm on Linux. I tried installing rAzureBatch first (succeeds) but still get the same error for doAzureParallel
@roumail Two things:
We have a PR out with the changes I described above. Before we merge them in, can you try to validate it locally and make sure it works?
devtools::install_github('azure/doazureparallel', ref='feature/loadLocalRSessionInfo')
Regarding the failing install, is this happening on the same machine you were using before? If not, is this a new box? Can you give me the output of:
sessionInfo()
Actually, I just remembered we had the install issue from another customer and it was resolved by upgrading the version of devtools. https://github.com/Azure/doAzureParallel/issues/203
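In case it helps others, the workaround from #203 amounts to upgrading devtools before reinstalling the packages (a sketch of the suggested order, under the assumption that an old devtools causes the error):

```r
# Upgrade devtools first (issue #203 traced the
# "row names contain missing values" error to an outdated devtools).
install.packages("devtools")

# Then install the dependency before the main package, as suggested above.
devtools::install_github("Azure/rAzureBatch")
devtools::install_github("Azure/doAzureParallel", ref = "master")
```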
Hi @paselem,
However, I notice that if I call functions in my parallel algorithm without prefixing them with package::, the functions from that package are not found.
To reproduce the error: in your script, if you replace changepoint::cpt.var with simply cpt.var, the code should fail on any worker with the error:
[1] "argsList" "bioconductor" "cloudCombine"
[4] "enableCloudCombine" "exportenv" "expr"
[7] "github" "packages" "pkgName"
NULL
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets base
loaded via a namespace (and not attached):
[1] compiler_3.4.3
<simpleError in cpt.var(x): could not find function "cpt.var">
No traceback available
Error Code: 1
@roumail I think this is expected. Since we don't know what you have installed on your machine, you need to manually let us know by either fully qualifying the call:
my_package::my_function()
or attaching the package first:
library(my_package)
my_function()
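Using the changepoint example from earlier in the thread, the two options look like this inside the loop body (sketch):

```r
library(foreach)

# Option 1: fully qualified call; no library() needed in the loop body.
result <- foreach(i = 1:2) %dopar% {
  x <- c(rnorm(100, 0, 1), rnorm(100, 0, 10))
  changepoint::cpt.var(x)
}

# Option 2: attach the package inside the loop, then call unqualified.
result <- foreach(i = 1:2) %dopar% {
  library(changepoint)
  x <- c(rnorm(100, 0, 1), rnorm(100, 0, 10))
  cpt.var(x)
}
```

foreach also accepts a .packages argument, which loads the named packages on each worker and avoids repeating library() in the loop body.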
I receive the following error when trying to run a parallel job. I changed the job to be trivial, but I still receive the same error:
Error: Authentication failed
Message: Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly including the signature.
Authentication detail: Signature did not match. String to sign used was w 2018-01-22T13:01:42Z 2018-01-25T13:01:42Z /blob/rohinfrasandssabatch901/$root 2016-05-31
You may think that the error is with the credentials I provide, but I am able to run the Monte Carlo simulation example (https://azure.microsoft.com/en-us/blog/doazureparallel/) without any problem, so the error is probably not coming from my credentials.
Maybe I need to set some environment variables at the cluster startup or something? The code fails as soon as I get to the parallel algorithm part.
Please let me know if there is additional information you need. Any pointers would be very much appreciated.