Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

Workers don't contain functions defined in .GlobalEnv, and .Rprofile from container image is ignored #205

Closed roumail closed 6 years ago

roumail commented 6 years ago

I receive the following error when trying to run a parallel job. I reduced the job to something trivial, but I still hit the same issue:

result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
    1
}

Error: Authentication failed

Message: Server failed to authenticate the request. Make sure the value of the Authorization header is formed correctly including the signature.

Authentication detail: Signature did not match. String to sign used was w 2018-01-22T13:01:42Z 2018-01-25T13:01:42Z /blob/rohinfrasandssabatch901/$root 2016-05-31

You may think the error lies with the credentials I provide, but I am able to run the Monte Carlo simulation example (https://azure.microsoft.com/en-us/blog/doazureparallel/) without any problem, so the error is probably not coming from my credentials.

Maybe I need to set some environment variables at cluster startup? The code fails as soon as I reach the parallel part of the algorithm.

Please let me know if there is additional information you need. Any pointers would be very much appreciated.

paselem commented 6 years ago

@roumail thanks for bringing this up. Unfortunately I was not able to reproduce the issue. Here is my code:


# Prepare cluster
workerClusterConfig <- list(
  "name" = "worker",
  "vmSize" = "Standard_F2",
  "maxTasksPerNode" = 2,
  "poolSize" = list(
    "dedicatedNodes" = list("min" = 0, "max" = 0),
    "lowPriorityNodes" = list("min" = 1, "max" = 3),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/tidyverse:latest",
  "rPackages" = list(
    "cran" = list(),
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

workerCluster <- doAzureParallel::makeCluster(workerClusterConfig, wait = FALSE)
doAzureParallel::registerDoAzureParallel(workerCluster)

# Run job
result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
  1
}

And here is my output:

> result <- foreach(i = 1:10, .errorhandling = 'stop') %dopar% {
+   1
+ }
==================================================================================================================================================================
Id: job20180124182248
chunkSize: 1
enableCloudCombine: TRUE
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
==================================================================================================================================================================
Submitting tasks (10/10)
Submitting merge task. . .
Waiting for tasks to complete. . .
| Progress: 100.00% (10/10) | Running: 0 | Queued: 0 | Completed: 10 | Failed: 0 |
Tasks have completed. Merging results..

Do you happen to have any other settings on your cluster? Also, can you provide more details on the error, for example when it happened? Did the foreach fail right away?

Thanks!

brnleehng commented 6 years ago

Hi @roumail

Can you also share your sessionInfo()?

Thanks, Brian

roumail commented 6 years ago

Hello @brnleehng, @paselem. Thank you for your responses. I have changed the title of this issue, as I've tried a few things already and now understand the problem better:

(screenshot attached)

I also saved the session info and workspace image before the call to foreach, in case that might help. Looking forward to your response, and thanks in advance for any tips and suggestions!

Rohail

roumail commented 6 years ago

Please find attached a minimal example that recreates the error; the files are in the zip below.

The errors I receive are either that the 'changepoint' package is not found during parallel execution, or that the load_data() function from my global environment is not found.

Normally I wouldn't expect this, since we are 'forking': I shouldn't need to pass my global environment to foreach via .export. These problems really occur in the 'job preparation' step.

azure_test.zip
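
To illustrate the expectation, here is a minimal sketch (my own illustration, not part of the zip): with a local fork backend, globals are visible to the workers without .export.

library(doParallel)
registerDoParallel(cores = 2)                       # forked workers on Linux/macOS
load_data <- function() iris                        # defined only in .GlobalEnv
res <- foreach(i = 1:2) %dopar% nrow(load_data())   # works: forks share the global env
# On a doAzureParallel cluster the workers are remote nodes, so the same loop
# fails unless load_data is exported explicitly.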

paselem commented 6 years ago

@roumail thanks for the detailed explanation and repro steps. On the surface I'm not exactly sure what the issue is, but I have a hunch that we are not passing the environment in the way the internals of the Rmd command expect us to. My initial guess is that you are installing the changepoint package on the machine running LAUNCHER.R, but it is not available in the cluster. Is changepoint included in your DOCKERFILE?
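
In practice that means the image build itself has to install the package, so it exists on every node and not just on the machine running LAUNCHER.R. A sketch of the R step a Dockerfile RUN line would execute at build time (illustrative, not the actual contents of your Dockerfile):

# Run at image build time, e.g. via: RUN R -e "install.packages('changepoint')"
install.packages("changepoint", repos = "https://cloud.r-project.org")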

As a side question (not related to this issue), can you give me a bit of context around why you're calling registerDoSEQ() at the end of your script?

paselem commented 6 years ago

@roumail we have identified the issues and have local fixes for them.

There were basically two issues:

  1. doParallel doesn't do the best job of resolving function dependencies. You need to pass them into the foreach call manually, using the .export argument, so that they are available on the cluster:

result <- foreach::foreach(i = 1:10,
                           .export = c('load_data')) %dopar% {
  # algorithm
}

  2. doAzureParallel had an issue where the resolution of packages at package-install time and at script-run time was inconsistent. In your specific case, the 'changepoint' package was already available in your image in a path that was read by the package-installer code (defined in ~/.RProfile) but not by the run-time code. We have a fix, #209, out for review to address the issue.

For a complete sample of what I did, please take a look at the following

## Set up cluster
devtools::install_github('azure/doAzureParallel', ref = 'master')
library(doAzureParallel)

# set your credentials
credentials <- list(
  ...
)
doAzureParallel::setCredentials(credentials)

# set cluster config
workerClusterConfig <- list(
  "name" = "rou",
  "vmSize" = "Standard_F4",
  "maxTasksPerNode" = 4,
  "poolSize" = list(
    "dedicatedNodes" = list("min" = 0, "max" = 0),
    "lowPriorityNodes" = list("min" = 1, "max" = 1),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "roumail/g2n-env", #"rocker/tidyverse:latest",
  "rPackages" = list(
    "cran" = list(),
    "github" = list(),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

workerCluster <- doAzureParallel::makeCluster(workerClusterConfig, wait = FALSE)
doAzureParallel::registerDoAzureParallel(workerCluster)

## Define Functions
load_data <- function() {
  out <- iris
  out
}

level_0 <- function() {
  result <- foreach::foreach(i = 1:10,
                             .errorhandling = 'stop',
                             .export = c('load_data')
  ) %dopar% {

    out <- load_data()

    set.seed(1)
    x <- c(rnorm(100, 0, 1), rnorm(100, 0, 10))
    ansvar <- changepoint::cpt.var(x)
    #length(ansvar)
  }
  c(result)
}

Finally, printing out the result I see the following (trimmed) output:

[[1]]
Class 'cpt' : Changepoint Object
       ~~   : S4 class containing 12 slots with names
              cpttype date version data.set method test.stat pen.type pen.value minseglen cpts ncpts.max param.est 

Created on  : Wed Dec 13 15:28:54 2017 

summary(.)  :
----------
Created Using changepoint version 2.2.2 
Changepoint type      : Change in variance 
Method of analysis    : AMOC 
Test Statistic  : Normal 
Type of penalty       : MBIC with value, 15.89495 
Minimum Segment Length : 2 
Maximum no. of cpts   : 1 
Changepoint Locations : 100 

[[2]]
...

Please keep an eye out for the #209 PR, which should make it into master today.

Thanks!

roumail commented 6 years ago

Hi @paselem ,

Thanks for your responses! For the first problem, I was doing something like .export = ls(globalenv()) to get around the variable-export problem. It feels a bit hacky, since if my global environment is big it adds unnecessary overhead. Of course, in this example it's easy to see that load_data is the only variable that needs to be exported, but in general that could be difficult to figure out. I didn't expect to need the .export argument, since (if I'm not wrong) it is normally needed for PSOCK clusters and not for forks. Are there plans to implement something similar to the doFuture backend, which only exports what's needed?
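
Concretely, the workaround looks like this (a sketch reusing names from my earlier repro; exporting everything in .GlobalEnv is the hacky part):

result <- foreach::foreach(i = 1:10,
                           .errorhandling = 'stop',
                           .export = ls(globalenv())) %dopar% {
  out <- load_data()
  nrow(out)
}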

As for the second answer, that really helps. I kept thinking there was something wrong with my image, and I had verified multiple times that all the packages were installed and loaded.

Thank you very much for the help. Looking forward to the latest update

roumail commented 6 years ago

Btw, devtools::install_github('azure/doAzureParallel', ref = 'master') throws an error:

Downloading GitHub repo azure/doAzureParallel@master
from URL https://api.github.com/repos/azure/doAzureParallel/zipball/master
Installing doAzureParallel
Error in data.frame(package = package, installed = installed, available = available,  : 
  row names contain missing values

Not sure if you've faced the same issue. I will wait! Thanks again!

paselem commented 6 years ago

@roumail are you on Windows or Linux? We have seen intermittent issues on Windows machines where you need to install 'azure/rAzureBatch' prior to installing 'azure/doAzureParallel'. Can you give that a try?
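
That is, something like:

devtools::install_github('azure/rAzureBatch')
devtools::install_github('azure/doAzureParallel', ref = 'master')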

roumail commented 6 years ago

Hi, I'm on Linux. I tried installing rAzureBatch first (it succeeds), but I still get the same error for doAzureParallel.

paselem commented 6 years ago

@roumail Two things:

  1. We have a PR out with the changes I described above. Before we merge them in, can you try to validate it locally and make sure it works?

    devtools::install_github('azure/doazureparallel', ref='feature/loadLocalRSessionInfo')
  2. Regarding the failing install, is this happening on the same machine you were using before? If not, is this a new box? Can you give me your sessionInfo()?

    sessionInfo()

paselem commented 6 years ago

Actually, I just remembered we hit this install issue with another customer, and it was resolved by upgrading devtools: https://github.com/Azure/doAzureParallel/issues/203
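
In other words, before retrying the install (a sketch; any recent devtools version should do):

install.packages('devtools')   # upgrade devtools first
devtools::install_github('azure/doAzureParallel', ref = 'master')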

roumail commented 6 years ago

Hi @paselem,

#203 fixes the installation issue. Running the script you posted earlier with the updated version of doAzureParallel, the cluster seems to have access to all the libraries I installed in my Docker image!

However, I notice that if I call functions in my parallel algorithm without prefixing them with package::, the functions from that package don't seem to be found.

To reproduce the error: in your script, replace changepoint::cpt.var with simply cpt.var, and the code should fail on any worker with the following error:

[1] "argsList"           "bioconductor"       "cloudCombine"      
[4] "enableCloudCombine" "exportenv"          "expr"              
[7] "github"             "packages"           "pkgName"           
NULL
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux buster/sid

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3
<simpleError in cpt.var(x): could not find function "cpt.var">
No traceback available 
Error Code: 1
paselem commented 6 years ago

@roumail I think this is expected. Since we don't know what you have installed on your machine, you need to tell us explicitly, either by fully qualifying the call:

changepoint::cpt.var(x)

or by loading the package first:

library(changepoint)
cpt.var(x)
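
Alternatively, foreach's standard .packages argument (not shown above) loads the listed packages on each worker before the body runs, so the loop can call cpt.var() unqualified:

result <- foreach::foreach(i = 1:10,
                           .packages = c('changepoint')) %dopar% {
  x <- c(rnorm(100, 0, 1), rnorm(100, 0, 10))
  cpt.var(x)   # found because 'changepoint' is loaded via .packages
}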