Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

Trapped in infinite package install #187

Closed: unitroot closed this issue 6 years ago

unitroot commented 6 years ago

I am currently evaluating Azure Batch for my employer. Trying your caret_example.R, I encountered two problems:

1) Booting nodes takes forever

Here's my current config file

{
  "name": "caret",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 2,
      "max": 2
    },
    "autoscaleFormula": "QUEUE"
  },
  "rPackages": {
    "cran": ["foreach", "doParallel"],
    "github": ["topepo/caret/pkg/caret"],
    "githubAuthenticationToken": ""
  },
  "commandLine": []
}

Booting this rather simple batch takes 5 minutes. Is this normal or a consequence of me having a trial account?
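
The pool configuration above can be sanity-checked locally before a cluster is created. The following is a minimal sketch, assuming the `jsonlite` package is installed; `summarise_cluster_config()` is a hypothetical helper written for illustration and is not part of doAzureParallel. It only parses the JSON and reports the total node count and the packages that would be requested.

```r
library(jsonlite)

# The cluster configuration from this report, inlined so the sketch is self-contained.
config_json <- '{
  "name": "caret",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 1,
  "poolSize": {
    "dedicatedNodes": { "min": 0, "max": 0 },
    "lowPriorityNodes": { "min": 2, "max": 2 },
    "autoscaleFormula": "QUEUE"
  },
  "rPackages": {
    "cran": ["foreach", "doParallel"],
    "github": ["topepo/caret/pkg/caret"],
    "githubAuthenticationToken": ""
  },
  "commandLine": []
}'

# Hypothetical helper (not a doAzureParallel function): parse the config text
# and return the fields that matter when reviewing a pool definition.
summarise_cluster_config <- function(json_text) {
  cfg <- fromJSON(json_text, simplifyVector = TRUE)
  pool <- cfg$poolSize
  list(
    name = cfg$name,
    max_nodes = pool$dedicatedNodes$max + pool$lowPriorityNodes$max,
    cran = as.character(cfg$rPackages$cran),
    github = as.character(cfg$rPackages$github)
  )
}

cfg_summary <- summarise_cluster_config(config_json)
str(cfg_summary)
```

A parse error at this stage points to a malformed file rather than to anything on the Azure side.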

2) Running the train function traps me in an infinite package install.

Job Preparation Status: Package(s) being installed......................................................

I aborted the run after 20 minutes. Not sure what to do about this.

Here's my session info, if it's of any help:
`R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252 LC_MONETARY=German_Austria.1252
[4] LC_NUMERIC=C LC_TIME=German_Austria.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] DAAG_1.22 caret_6.0-78 ggplot2_2.2.1 lattice_0.20-35
[5] doAzureParallel_0.6.2 iterators_1.0.8 foreach_1.4.3 devtools_1.13.3
[9] quanteda_0.99.22

loaded via a namespace (and not attached):
[1] httr_1.3.1 ddalpha_1.3.1 tidyr_0.7.1 sfsmisc_1.1-1
[5] jsonlite_1.5 splines_3.4.2 prodlim_1.6.1 RcppParallel_4.3.20
[9] assertthat_0.2.0 stats4_3.4.2 latticeExtra_0.6-28 DRR_0.0.2
[13] yaml_2.1.14 robustbase_0.92-7 ipred_0.9-6 glue_1.1.1
[17] digest_0.6.12 RColorBrewer_1.1-2 randomForest_4.6-12 colorspace_1.3-2
[21] recipes_0.1.1 Matrix_1.2-11 plyr_1.8.4 psych_1.7.8
[25] timeDate_3012.100 pkgconfig_2.0.1 CVST_0.2-1 broom_0.4.2
[29] purrr_0.2.3 scales_0.5.0 gower_0.1.2 lava_1.5.1
[33] git2r_0.19.0 tibble_1.3.4 withr_2.1.0 nnet_7.3-12
[37] lazyeval_0.2.0 mnormt_1.5-5 mime_0.5 survival_2.41-3
[41] magrittr_1.5 memoise_1.1.0 nlme_3.1-131 MASS_7.3-47
[45] dimRed_0.1.0 foreign_0.8-69 class_7.3-14 tools_3.4.2
[49] data.table_1.10.4-1 stringr_1.2.0 kernlab_0.9-25 munsell_0.4.3
[53] bindrcpp_0.2 compiler_3.4.2 RcppRoll_0.2.2 rlang_0.1.2
[57] grid_3.4.2 RCurl_1.95-4.8 rstudioapi_0.7 rjson_0.2.15
[61] bitops_1.0-6 gtable_0.2.0 ModelMetrics_1.1.0 codetools_0.2-15
[65] curl_3.0 reshape2_1.4.2 R6_2.2.2 lubridate_1.6.0
[69] dplyr_0.7.4 bindr_0.1 fastmatch_1.1-0 stringi_1.1.5
[73] parallel_3.4.2 Rcpp_0.12.13 spacyr_0.9.3 rpart_4.1-11
[77] rAzureBatch_0.5.4 DEoptimR_1.0-8 tidyselect_0.2.2`

paselem commented 6 years ago

@unitroot - Regarding 1: yes, clusters take 5 minutes to come up regardless of account type. Our main target workflow is jobs that take several minutes to days to run. We are constantly looking to drive down setup time, but the environment always takes some fixed time to set up.

Regarding 2: in the past we ran into similar issues on Ubuntu hosts, where Job Preparation took forever, but it looks like you're running Windows. That issue was an authentication problem with one of our underlying packages, not with the job itself. I will test again, take a look, and let you know what I find. Thanks for bringing it up.

brnleehng commented 6 years ago

@unitroot

2) I tried running the caret example with your pool configuration. I didn't run into the infinite package installation; however, the foreach loop failed because it could not find the 'randomForest' package. Since 'randomForest' is only listed as a suggested ('Suggests') package for caret, it is not installed automatically, so we need to add 'randomForest' to the package list in the cluster object, as below.

I am still looking into why waiting for the job's package installation ends up in an infinite loop.

# create a cluster object
caretCluster <- list(
  "name" = "caret",
  "vmSize" = "Standard_F2",
  "maxTasksPerNode" = 1,
  "poolSize" = list(
    "dedicatedNodes" = list(
      "min" = 2,
      "max" = 2
    ),
    "lowPriorityNodes" = list(
      "min" = 0,
      "max" = 0
    ),
    "autoscaleFormula" = "QUEUE"
  ),
  "containerImage" = "rocker/tidyverse:latest",
  "rPackages" = list(
    "cran" = list("foreach", "randomForest"),
    "github" = list("topepo/caret/pkg/caret"),
    "bioconductor" = list()
  ),
  "commandLine" = list()
)

# Creating an Azure parallel backend
cluster <- makeCluster(caretCluster)

Thanks! Brian
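
To see in advance which suggested packages a model might need on the nodes, the `Suggests` field of a locally installed package can be compared with the planned install list. This is a minimal sketch using base R's `utils::packageDescription()`; `suggested_packages()` and `missing_suggests()` are hypothetical helpers for illustration, not part of caret or doAzureParallel, and the result depends on the package version installed locally.

```r
# Hypothetical helper: read the "Suggests" field of an installed package's
# DESCRIPTION file and return the package names as a character vector.
suggested_packages <- function(pkg) {
  field <- utils::packageDescription(pkg, fields = "Suggests")
  # NA is returned when the package or the field is missing.
  if (is.null(field) || is.na(field)) return(character(0))
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  # Drop version constraints such as "(>= 1.0)" and surrounding whitespace.
  trimws(gsub("\\s*\\([^)]*\\)", "", parts))
}

# Hypothetical helper: suggested packages that are not in the planned list.
missing_suggests <- function(pkg, planned) {
  setdiff(suggested_packages(pkg), planned)
}

# Example call (requires caret to be installed locally):
# missing_suggests("caret", c("foreach", "randomForest"))
```

Only the suggested packages that the chosen model method actually uses need to be added to the cluster's package list.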

paselem commented 6 years ago

Equivalently, in the JSON file it would look like this:

...
"rPackages": {
  "cran": ["foreach","randomForest"],
  "github": ["topepo/caret/pkg/caret"]
},
...

unitroot commented 6 years ago

I tried running the script again without any packages to be installed. Same problem. Region settings don't play into this, do they?

paselem commented 6 years ago

@unitroot, unfortunately we cannot reproduce this issue. My loop seems to run as expected.

Job Summary: 
Id: job20171212184300
Job Preparation Status: Package(s) being installed
Waiting for tasks to complete. . .
| Progress: 97.33% (73/75) | Running: 2 | Queued: 0 | Completed: 73 | Failed: 0 ||

In the past we saw authentication issues when checking the job preparation status. Can you please enable verbose HTTP logging so we can look at the traffic and determine whether that is causing your issue?

# Add verbose logging before invoking Caret
doAzureParallel::setHttpTraffic(TRUE)
...
rf_fit <- train( ... )
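
While debugging a wait that never ends, it can also help to bound the wait locally instead of aborting by hand. The following is a minimal sketch of a polling loop with a timeout in base R. `poll_until()` is a hypothetical helper and `check_fn` is a placeholder for whatever status check is available; the sketch does not call any doAzureParallel API.

```r
# Hypothetical helper (not part of doAzureParallel): call check_fn repeatedly
# until it returns TRUE or until timeout_secs have elapsed.
# Returns TRUE on success and FALSE on timeout.
poll_until <- function(check_fn, timeout_secs = 1200, interval_secs = 10) {
  started <- Sys.time()
  repeat {
    if (isTRUE(check_fn())) return(TRUE)
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed >= timeout_secs) return(FALSE)
    Sys.sleep(interval_secs)
  }
}

# Usage with a stand-in check that succeeds on the third call.
calls <- 0
finished <- poll_until(function() {
  calls <<- calls + 1
  calls >= 3
}, timeout_secs = 5, interval_secs = 0.1)
```

A `FALSE` result gives a clear point at which to collect the HTTP log rather than waiting indefinitely.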

unitroot commented 6 years ago

The problem was specific to my installation. I could not find the error, but everything works now.