Azure / doAzureParallel

An R package that allows users to submit parallel workloads in Azure
MIT License

installing packages through github #230

Closed kchaitanyabandi closed 6 years ago

kchaitanyabandi commented 6 years ago

Hi,

I am trying to train gbm and ranger models using the doAzureParallel backend with caret's train function, but I get this error:

Error in names(resamples) <- gsub("^\.", "", names(resamples)) : attempt to set an attribute on NULL

This issue has also been reported at https://github.com/topepo/caret/issues/62, but I couldn't find a solution there. I'm not sure whether the problem lies with caret or with doAzureParallel.

I then tried to install the development version of caret to see whether the problem persists, but I'm unsure how to install GitHub versions of R packages on the parallel processing nodes.

Could anyone please point me to documentation on specifying package names in "cluster.json" so that they are installed on the nodes from GitHub? I entered the GitHub authentication token in the credentials.json file and listed the path of the package repository under "github": [ ] in "cluster.json", but I'm not sure whether the packages are actually being installed from GitHub.

I searched the web extensively for the documentation but couldn't find it, so I had to break the rule of the issue template. I'm sorry, but any help would be greatly appreciated.

Example Code I'm running:

registerDoAzureParallel(azure_cluster_krishna)

gbmGrid <- expand.grid(interaction.depth = 5,
                       n.trees = 100,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

ctrl_gbm <- trainControl(method = "repeatedcv",
                         number = 2,
                         repeats = 1,
                         summaryFunction = multiClassSummary,
                         classProbs = TRUE,
                         verboseIter = TRUE)

tuned_fit_gbm <- train(x = train_data[, names(train_data) != dependant_var],
                       y = train_data[, names(train_data) == dependant_var],
                       method = "gbm",
                       verbose = TRUE,
                       metric = metric_to_use,
                       trControl = ctrl_gbm,
                       tuneGrid = gbmGrid,
                       weights = model_weights_to_use)

My Session Info:

R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] splines parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] doAzureParallel_0.6.2 gbm_2.1.3 survival_2.41-3 gower_0.1.2
[5] dimRed_0.1.0 DRR_0.0.2 CVST_0.2-1 Matrix_1.2-12
[9] kernlab_0.9-25 DEoptimR_1.0-8 ddalpha_1.3.1 sfsmisc_1.1-1
[13] robustbase_0.92-8 class_7.3-14 pkgconfig_2.0.1 glue_1.2.0
[17] bindrcpp_0.2 assertthat_0.2.0 RcppRoll_0.2.2 ModelMetrics_1.1.0
[21] lazyeval_0.2.1 munsell_0.4.3 mime_0.5 stringdist_0.9.4.6
[25] wavethresh_4.6.8 Metrics_0.1.3 Cubist_0.2.1 plyr_1.8.4
[29] ClusterR_1.0.9 gtools_3.5.0 cluster_2.0.6 lsr_0.5
[33] MASS_7.3-47 doSNOW_1.0.15 snow_0.4-2 ranger_0.8.0
[37] randomForest_4.6-12 mice_2.46.0 stringr_1.2.0 fscaret_0.9.4.1
[41] hmeasure_1.0 gsubfn_0.6-6 proto_1.0.0 caret_6.0-79
[45] ggplot2_2.2.1 lattice_0.20-35 doParallel_1.0.11 iterators_1.0.9
[49] foreach_1.4.4 dplyr_0.7.4 data.table_1.10.4-3

loaded via a namespace (and not attached): [1] colorspace_1.3-2 rjson_0.2.15 prodlim_1.6.1 lubridate_1.7.1 codetools_0.2-15
[6] mnormt_1.5-5 ade4_1.7-8 jsonlite_1.5 broom_0.4.3 png_0.1-7
[11] FD_1.0-12 shiny_1.0.5 compiler_3.4.3 httr_1.3.1 htmltools_0.3.6
[16] tools_3.4.3 gmp_0.5-13.1 gtable_0.2.0 reshape2_1.4.3 Rcpp_0.12.14
[21] gdata_2.18.0 ape_5.0 nlme_3.1-131 psych_1.7.8 timeDate_3042.101
[26] devtools_1.13.4 MLmetrics_1.1.1 scales_0.5.0 ipred_0.9-6 rAzureBatch_0.5.6
[31] curl_3.0 yaml_2.1.15 memoise_1.1.0 rpart_4.1-11 stringi_1.1.6
[36] e1071_1.6-8 permute_0.9-4 tiff_0.1-5 caTools_1.17.1 lava_1.5.1
[41] geometry_0.3-6 bitops_1.0-6 rlang_0.1.4 ROCR_1.0-7 purrr_0.2.4
[46] bindr_0.1 OpenImageR_1.0.7 recipes_0.1.2 tidyselect_0.2.3 magrittr_1.5
[51] R6_2.2.2 gplots_3.0.1 foreign_0.8-69 withr_2.1.1 mgcv_1.8-22
[56] RCurl_1.95-4.10 nnet_7.3-12 tibble_1.3.4 KernSmooth_2.23-15 xgboost_0.6.4.1
[61] jpeg_0.1-8 grid_3.4.3 vegan_2.4-5 digest_0.6.15 xtable_1.8-2
[66] tidyr_0.7.2 httpuv_1.3.5 stats4_3.4.3 magic_1.5-6 tcltk_3.4.3

brnleehng commented 6 years ago

Hey @kchaitanyabandi

Installing caret from GitHub is slightly different because its R package lives in a subdirectory of the repo. In this case, it's located in "pkg/caret" in https://github.com/topepo/caret.

The cluster configuration file needs a path for each GitHub package to install on every node. The most common form is owner/repo_name (for example, Azure/doAzureParallel); for a package in a subdirectory, append the subdirectory path. You only need a GitHub authentication token if you are using a private repo on GitHub. Otherwise you can leave that blank.

I've added a sample cluster config file.

Here's a link to our documentation for github installation. https://github.com/Azure/doAzureParallel/tree/master/samples/package_management

I can also test it out if you have a sample data set. Let me know if you have any more questions!

Thanks, Brian

{
  "name": "caret-pool",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 2,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 2,
      "max": 2
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/tidyverse:latest",
  "rPackages": {
    "cran": [],
    "github": ["topepo/caret/pkg/caret", "Azure/doAzureParallel"],
    "bioconductor": []
  },
  "commandLine": []
}
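
The config above could be used from R roughly as follows. This is a minimal sketch, assuming the JSON is saved as cluster.json next to a credentials.json file; both file names and the start_caret_cluster helper are hypothetical, and nothing is sent to Azure until the function is called.

```r
# Hypothetical helper wrapping the usual doAzureParallel start-up calls.
# The default file names are assumptions made for this sketch.
start_caret_cluster <- function(credentials_path = "credentials.json",
                                cluster_path = "cluster.json") {
  # Load the Batch and storage credentials
  doAzureParallel::setCredentials(credentials_path)
  # Create the pool described by the cluster config; the packages listed
  # under "rPackages" are installed on every node
  cluster <- doAzureParallel::makeCluster(cluster_path)
  # Register the pool as the foreach backend that caret's train() will use
  doAzureParallel::registerDoAzureParallel(cluster)
  cluster
}

# Usage (needs valid Azure credentials):
# cluster <- start_caret_cluster()
# ... call train() here ...
# doAzureParallel::stopCluster(cluster)
```
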
kchaitanyabandi commented 6 years ago

Hey @brnleehng

I used the sample cluster config file from your comment and the latest development version of caret got installed. However, the problem still persists, and the important thing is that it fails only for multi-class classification data. For binary classification and regression, it works perfectly fine.

I am not authorized to share my data for confidentiality reasons, but you could try it with any data containing more than two class labels. The code that I am trying to run is as follows.

ctrl_gbm <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 5,
                         summaryFunction = multiClassSummary,
                         classProbs = TRUE,
                         verboseIter = TRUE)

gbmGrid <- expand.grid(nrounds = 100,
                       max_depth = 5,
                       eta = .05,
                       gamma = 0,
                       colsample_bytree = c(.6, .7),
                       min_child_weight = 1,
                       subsample = .8)

registerDoAzureParallel(mycluster)

# The tuning grid above holds xgbTree parameters, so the method is "xgbTree"
tuned_fit_xgb <- train(x = xtrain,
                       y = ytrain,
                       method = "xgbTree",
                       verbose = TRUE,
                       metric = "logLoss",
                       trControl = ctrl_gbm,
                       tuneGrid = gbmGrid)
brnleehng commented 6 years ago

Hi @kchaitanyabandi

I was able to reproduce the error with the caret sample; it was caused by a missing algorithm package (the randomForest R package). To avoid missing algorithm packages, you can use a caret Docker image that already has many of them installed.

https://hub.docker.com/r/jrowen/dcaret/~/dockerfile/

This image already has the ranger and glm R packages installed.
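
One way to see which packages are unavailable in a given R session is a small check like the one below. It is only a sketch: missing_packages is a hypothetical helper, not part of caret or doAzureParallel, and the package names in the usage line are simply the ones mentioned in this thread.

```r
# Hypothetical helper: return the names of packages that cannot be loaded
# in the current R session.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when the package can be loaded
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Packages mentioned in this thread; the result depends on the machine
missing_packages(c("randomForest", "ranger", "MLmetrics", "e1071"))
```

Run locally, this only describes the local library. Running the same function inside the body of a foreach loop with %dopar% should, presumably, report what is missing on a node instead.
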

Error: names(resamples) <- gsub("^\.", "", names(resamples)) : attempt to set an attribute on NULL

No traceback available

In the cluster file, we referenced the Docker image as **jrowen/dcaret:latest** in the containerImage property of the JSON, shown below.

```json
{
  "name": "caret-pool",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 2,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 2,
      "max": 2
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "jrowen/dcaret:latest",
  "rPackages": {
    "cran": ["MLmetrics", "e1071"],
    "github": [],
    "bioconductor": []
  },
  "commandLine": []
}
```

Specifically, for the multiClassSummary function, it appears that the MLmetrics and e1071 packages were missing. I've also added these to the cluster configuration file.

Thanks, Brian

kchaitanyabandi commented 6 years ago

Hey Brian,

You're awesome. It solved the issue. Thank you so much for the quick debug and reply. I just wanted to know how you found the error showing that a package was missing.

Thanks, Krishna

brnleehng commented 6 years ago

@kchaitanyabandi

Thanks for the response! We added the cluster config file to our samples in #237.

Using the job id that's printed to the console, you can find the job in the Azure Portal or in BatchLabs (our tool for monitoring Batch jobs, and probably the easiest way to navigate).

The job tab of the Azure Portal or BatchLabs shows a list of tasks. Clicking one of the tasks shows a folder containing stdout.txt, stderr.txt and [Id of the task].txt.

If you click on [Id of the task].txt, you will get the R console output.

kchaitanyabandi commented 6 years ago

Hey @brnleehng

The fix you provided worked for a moment, but then I got an error again with the following grid, which I was using for multi-class classification.

gbmGrid <- expand.grid(interaction.depth = 10:20,
                           n.trees = c(100, 150, 200, 250, 300, 350, 400, 450, 500, 1000, 1175, 1250, 1300),
                           shrinkage = c(0.025, .05, .1, 0.2, 0.3),
                           n.minobsinnode = c(5:10, 20, 30))

ctrl_gbm <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 5,
                               summaryFunction = multiClassSummary,
                               classProbs = TRUE,
                               verboseIter = TRUE)

tuned_fit_gbm <- train(x = train_data[, names(train_data) != dep_var],
                         y = train_data[, names(train_data) == dep_var],
                         method = "gbm",
                         verbose = TRUE,
                         metric = "logLoss",
                         trControl = ctrl_gbm,
                         tuneGrid = gbmGrid)

With this grid, a total of 22000 tasks were submitted to the Batch pool, and the same error

Error in names(resamples) <- gsub("^\.", "", names(resamples)) : attempt to set an attribute on NULL

popped up after 239 tasks had completed successfully. I'm not sure why; I still need to check the logs in the BatchLabs application, as you suggested.
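
The 22000 figure can be reproduced with a little arithmetic. The sketch below rests on two assumptions: that caret fits gbm once per unique combination of the parameters other than n.trees (smaller tree counts are derived from the largest fit), and that each such fit on each resample becomes one Batch task. The variable names are made up for the illustration.

```r
gbmGrid <- expand.grid(interaction.depth = 10:20,
                       n.trees = c(100, 150, 200, 250, 300, 350, 400, 450,
                                   500, 1000, 1175, 1250, 1300),
                       shrinkage = c(0.025, 0.05, 0.1, 0.2, 0.3),
                       n.minobsinnode = c(5:10, 20, 30))

# Every row of the full tuning grid
n_grid_rows <- nrow(gbmGrid)

# Assumption: one model fit per combination of the non-n.trees parameters
fit_grid <- unique(gbmGrid[, c("interaction.depth", "shrinkage", "n.minobsinnode")])
n_fits_per_resample <- nrow(fit_grid)

# repeatedcv with number = 10 and repeats = 5
n_resamples <- 10 * 5

n_tasks <- n_fits_per_resample * n_resamples
print(c(grid_rows = n_grid_rows,
        fits_per_resample = n_fits_per_resample,
        tasks = n_tasks))
```

Under these assumptions the grid has 5720 rows but only 440 fits per resample, and 440 fits times 50 resamples gives the 22000 tasks.
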

I'm in the process of getting the login credentials for the Azure Batch account. In the meantime, is there any other way to see those files from the R console?

brnleehng commented 6 years ago

Hi @kchaitanyabandi

To get files from the job, we have a getJobFile API. We also have some documentation on getting files from the node. Link

To get the logs, you'll need the job id and the id of the task that failed. You'll want [Task Id].txt (for example, 1.txt) because it contains the logs from your R program.

# Get the logs from task 1 that was run by R
taskLogs <- getJobFile("job20180322051216", "1", "wd/1.txt")
cat(taskLogs)

# Get the stdout output from task 2
stdoutLogs <- getJobFile("job20180322051216", "2", "stdout.txt")
cat(stdoutLogs)
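
Once a log has been downloaded, the lines that mention an error can be pulled out with base R. This is a rough sketch, assuming the log is available as a single character string; find_error_lines is a hypothetical helper and sample_log is a stand-in built from lines quoted in this thread, not real getJobFile output.

```r
# Hypothetical helper: keep only the log lines that mention an error.
find_error_lines <- function(log_text) {
  # Split the text into individual lines (handles \n and \r\n)
  lines <- unlist(strsplit(log_text, "\r?\n"))
  grep("error", lines, ignore.case = TRUE, value = TRUE)
}

# Stand-in log text for the illustration
sample_log <- paste(
  "+ Fold01.Rep1: shrinkage=0.100, interaction.depth=18",
  "- Fold01.Rep1: shrinkage=0.100, interaction.depth=18",
  "Error Code: 0",
  sep = "\n"
)

find_error_lines(sample_log)
```
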

Here's a structure of the job directory on the node

Thanks, Brian

kchaitanyabandi commented 6 years ago

Hey Brian,

I checked the error logs of the failed tasks, and I couldn't quite understand what went wrong. The following is the output, which says Error Code: 0 at the bottom. Could you please share any insight into what might have gone wrong?

[1] "argsList"           "bioconductor"       "cloudCombine"      
[4] "enableCloudCombine" "exportenv"          "expr"              
[7] "github"             "packages"           "pkgName"           
[1] "caret"
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] caret_6.0-79    ggplot2_2.2.1   lattice_0.20-35

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.4   purrr_0.2.4        reshape2_1.4.3     kernlab_0.9-25    
 [5] splines_3.4.2      colorspace_1.3-2   stats4_3.4.2       survival_2.41-3   
 [9] prodlim_1.6.1      rlang_0.2.0        ModelMetrics_1.1.0 pillar_1.2.1      
[13] foreign_0.8-69     glue_1.2.0         withr_2.1.2        bindrcpp_0.2      
[17] foreach_1.4.4      bindr_0.1          plyr_1.8.4         dimRed_0.1.0      
[21] lava_1.5.1         robustbase_0.92-8  stringr_1.3.0      timeDate_3043.102 
[25] munsell_0.4.3      gtable_0.2.0       recipes_0.1.2      codetools_0.2-15  
[29] psych_1.7.8        parallel_3.4.2     class_7.3-14       DEoptimR_1.0-8    
[33] broom_0.4.3        methods_3.4.2      Rcpp_0.12.16       scales_0.5.0      
[37] ipred_0.9-6        CVST_0.2-1         mnormt_1.5-5       stringi_1.1.7     
[41] dplyr_0.7.4        RcppRoll_0.2.2     ddalpha_1.3.1.1    grid_3.4.2        
[45] tools_3.4.2        magrittr_1.5       lazyeval_0.2.0     tibble_1.4.2      
[49] tidyr_0.8.0        DRR_0.0.2          pkgconfig_2.0.1    MASS_7.3-47       
[53] Matrix_1.2-11      lubridate_1.7.3    gower_0.1.2        assertthat_0.2.0  
[57] iterators_1.0.8    R6_2.2.2           rpart_4.1-11       sfsmisc_1.1-2     
[61] nnet_7.3-12        nlme_3.1-131       compiler_3.4.2    
+ Fold01.Rep1: shrinkage=0.100, interaction.depth=18, n.minobsinnode= 6, n.trees=1300 
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.0986            -nan     0.1000    0.1951
     2        0.9683            -nan     0.1000    0.1396
     3        0.8747            -nan     0.1000    0.1040
     4        0.8033            -nan     0.1000    0.0821
     5        0.7463            -nan     0.1000    0.0658
     6        0.7020            -nan     0.1000    0.0546
     7        0.6636            -nan     0.1000    0.0442
     8        0.6314            -nan     0.1000    0.0367
     9        0.6049            -nan     0.1000    0.0349
    10        0.5803            -nan     0.1000    0.0279
    20        0.4431            -nan     0.1000    0.0082
    40        0.3491            -nan     0.1000    0.0029
    60        0.3006            -nan     0.1000   -0.0001
    80        0.2667            -nan     0.1000    0.0005
   100        0.2443            -nan     0.1000   -0.0007
   120        0.2280            -nan     0.1000   -0.0010
   140        0.2145            -nan     0.1000   -0.0007
   160        0.2043            -nan     0.1000   -0.0007
   180        0.1957            -nan     0.1000   -0.0015
   200        0.1885            -nan     0.1000   -0.0011
   220        0.1823            -nan     0.1000   -0.0021
   240        0.1768            -nan     0.1000   -0.0017
   260        0.1727            -nan     0.1000   -0.0019
   280        0.1689            -nan     0.1000   -0.0011
   300        0.1657            -nan     0.1000   -0.0020
   320        0.1627            -nan     0.1000   -0.0019
   340        0.1604            -nan     0.1000   -0.0017
   360        0.1581            -nan     0.1000   -0.0018
   380        0.1559            -nan     0.1000   -0.0015
   400        0.1541            -nan     0.1000   -0.0015
   420        0.1523            -nan     0.1000   -0.0013
   440        0.1508            -nan     0.1000   -0.0018
   460        0.1491            -nan     0.1000   -0.0022
   480        0.1477            -nan     0.1000   -0.0019
   500        0.1463            -nan     0.1000   -0.0015
   520        0.1452            -nan     0.1000   -0.0020
   540        0.1440            -nan     0.1000   -0.0015
   560        0.1432            -nan     0.1000   -0.0011
   580        0.1418            -nan     0.1000   -0.0019
   600        0.1410            -nan     0.1000   -0.0013
   620        0.1399            -nan     0.1000   -0.0018
   640        0.1390            -nan     0.1000   -0.0016
   660        0.1382            -nan     0.1000   -0.0015
   680        0.1372            -nan     0.1000   -0.0014
   700        0.1366            -nan     0.1000   -0.0012
   720        0.1359            -nan     0.1000   -0.0015
   740        0.1352            -nan     0.1000   -0.0017
   760        0.1345            -nan     0.1000   -0.0016
   780        0.1339            -nan     0.1000   -0.0016
   800        0.1333            -nan     0.1000   -0.0017
   820        0.1327            -nan     0.1000   -0.0015
   840        0.1321            -nan     0.1000   -0.0023
   860        0.1316            -nan     0.1000   -0.0017
   880        0.1313            -nan     0.1000   -0.0013
   900        0.1306            -nan     0.1000   -0.0023
   920        0.1302            -nan     0.1000   -0.0018
   940        0.1298            -nan     0.1000   -0.0017
   960        0.1292            -nan     0.1000   -0.0019
   980        0.1289            -nan     0.1000   -0.0019
  1000        0.1284            -nan     0.1000   -0.0016
  1020        0.1282            -nan     0.1000   -0.0017
  1040        0.1279            -nan     0.1000   -0.0024
  1060        0.1275            -nan     0.1000   -0.0019
  1080        0.1273            -nan     0.1000   -0.0022
  1100        0.1270            -nan     0.1000   -0.0014
  1120        0.1266            -nan     0.1000   -0.0016
  1140        0.1264            -nan     0.1000   -0.0019
  1160        0.1261            -nan     0.1000   -0.0014
  1180        0.1257            -nan     0.1000   -0.0015
  1200        0.1254            -nan     0.1000   -0.0019
  1220        0.1251            -nan     0.1000   -0.0025
  1240        0.1249            -nan     0.1000   -0.0019
  1260        0.1244            -nan     0.1000   -0.0012
  1280        0.1241            -nan     0.1000   -0.0018
  1300        0.1239            -nan     0.1000   -0.0018

- Fold01.Rep1: shrinkage=0.100, interaction.depth=18, n.minobsinnode= 6, n.trees=1300 
Error Code: 0
brnleehng commented 6 years ago

It looks like no errors occurred, based on the logs.

Does BatchLabs say the tasks have errors? Can you look at the stderr.txt and stdout.txt files? Is '-nan' a valid value for ValidDeviance?

If you have a sample dataset and a working sample that I can use to reproduce the problem, that would be helpful (I'm having a tough time finding a dataset).

Brian

kchaitanyabandi commented 6 years ago

Hey Brian,

Please send your email id to me at bandi014@umn.edu and I'll send you the sample dataset.

Thanks, Krishna

brnleehng commented 6 years ago

Working through this offline.