installing packages through github #230

kchaitanyabandi commented 6 years ago


I am trying to train gbm and ranger models with the train function of caret, using doAzureParallel as the backend, but it gives me this error:

Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : attempt to set an attribute on NULL

This issue has also been posted elsewhere, but I couldn't find a solution there. I'm not sure whether the problem lies with caret or with doAzureParallel.

I then tried to install the development version of caret to see whether the problem persists, but I'm confused about how to install GitHub versions of R packages on the parallel processing nodes.

Could anyone please point me to documentation on specifying package names in "cluster.json" so that they are installed on the nodes from GitHub? I entered the GitHub authentication token in the credentials.json file and put the path of the package repository on GitHub under github : [ ] in "cluster.json", but I'm not sure whether the packages are actually being installed from GitHub.

I searched the web extensively for the documentation but couldn't find it, so I had to break the rule of the issue template. I'm sorry. Any help would be very much appreciated.

Example Code I'm running:


gbmGrid <- expand.grid(interaction.depth = 5,
                       n.trees = 100,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

ctrl_gbm <- trainControl(method = "repeatedcv",
                         number = 2,
                         repeats = 1,
                         summaryFunction = multiClassSummary,
                         classProbs = TRUE,
                         verboseIter = TRUE)

tuned_fit_gbm <- train(x = train_data[, names(train_data) != dependant_var],
                       y = train_data[, names(train_data) == dependant_var],
                       method = "gbm",
                       verbose = TRUE,
                       metric = metric_to_use,
                       trControl = ctrl_gbm,
                       tuneGrid = gbmGrid,
                       weights = model_weights_to_use)

My Session Info:

brnleehng commented 6 years ago

Hey @kchaitanyabandi

Installing caret from GitHub is somewhat different because its R project is located in a subdirectory of the GitHub repo. In this case, it's located in "~/pkg/caret".

For GitHub installation, the cluster configuration file needs the path of each package to install on every node. The most common form is repo_name/project_name (for example, Azure/doAzureParallel). You only need a GitHub authentication token if you are using a private repo on GitHub. Otherwise you can leave that blank.

I've added a sample cluster config file.

Here's a link to our documentation for github installation.

I can also test it out if you have a sample data set. Let me know if you have any more questions!

Thanks, Brian

{
  "name": "caret-pool",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 2,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 2,
      "max": 2
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/tidyverse:latest",
  "rPackages": {
    "cran": [],
    "github": ["topepo/caret/pkg/caret", "Azure/doAzureParallel"],
    "bioconductor": []
  },
  "commandLine": []
}

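
One way to confirm that the GitHub version actually landed on the nodes is to ask the workers themselves. The following is a minimal sketch, not part of doAzureParallel: `workerPackageVersions` is a hypothetical helper that runs one foreach task per package and reports the version seen by whichever worker ran it. It works with any registered foreach backend; the doAzureParallel calls are left commented out because they need valid credentials and cluster files (the file names shown are assumptions).

```r
library(foreach)

# Hypothetical helper: report the installed version of each package as seen
# by the worker that executes the task (NA if the package is not installed).
workerPackageVersions <- function(pkgs) {
  foreach(pkg = pkgs, .combine = c) %dopar% {
    v <- tryCatch(as.character(utils::packageVersion(pkg)),
                  error = function(e) NA_character_)
    stats::setNames(v, pkg)
  }
}

# With doAzureParallel (assumed file names, requires a Batch account):
# library(doAzureParallel)
# setCredentials("credentials.json")
# cluster <- makeCluster("cluster.json")
# registerDoAzureParallel(cluster)
# workerPackageVersions(c("caret", "doAzureParallel"))
# stopCluster(cluster)
```

Comparing the reported caret version with the one in the repo's DESCRIPTION file should show whether the development version was picked up.
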
kchaitanyabandi commented 6 years ago

Hey @brnleehng

I used the sample cluster config file you posted, and the latest development version of caret got installed. However, the problem still persists, and the important point is that it fails only for multi-class classification data. For binary classification and regression, it works perfectly fine.

I am not authorized to share my data for confidentiality reasons, but you could try it with any data containing more than two class labels. The code that I am trying to run is as follows.

ctrl_gbm <- trainControl(method = "repeatedcv",
                         number = 10,
                         repeats = 5,
                         summaryFunction = multiClassSummary,
                         classProbs = TRUE,
                         verboseIter = TRUE)

gbmGrid <- expand.grid(nrounds = 100,
                       max_depth = 5,
                       eta = .05,
                       gamma = 0,
                       colsample_bytree = c(.6, .7),
                       min_child_weight = 1,
                       subsample = .8)

tuned_fit_xgb <- train(x = xtrain,
                       y = ytrain,
                       method = "gbm",
                       verbose = TRUE,
                       metric = "logLoss",
                       trControl = ctrl_gbm,
                       tuneGrid = gbmGrid)

brnleehng commented 6 years ago

Hi @kchaitanyabandi

I was able to reproduce the error on the caret sample; it was caused by a missing R algorithm package (the randomForest R package). To avoid missing any algorithm packages, you can use a caret dockerfile that has many of the algorithm packages already installed.

It already includes the ranger and glm R packages.

Error: names(resamples) <- gsub("^\\.", "", names(resamples)) : attempt to set an attribute on NULL

No traceback available

In the cluster file, we referenced the docker image as **jrowen/dcaret:latest** in the containerImage property of json, shown below.

```json
{
  "name": "caret-pool",
  "vmSize": "Standard_F2",
  "maxTasksPerNode": 2,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 2,
      "max": 2
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "jrowen/dcaret:latest",
  "rPackages": {
    "cran": ["MLmetrics", "e1071"],
    "github": [],
    "bioconductor": []
  },
  "commandLine": []
}
```

Specifically, for multiClassSummary function, it appears there are missing packages for MLmetrics and e1071. I've also added these to the cluster configuration file.

Thanks, Brian

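
A quick way to catch this kind of gap before submitting a long job is to check which packages can actually be loaded in the environment that will run the code. This is only a sketch using base R; `missingPackages` is a hypothetical helper, not a caret or doAzureParallel function, and the package list passed to it is whatever your model and summary function need.

```r
# Hypothetical helper: return the subset of `pkgs` whose namespace cannot be
# loaded in the current R session.
missingPackages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Example: packages mentioned in this thread for gbm + multiClassSummary.
# An empty result means everything is installed.
# missingPackages(c("gbm", "MLmetrics", "e1071"))
```

Run inside the same container image as the pool (or as a task on the pool itself), anything it returns is a candidate for the "cran" list in the cluster configuration.
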
kchaitanyabandi commented 6 years ago

Hey Brian,

You're awesome. It solved the issue. Thank you so much for the quick debug and reply. I just wanted to know how you checked for the error that said a package was missing.

Thanks Krishna

brnleehng commented 6 years ago


Thanks for the response! We added the cluster config file to our sample #237.

Using the job id that's printed on the console, you can navigate to the job through the Azure Portal or BatchLabs (our tool for monitoring Batch jobs, and probably the easiest way to navigate).

Go to the job tab of the Azure Portal or BatchLabs. From there, you will see a list of tasks. Click one of the tasks and you will see a folder with stdout.txt, stderr.txt and [Id of the task].txt.

If you click on [Id of the task].txt, you will get the R console output.

kchaitanyabandi commented 6 years ago

Hey @brnleehng

The fix you provided worked for a moment, but then I got an error again with the following grid, which I was using for multi-class classification.

gbmGrid <- expand.grid(interaction.depth = 10:20,
                           n.trees = c(100, 150, 200, 250, 300, 350, 400, 450, 500, 1000, 1175, 1250, 1300),
                           shrinkage = c(0.025, .05, .1, 0.2, 0.3),
                           n.minobsinnode = c(5:10, 20, 30))

ctrl_gbm <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 5,
                               summaryFunction = multiClassSummary,
                               classProbs = TRUE,
                               verboseIter = TRUE)

tuned_fit_gbm <- train(x = train_data[, names(train_data) != dep_var],
                         y = train_data[, names(train_data) == dep_var],
                         method = "gbm",
                         verbose = TRUE,
                         metric = "logLoss",
                         trControl = ctrl_gbm,
                         tuneGrid = gbmGrid)

With this grid, a total of 22000 tasks were submitted to the Batch pool, and the same error

Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : attempt to set an attribute on NULL

popped up after 239 tasks had completed successfully. I don't know why yet; I need to check the logs with the BatchLabs application, as you suggested.
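
For reference, the 22000 figure is consistent with the back-of-the-envelope count below. It rests on two assumptions that are not confirmed in this thread: that caret fits gbm only once per combination of the other parameters (deriving the smaller n.trees values from the largest fit), and that each such fit on each resample becomes one Batch task.

```r
gbmGrid <- expand.grid(interaction.depth = 10:20,
                       n.trees = c(100, 150, 200, 250, 300, 350, 400, 450,
                                   500, 1000, 1175, 1250, 1300),
                       shrinkage = c(0.025, 0.05, 0.1, 0.2, 0.3),
                       n.minobsinnode = c(5:10, 20, 30))

# Every row of the grid: 11 * 13 * 5 * 8 = 5720 parameter combinations.
n_combinations <- nrow(gbmGrid)

# Assumption: one fit per distinct combination of the non-n.trees parameters,
# i.e. 11 * 5 * 8 = 440 fits per resample.
n_fits <- nrow(unique(gbmGrid[, c("interaction.depth", "shrinkage",
                                  "n.minobsinnode")]))

# repeatedcv with number = 10 and repeats = 5 gives 50 resamples.
n_resamples <- 10 * 5

# Assumption: one task per (fit, resample) pair: 440 * 50 = 22000.
n_tasks <- n_fits * n_resamples
```
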

I'm in the process of getting the login credentials for the Azure Batch account. In the meantime, is there any other way to view those files from the R console?

brnleehng commented 6 years ago

Hi @kchaitanyabandi

To get files from the job, we have a getJobFile API. We also have some documentation on getting files from the node.

In order to get the logs, you'll need to get the job id and the task that failed. You'll want to get the [Task Id].txt (For example, 1.txt) because that'll contain the logs from your R program.

# Get the logs from task 1 that was run by R
taskLogs <- getJobFile("job20180322051216", "1", "wd/1.txt")

# Get the stdout output from task 2
stdoutLogs <- getJobFile("job20180322051216", "2", "stdout.txt")
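
Once a log has been fetched, a small filter can help pick out the lines that matter. This is a sketch only: `findErrorLines` is a hypothetical helper, it assumes the log is available as a character string (for example the value returned by getJobFile, if it returns the file contents as text), and the patterns are just the two error shapes seen in this thread.

```r
# Hypothetical helper: split log text into lines and keep those that look like
# R errors or missing-package messages. Lines such as "Error Code: 0" are not
# matched because the pattern requires "Error in " or "Error:".
findErrorLines <- function(logText) {
  lines <- unlist(strsplit(logText, "\r?\n"))
  grep("^Error( in |:)|there is no package called", lines, value = TRUE)
}

# Example with the task log fetched earlier (assumes it is a character string):
# findErrorLines(taskLogs)
```
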

Here's a structure of the job directory on the node

Thanks, Brian

kchaitanyabandi commented 6 years ago

Hey Brian,

I checked the error logs of the failed tasks, but I couldn't work out what went wrong. The following is the output, which says Error Code: 0 at the bottom. Could you please share any insight into what might have gone wrong?

- Fold01.Rep1: shrinkage=0.100, interaction.depth=18, n.minobsinnode= 6, n.trees=1300 
Error Code: 0
brnleehng commented 6 years ago

Based on the logs, it looks like no errors occurred.

Does BatchLabs say the tasks have errors? Can you look at the stderr.txt and stdout.txt files? Is a ValidDeviance of '-nan' a valid result?

If you have a sample dataset (I'm having a tough time finding one) and a working sample that I can use to reproduce the problem, that would be helpful.


kchaitanyabandi commented 6 years ago

Hey Brian,

Please send me your email id and I'll send you the sample dataset.

Thanks Krishna

brnleehng commented 6 years ago

Working through this offline.