AnotherSamWilson / ParBayesianOptimization

Parallelizable Bayesian Optimization in R

Error after initial search is completed #30

Closed robertocerinaprojects closed 3 years ago

robertocerinaprojects commented 3 years ago

Excellent package - been working with it and XGBoost for some time now and it shows incredible promise.

However, I am running into the error below and am not sure where it comes from: Error in rbindlist(scoreSummary) : Column 2 of item 1 is length 9 inconsistent with column 1 which is length 100. Only length-1 columns are recycled.

Do you have any idea what is causing this?

Again many thanks for the package, Roberto

> library(doParallel)
Loading required package: iterators
Loading required package: parallel
> cl <- makeCluster(4)
> registerDoParallel(cl)
> clusterExport(cl,c('Folds','dtrain','SF_complete_temp','pop.count.vector','test.data','results.2016','ilogit','logit','n.folds'))
> clusterEvalQ(cl,expr= {
+   library(xgboost)
+ })
[[1]]
[1] "xgboost"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

[[2]]
[1] "xgboost"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

[[3]]
[1] "xgboost"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

[[4]]
[1] "xgboost"   "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     

> optObj.Uswing <- bayesOpt(
+   FUN = scoringFunctionBinary.Uswing,
+   bounds = bounds,
+   initPoints = 600,
+   iters.n = 4*100, # 4 is the number of draws in each epoch, and the number it is multiplied by represents the number of epochs. 
+   iters.k = 4,
+   acq = 'ei',
+   kappa = 5,
+   eps = 0.1,
+   saveFile = 'Generated Quantities/RVote_OptimizedRF_Uswing.RDS',
+   verbose = TRUE,
+   parallel = TRUE,
+   plotProgress = TRUE#,
+   #otherHalting = list(timeLimit = 60*15) 
+ )

Running initial scoring function 600 times in 4 thread(s)...  45617.86 seconds
Error in rbindlist(scoreSummary) : 
  Column 2 of item 1 is length 9 inconsistent with column 1 which is length 100. Only length-1 columns are recycled.
> stopCluster(cl)
> registerDoSEQ()

AnotherSamWilson commented 3 years ago

This error occurs when the scoring function returns something malformed, for example a list for Score instead of a scalar. Unfortunately, the error is raised before the bayesOpt object has been constructed, so nothing is returned and you can't see which parameter combinations caused it. One workaround is to run the code in the Initialization Setup manually with a seed, see which parameters it attempted, and check whether any of them produce errors when passed to your scoring function:

This code snippet should make your parameters for you:

set.seed(1991)
boundsDT <- ParBayesianOptimization:::boundsToDT(bounds)
initGrid <- ParBayesianOptimization:::randParams(boundsDT, 600)
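
The check described above can be sketched in base R. This is only an illustration under assumptions: `is_valid_score`, `toy_score`, and the three-row `grid` are hypothetical stand-ins (not part of ParBayesianOptimization), with `toy_score` deliberately returning a length-9 Score for one row to mimic a malformed result.

```r
# Hypothetical helper: TRUE only when the result is a list whose
# Score element is a single finite number.
is_valid_score <- function(res) {
  is.list(res) &&
    "Score" %in% names(res) &&
    is.numeric(res$Score) &&
    length(res$Score) == 1L &&
    is.finite(res$Score)
}

# Toy scoring function (assumption): returns a vector-valued Score
# when max_depth is large, to imitate a malformed return value.
toy_score <- function(max_depth, eta) {
  if (max_depth > 8) {
    return(list(Score = rep(-1, 9)))
  }
  list(Score = -(eta * max_depth))
}

# Toy parameter grid standing in for initGrid.
grid <- data.frame(max_depth = c(3, 6, 10), eta = c(0.1, 0.3, 0.2))

# Evaluate each row sequentially, keeping errors as objects instead of stopping.
results <- lapply(seq_len(nrow(grid)), function(i) {
  tryCatch(do.call(toy_score, as.list(grid[i, ])), error = function(e) e)
})

# Rows whose result is an error or not a scalar Score.
bad_rows <- which(!vapply(results, is_valid_score, logical(1)))
print(grid[bad_rows, ])
```

With a real scoring function, the same loop could be run over the rows of the seeded initGrid; any row flagged here is a candidate for the parameter combination that triggers the error.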

AnotherSamWilson commented 3 years ago

When you discover what the problem was, can you please tell me here? There is code that is supposed to catch these situations, but apparently it didn't in this case, and I would like to know what happened.

robertocerinaprojects commented 3 years ago

Thanks Sam - I will try it in a few minutes, will report back here in detail.

robertocerinaprojects commented 3 years ago

Hi Sam - sorry for the delay in this - took a long time to test.

The issue is that there doesn't appear to be an error if I run the score function on the initGrid as you suggested - see below:

set.seed(1991)
boundsDT <- ParBayesianOptimization:::boundsToDT(bounds)
initGrid <- ParBayesianOptimization:::randParams(boundsDT, 600)

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
clusterExport(cl,c('Folds','dtrain','SF_complete_temp','pop.count.vector','test.data','results.2016','ilogit','logit',
                   'n.folds','scoringFunctionBinary.Uswing','boundsDT','initGrid','state.strat.frame'))
clusterEvalQ(cl,expr= {
  library(xgboost)
  library(data.table)
})
start.time = Sys.time()
temp = parSapply(cl = cl, X = 1:dim(initGrid)[1], FUN = function(i){
  score = scoringFunctionBinary.Uswing(
    max_depth = unlist(initGrid[i,"max_depth"]),
    min_child_weight = unlist(initGrid[i,"min_child_weight"]),
    subsample = unlist(initGrid[i,"subsample"]),
    colsample_bytree = unlist(initGrid[i,"colsample_bytree"]),
    eta = unlist(initGrid[i,"eta"]),
    gamma = unlist(initGrid[i,"gamma"]),
    nround = unlist(initGrid[i,"nround"]))
  return(score)
})
end.time = Sys.time()
end.time - start.time
stopCluster(cl)
registerDoSEQ()

If you then run which(unlist(lapply(temp ,length))>1) to check whether any of the results are inconsistent, the result is empty: no element has length greater than 1.

This is a bit spooky because my code returns the error multiple times when using the proper bayesOpt function.

Do you have any advice for further testing?

Many thanks, R
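
One difference between the two runs is visible in this thread: the manual test exports 'state.strat.frame' and loads data.table on the workers, while the original bayesOpt session did neither. Whether that explains the error is not established here, but worker contents are easy to check. A minimal sketch, assuming a small local cluster and hypothetical stand-in objects:

```r
library(parallel)

# Hypothetical stand-ins for objects the scoring function reads.
Folds <- list(1:5, 6:10)
state.strat.frame <- data.frame(state = c("A", "B"), raw_count = c(10, 20))

# Names the scoring function is assumed to need on every worker.
needed <- c("Folds", "state.strat.frame")

cl <- makeCluster(2)

# Deliberately export only one of the two objects.
clusterExport(cl, "Folds", envir = environment())

# Ask each worker which of the needed names it cannot find.
missing_on_workers <- clusterCall(cl, function(nms) {
  nms[!vapply(nms, exists, logical(1), envir = globalenv())]
}, needed)

stopCluster(cl)
print(missing_on_workers)
```

Any name reported here would have to be added to the clusterExport call (or the corresponding package loaded in clusterEvalQ) before the scoring function can run on the workers.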

AnotherSamWilson commented 3 years ago

The error actually occurred when trying to rbindlist the scoreSummary list that was returned here. Can you run data.table::rbindlist successfully on the returned list? If that completes successfully, then it is a rare problem that will likely require many different iterations to figure out.

robertocerinaprojects commented 3 years ago

Indeed it does not! The temp object resulting from the parSapply call above looks like this:

> temp
$Score
[1] -20.80486

$Score
[1] -19.82228

$Score
[1] -19.71769

$Score
[1] -20.19688

Applying rbindlist to it gives this error:

> scoreSummary <- rbindlist(temp)
Error in rbindlist(temp) : 
  Item 1 of input is not a data.frame, data.table or list

I wonder if it's the way I specified the score function? It's not a straightforward OOB score: I'm scoring on performance against an external dataset - see here:

scoringFunctionBinary.Uswing <- function(    max_depth ,
                                             min_child_weight ,
                                             subsample ,
                                             colsample_bytree ,
                                             eta,
                                             gamma,
                                             nround
) {

  dtrain = xgb.DMatrix(as.matrix(dtrain[,colnames(test.data),with=FALSE]),label = dtrain$vote2020)
  test.data = xgb.DMatrix(data = as.matrix(test.data))

  Pars <- list( 
    booster = "gbtree",
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    colsample_bytree = colsample_bytree,
    eta = eta,
    gamma = gamma,
    nround = nround,
    objective = "binary:logistic"
  )

  res.table = data.table()
  for(i in 1:length(Folds)){
    xgb.train = xgboost::xgb.train(data = dtrain[Folds[[i]],],
                                   params = Pars[-which(names(Pars)=="nround")],
                                   nrounds = Pars$nround,
                                   nthread = 4,
                                   verbose = 2)

    # predict on the external test data and aggregate the predictions to state level
    xgb.pred = predict(object = xgb.train,newdata = test.data)
    state.strat.frame$vote2020_pred = xgb.pred
    test.data.pred = state.strat.frame[,lapply(.SD,function(x){sum(x*raw_count)}),by = c('state'),.SDcols = c('vote2020_pred')]
    test.data.count = state.strat.frame[,lapply(.SD,function(x){sum(x)}),by = c('state'),.SDcols = c('raw_count')]
    test.data.temp = merge(test.data.pred,test.data.count,by = 'state')
    test.data.temp$R_pct_pred = test.data.temp$vote2020_pred/test.data.temp$raw_count
    test.data.temp = test.data.temp[match(results.2016$state,test.data.temp$state),]
    residual = results.2016$Uswing_Pred - test.data.temp$R_pct_pred
    res.table = cbind(res.table,data.table(res2 = 100*(residual)))
  } 

  return(
    list( 
      Score = -sqrt(mean(rowMeans(res.table^2)))
    )
  )
}

AnotherSamWilson commented 3 years ago

I don't see anything wrong with your scoring function. The result from each loop in the foreach will be a data.table, so running rbindlist directly on temp wouldn't work. At this point I need a reproducible example to help. Trying to diagnose problems from user-defined functions on user data that I don't have access to is a bit of a headache.
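
A likely reason the manual check could not be passed to rbindlist is that sapply-style functions simplify their output: each list(Score = ...) is flattened into a bare number, which matches the printed temp above. The sketch below uses base `sapply` and `lapply` as stand-ins for `parSapply` and `parLapply`, with a hypothetical `toy_score`; the rbindlist step only runs if data.table is installed.

```r
# Hypothetical scoring function returning the list shape used in this thread.
toy_score <- function(i) list(Score = -i)

# sapply flattens each one-element list into its numeric content ...
simplified <- sapply(1:3, toy_score)
# ... while lapply keeps every result as the list that was returned.
kept <- lapply(1:3, toy_score)

print(is.list(simplified[[1]]))
print(is.list(kept[[1]]))

# rbindlist needs list-like items, so only the lapply result can be combined.
if (requireNamespace("data.table", quietly = TRUE)) {
  combined <- data.table::rbindlist(kept)
  print(combined)
}
```

For a manual test that mirrors what rbindlist receives, collecting results with `parLapply` rather than `parSapply` keeps each return value intact.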