robertocerinaprojects closed this issue 3 years ago
This error occurs when the scoring function returns something malformed, for example a list for Score instead of a scalar. Unfortunately, the bayesOpt object hasn't been constructed at that point, so nothing is returned and you can't see which parameter combinations caused the error. One workaround is to run the code from the Initialization Setup manually with a seed, see which parameters it would have tried, and check whether any of them produce errors when passed to your scoring function.
This code snippet should generate the parameters for you:
set.seed(1991)
boundsDT <- ParBayesianOptimization:::boundsToDT(bounds)
initGrid <- ParBayesianOptimization:::randParams(boundsDT, 600)
When you discover what the problem was, could you please report it here? There is code that is supposed to catch these situations, but apparently it didn't in this case, and I would like to know what happened.
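To narrow down which parameter sets produce a malformed return value, one option is to validate each result before the results are combined. The helper below is only a sketch: checkScoreResult is a hypothetical name, not part of ParBayesianOptimization, and it assumes the scoring function should return a list whose Score element is a single finite number and whose other elements all have length 1.

```r
# Hypothetical helper, not part of ParBayesianOptimization.
# Returns "ok" or a short description of what looks wrong with one result.
checkScoreResult <- function(res) {
  if (!is.list(res)) return("result is not a list")
  if (!"Score" %in% names(res)) return("list has no 'Score' element")
  score <- res[["Score"]]
  if (!is.numeric(score) || length(score) != 1L) {
    return(sprintf("'Score' must be a single number (got %s of length %d)",
                   class(score)[1], length(score)))
  }
  if (!is.finite(score)) return("'Score' is not finite")
  # Any extra element that is not length 1 would give rows of unequal length.
  bad <- names(res)[lengths(res) != 1L]
  if (length(bad) > 0L) {
    return(paste("elements with length != 1:", paste(bad, collapse = ", ")))
  }
  "ok"
}

good <- checkScoreResult(list(Score = -20.8))
badLength <- checkScoreResult(list(Score = c(-20.8, -19.8)))
badExtra <- checkScoreResult(list(Score = -20.8, preds = 1:9))
print(c(good, badLength, badExtra))
```

Applying a check like this to the result for every row of initGrid would flag the offending parameter combinations by row index.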
Thanks Sam - I will try it in a few minutes and report back here in detail.
Hi Sam - sorry for the delay on this; it took a long time to test.
The issue is that there doesn't appear to be an error when I run the scoring function on the initGrid as you suggested - see below:
set.seed(1991)
boundsDT <- ParBayesianOptimization:::boundsToDT(bounds)
initGrid <- ParBayesianOptimization:::randParams(boundsDT, 600)
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
clusterExport(cl, c('Folds','dtrain','SF_complete_temp','pop.count.vector','test.data','results.2016','ilogit','logit',
                    'n.folds','scoringFunctionBinary.Uswing','boundsDT','initGrid','state.strat.frame'))
clusterEvalQ(cl, expr = {
  library(xgboost)
  library(data.table)
})
start.time = Sys.time()
temp = parSapply(cl = cl, X = 1:dim(initGrid)[1], FUN = function(i){
  score =
    scoringFunctionBinary.Uswing(max_depth = unlist(initGrid[i,"max_depth"]),
                                 min_child_weight = unlist(initGrid[i,"min_child_weight"]),
                                 subsample = unlist(initGrid[i,"subsample"]),
                                 colsample_bytree = unlist(initGrid[i,"colsample_bytree"]),
                                 eta = unlist(initGrid[i,"eta"]),
                                 gamma = unlist(initGrid[i,"gamma"]),
                                 nround = unlist(initGrid[i,"nround"]));
  return(score)
})
end.time = Sys.time()
end.time - start.time
stopCluster(cl)
registerDoSEQ()
If you then run which(unlist(lapply(temp, length)) > 1) to check whether any of them are inconsistent, you get an empty (zero-length) result.
This is a bit spooky because my code returns the error multiple times when using the proper bayesOpt function.
Do you have any advice for further testing?
Many thanks, R
The error actually occurred when trying to rbindlist the scoreSummary list that was returned here.
Can you run data.table::rbindlist successfully on the returned list? If that completes successfully, then it is a rare problem that will likely take many iterations to figure out.
Indeed it does not! The temp object resulting from the parSapply call above looks like this:
> temp
$Score
[1] -20.80486
$Score
[1] -19.82228
$Score
[1] -19.71769
$Score
[1] -20.19688
and applying the function gives this error:
> scoreSummary <- rbindlist(temp)
Error in rbindlist(temp) :
Item 1 of input is not a data.frame, data.table or list
I wonder if it's the way I specified the scoring function? It's not a straightforward OOB score: I'm scoring on performance against an external dataset - see here:
scoringFunctionBinary.Uswing <- function(max_depth,
                                         min_child_weight,
                                         subsample,
                                         colsample_bytree,
                                         eta,
                                         gamma,
                                         nround
) {
  dtrain = xgb.DMatrix(as.matrix(dtrain[,colnames(test.data),with=FALSE]),label = dtrain$vote2020)
  test.data = xgb.DMatrix(data = as.matrix(test.data))
  Pars <- list(
    booster = "gbtree",
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    colsample_bytree = colsample_bytree,
    eta = eta,
    gamma = gamma,
    nround = nround,
    objective = "binary:logistic"
  )
  res.table = data.table()
  for(i in 1:length(Folds)){
    xgb.train = xgboost::xgb.train(data = dtrain[Folds[[i]],],
                                   params = Pars[-which(names(Pars)=="nround")],
                                   nrounds = Pars$nround,
                                   nthread = 4,
                                   verbose = 2)
    # take a minimal subsample to speed up cross-validation
    xgb.pred = predict(object = xgb.train,newdata = test.data)
    state.strat.frame$vote2020_pred = xgb.pred
    test.data.pred = state.strat.frame[,lapply(.SD,function(x){sum(x*raw_count)}),by = c('state'),.SDcols = c('vote2020_pred')]
    test.data.count = state.strat.frame[,lapply(.SD,function(x){sum(x)}),by = c('state'),.SDcols = c('raw_count')]
    test.data.temp = merge(test.data.pred,test.data.count,by = 'state')
    test.data.temp$R_pct_pred = test.data.temp$vote2020_pred/test.data.temp$raw_count
    test.data.temp = test.data.temp[match(results.2016$state,test.data.temp$state),]
    residual = results.2016$Uswing_Pred - test.data.temp$R_pct_pred
    res.table = cbind(res.table,data.table(res2 = 100*(residual)))
  }
  return(
    list(
      Score = -sqrt(mean(rowMeans(res.table^2)))
    )
  )
}
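For readers following the arithmetic in the last step, here is a small worked sketch of how Score is formed from the per-fold residuals. The numbers are made up for illustration, and a plain matrix stands in for res.table. The RMSE is negated, presumably so that a larger Score corresponds to a smaller error.

```r
# Toy residuals (made-up values): 3 states (rows) x 2 folds (columns),
# in percentage points, standing in for res.table.
resid <- cbind(fold1 = c(1, -2, 3), fold2 = c(-1, 2, -3))

# Mean squared residual per state, averaged over folds.
perState <- rowMeans(resid^2)

# Negative root mean squared error across states: always a single number.
score <- -sqrt(mean(perState))
print(perState)
print(score)
```

Because the result collapses to one number regardless of how many folds or states there are, a list(Score = ...) built this way should always hold a length-1 Score.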
I don't see anything wrong with your scoring function. The result from each loop in the foreach will be a data.table, so running rbindlist directly on temp wouldn't work. At this point I need a reproducible example to help; trying to diagnose problems from user-defined functions on user data that I don't have access to is a bit of a headache.
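The difference in shape can be illustrated with a toy example. This is a sketch, not the package's internal code: toyScore is a hypothetical stand-in for the real scoring function, and base sapply/lapply are used on the assumption that parSapply/parLapply simplify their results the same way. The rbindlist call runs only if data.table is installed.

```r
# Hypothetical stand-in for a scoring function returning list(Score = <scalar>).
toyScore <- function(i) list(Score = -as.numeric(i))

# sapply() simplifies length-1 results, so each element of the output is a
# bare number named "Score" rather than a list containing a Score element.
flattened <- sapply(1:3, toyScore)

# lapply() keeps each result as its own list, one per parameter set.
kept <- lapply(1:3, toyScore)

print(is.list(flattened[[1]]))
print(is.list(kept[[1]]))

# rbindlist() needs each item to be a list, data.frame or data.table,
# so it can stack 'kept' but not 'flattened'.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(data.table::rbindlist(kept))
}
```

To test the rbindlist step on real results, collecting them with parLapply instead of parSapply should preserve the per-call list structure.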
Excellent package - I've been working with it and XGBoost for some time now, and it shows incredible promise.
I am, however, facing the following error, and I'm not sure where it is coming from: Error in rbindlist(scoreSummary) : Column 2 of item 1 is length 9 inconsistent with column 1 which is length 100. Only length-1 columns are recycled.
Do you have any idea what is causing this?
Again, many thanks for the package, Roberto