PhilippPro / tuneRanger

Automatic tuning of random forests

Error in tuneRanger when using case weights #9

Closed: sschooler closed this issue 1 year ago

sschooler commented 3 years ago

Hello, I have a large data frame and am attempting to run tuneRanger on it using case weights. I continue to get the error: "Error in evalTargetFun.OptState(opt.state, xs, extras) : Objective function output must be a numeric of length 1, but we got: NaN"

For the data frame, I want to weight the random forest using the last column (range 0–26.25), and do not want to include the second-to-last column in the analysis.

library(tuneRanger)
library(mlr)

div.data <- read.csv("~/alldata.csv", row.names = 1)
task.com <- makeRegrTask(data = div.data[,1:13], target = "community",
                         weights = div.data$total.species)

tune.com <- tuneRanger(task.com, num.trees = 100, parameters = list(replace = TRUE, respect.unordered.factors = "order"))

If I run tuneRanger without the weights added, it works fine.

If I only use the first 5000 observations (all weights are under 1) and add +1 to the weight vector, it sometimes works. If I normalize the vector and add 1, it sometimes works for the first 20,000 or so observations.

When I run it simply in ranger it works just fine:

library(ranger)

rf1 <- ranger(community ~ ., num.trees = 1000, data = div.data[, 1:13],
              case.weights = div.data$total.species,
              mtry = 11, min.node.size = 3, sample.fraction = 1,
              importance = "impurity")

Any ideas as to why this issue is coming up? I tried using tuneRanger with weights on the iris dataset (weighted using the Sepal.Length column) and it worked, so I assume it is something in my dataset, but I can't figure out what.

Thanks for any guidance you may be able to provide!

PhilippPro commented 3 years ago

Dear @sschooler, can you provide a minimal example (e.g. with the iris dataset that is freely available in R) showing that it does not work? Without the underlying data, it is hard to reproduce your problem.

chris-s-bowden commented 1 year ago

Dear @PhilippPro, I have encountered the exact same error as @sschooler, and it persists in the following minimal example using iris:

library(tuneRanger)
library(mlr)
library(dplyr)

data(iris)

# Remove 'versicolor' observations for simplicity of this example
iris_flt <- iris %>% filter(Species != 'versicolor')
iris_flt$Species <- factor(iris_flt$Species, levels = c('setosa', 'virginica'))

# Add weights for each species
temp <- iris_flt %>% mutate(weights = ifelse(Species == 'setosa', 0.2,
                                                 ifelse(Species == 'virginica', 8.5, NA)))

# Create numeric vector of weights
iris_weights <- temp$weights

# Create classification task including weights
iris.task <- makeClassifTask(data = iris_flt, target = 'Species', weights = iris_weights)

# The following line is where the error is produced
iris.tune <- tuneRanger(iris.task)

Thanks for taking the time to read this, fingers crossed it's something simple I've missed!

PhilippPro commented 1 year ago

I found out the problem. If you set the weights very differently across observations, then some observations end up being sampled into every tree during bagging, so no out-of-bag predictions exist for them, and the out-of-bag predictions are what the tuning relies on.
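This sampling effect can be sketched with a quick back-of-the-envelope calculation (my own illustration, not part of the original reply): with sampling with replacement, an observation whose relative weight is p is in-bag in a single tree of sample size n with probability 1 - (1 - p)^n, and it has no out-of-bag prediction at all if that happens in every one of the num.trees trees.

```r
# Rough sketch (assumes sampling n observations with replacement,
# sample.fraction = 1) of how extreme case weights starve an
# observation of out-of-bag predictions.
w <- c(rep(0.2, 50), rep(8.5, 50))   # the weights from the iris example
p <- w / sum(w)                      # per-draw selection probabilities
n <- length(w)                       # bootstrap sample size

p_inbag_one_tree <- 1 - (1 - p)^n    # P(obs appears in one tree's sample)
p_never_oob <- function(num.trees) { # P(obs is in-bag in every tree,
  p_inbag_one_tree^num.trees         #   i.e. never gets an OOB prediction)
}

# A high-weight (virginica) observation is in-bag in roughly 86% of
# trees, a low-weight (setosa) one in only about 4%:
round(range(p_inbag_one_tree), 2)
```

The more unequal the weights, the closer p_inbag_one_tree gets to 1 for the heavy observations, and the more trees you need before every observation is out of bag at least once.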

A minimal example that works, with less extreme weights (0.4 instead of 8.5):

library(tuneRanger)
library(mlr)
library(dplyr)

data(iris)

# Remove 'versicolor' observations for simplicity of this example
iris_flt <- iris %>% filter(Species != 'versicolor')
iris_flt$Species <- factor(iris_flt$Species, levels = c('setosa', 'virginica'))

# Add weights for each species
temp <- iris_flt %>% mutate(weights = ifelse(Species == 'setosa', 0.2,
                                             ifelse(Species == 'virginica', 0.4, NA)))

# Create numeric vector of weights
iris_weights <- temp$weights

# Create classification task including weights
iris.task <- makeClassifTask(data = iris_flt, target = 'Species', weights = iris_weights)

# With the less extreme weights, this call now runs without the error
iris.tune <- tuneRanger(iris.task)

With a weighting as extreme as the one above, it would also be better to simply leave these observations out. ;)

Alternatively, you can set the number of trees to a very high value, so that even with very unequal weights every observation is sampled out of bag in at least one tree. In your case, however, even setting num.trees to 100000 did not completely solve the issue.
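For completeness, raising the tree count only needs the num.trees argument that tuneRanger already exposes; the second workaround shown below, compressing the weight ratio before building the task (here with a square root), is my own suggestion rather than something from this thread. The sketch reuses the iris setup from the example above.

```r
library(tuneRanger)
library(mlr)

# Same two-class iris setup as in the example above
data(iris)
iris_flt <- iris[iris$Species != "versicolor", ]
iris_flt$Species <- factor(iris_flt$Species, levels = c("setosa", "virginica"))
iris_weights <- ifelse(iris_flt$Species == "setosa", 0.2, 8.5)

# Workaround 1: many more trees, so every observation lands out of
# bag at least once (as noted above, this may still not fully suffice)
iris.task <- makeClassifTask(data = iris_flt, target = "Species",
                             weights = iris_weights)
iris.tune <- tuneRanger(iris.task, num.trees = 100000)

# Workaround 2 (my suggestion, not from the thread): compress the
# weight ratio before building the task, e.g. with a square root
iris.task2 <- makeClassifTask(data = iris_flt, target = "Species",
                              weights = sqrt(iris_weights))
iris.tune2 <- tuneRanger(iris.task2)
```

The square root shrinks the 8.5 : 0.2 ratio (42.5) down to about 6.5, which keeps the relative ordering of the weights while making it far less likely that any observation is in-bag in every tree.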

Hope this helped to clarify.