HannaMeyer / CAST

Developer Version of the R package CAST: Caret Applications for Spatio-Temporal models
https://hannameyer.github.io/CAST/

TuneGrid causes error with ranger method in ffs #17

Closed khalilT closed 3 years ago

khalilT commented 3 years ago

Hi Hanna,

Thanks a lot for the great package! I noticed that when using ffs with method = "ranger", it is not possible to pass a tuneGrid data frame as one would with caret. Only the tuneLength argument works (luckily).

When using a tune grid, R throws the following error:

[1] "model using NDVI,soil_moist will be trained now..."
Something is wrong; all the RMSE metric values are missing:
      RMSE        Rsquared        MAE     
 Min.   : NA   Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1     NA's   :1    
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :

I also tried adding the metric argument, metric = c("RMSE"), but that didn't help.

Here is my code:

train_ffs_model <- function(data){
  # Fixed ranger hyperparameters; enabling this grid triggers the error above
  #tuneGrid_ffs <- expand.grid(mtry = 3, splitrule = "variance", min.node.size = 5)
  # Use every column except coordinates, grouping variable, response and geometry
  predictors <- setdiff(names(data), c("x","y","region","Lstmean","geometry"))
  # Spatial folds that hold out whole regions
  folds <- CAST::CreateSpacetimeFolds(data,spacevar = "region",k=7)
  model <- CAST::ffs(data[,predictors],data$Lstmean,
                     method="ranger",
                     importance = "permutation",
                     tuneLength = 1,
                     #tuneGrid = tuneGrid_ffs,
                     trControl=caret::trainControl(method="cv",number=10,
                                                   index = folds$index,indexOut = folds$indexOut))
  return(model)
}
library(parallel)
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

ffs_models_03_11 <- purrr::map(.x = data_years_03_11, .f = train_ffs_model)

stopCluster(cl)

I ended up commenting out the tuneGrid and relying only on tuneLength. However, I would like to have more control over the hyperparameters, and since the data frames are quite big (80k rows), ranger is much faster than rf. Or am I doing something wrong?

Thank you!

HannaMeyer commented 3 years ago

The reason is that the mtry = 3 you selected is invalid for most of the models trained during the feature selection. ffs starts with all combinations of two variables, and the maximum mtry allowed for such a model is 2. method = "rf" handles this by automatically resetting mtry to a valid range; ranger does not (you should see the message "Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.").

I adjusted ffs so that it now resets mtry to a valid range for ranger as well. Please check.
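
For readers who want to see the idea behind resetting mtry, here is a minimal sketch in base R. It is not the code that was added to CAST: the helper name clamp_mtry and its arguments are hypothetical and only illustrate capping mtry at the number of predictors in the model currently being trained.

```r
# Hypothetical helper (not part of CAST or caret): cap mtry in a tuning grid
# at the number of predictors of the model currently being trained.
clamp_mtry <- function(tune_grid, n_predictors) {
  if (!"mtry" %in% names(tune_grid)) {
    return(tune_grid)  # nothing to adjust for methods without mtry
  }
  too_large <- tune_grid$mtry > n_predictors
  if (any(too_large)) {
    warning("mtry reduced to ", n_predictors,
            " to match the number of predictors")
    tune_grid$mtry[too_large] <- n_predictors
  }
  # Capping can create duplicate rows; keep each parameter combination once.
  unique(tune_grid)
}

# The grid from the question, applied to a two-variable model
# (the size of the models in the first ffs step).
grid <- expand.grid(mtry = 3, splitrule = "variance", min.node.size = 5)
grid_2 <- suppressWarnings(clamp_mtry(grid, n_predictors = 2))
grid_2$mtry
```

With the grid from the question, mtry is reduced from 3 to 2 for two-variable models, while a model with three or more predictors would keep mtry = 3.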

khalilT commented 3 years ago

I tried it again and got the warning message. The model with the right mtry (2) did run. Thank you :)