BigDataWUR / AgML-CY-Bench

CY-Bench (Crop Yield Benchmark) is a comprehensive dataset and benchmark for forecasting crop yields at the subnational level. CY-Bench standardizes the selection, processing, and spatio-temporal harmonization of public subnational yield statistics with relevant predictors. Contributors include agronomists, climate scientists, and machine learning researchers.
https://cybench.agml.org/

NN hyperparameter tuning / nested cross-validation #175

Closed aikepotze closed 1 month ago

aikepotze commented 1 month ago

Added an option to do hyperparameter tuning within the NN .fit() function. Tuning can be done with a randomly sampled validation set for each outer fold or with k-fold cross-validation. The default is random sampling, for quicker testing.

New models are initialized to test each hyperparameter combination, instead of the model resetting itself. Let me know if you think this is correct.
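As a sketch of that pattern: instead of resetting weights in place, each trial builds a brand-new instance from the stored constructor arguments. The class and method names here are hypothetical; only the `self.__class__(**self._init_args)` idiom comes from this PR:

```python
class TunableModel:
    """Toy model illustrating fresh re-instantiation per hyperparameter trial.

    Hypothetical names; only the self.__class__(**self._init_args) pattern
    mirrors the approach described in this PR.
    """

    def __init__(self, hidden_size=16):
        self._init_args = {"hidden_size": hidden_size}
        self.hidden_size = hidden_size
        self.trained = False

    def fresh_copy(self, **overrides):
        # Build a brand-new, untrained model rather than resetting weights
        # in place; overrides let a trial change hyperparameters.
        args = {**self._init_args, **overrides}
        return self.__class__(**args)


model = TunableModel(hidden_size=16)
trial = model.fresh_copy(hidden_size=32)
print(trial.hidden_size, trial.trained)  # 32 False
```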

After finding the best hyperparameter combination, the model trains with these settings and with a randomly sampled validation set whose size is set by val_fraction. This can be set to 0 in the final benchmark, or used for early stopping of the final model.
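A minimal sketch of that final split, assuming a hypothetical helper that holds out a random fraction of years; with `val_fraction=0` the validation set is empty, matching the final-benchmark setting:

```python
import random


def split_by_val_fraction(years, val_fraction, seed=42):
    """Randomly hold out a fraction of years for validation.

    Hypothetical helper, not the actual CY-Bench API: val_fraction=0
    returns all years for training (no validation set / early stopping).
    """
    years = list(years)
    rng = random.Random(seed)
    rng.shuffle(years)
    n_val = int(round(val_fraction * len(years)))
    return sorted(years[n_val:]), sorted(years[:n_val])


train_years, val_years = split_by_val_fraction(range(2000, 2010), 0.2)
print(len(train_years), len(val_years))  # 8 2
```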

Did some initial testing on the benchmark with 2- and 3-layer LSTMs; both were slightly worse than AverageYieldModel, so we need to do more tuning. Early stopping might help, or removing the local average yield from the signal.

I do not have much time today to do more experiments. Feel free to experiment and let me know if you have questions.

krsnapaudel commented 1 month ago

Is it possible to do a group k-fold with groups based on years? Example code:

from sklearn.model_selection import GroupKFold

groups = data[KEY_YEAR].values
group_kfold = GroupKFold(n_splits=5)
# <keys_x> is a placeholder for the feature columns
for i, (train_index, valid_index) in enumerate(group_kfold.split(data[<keys_x>], data[KEY_TARGET], groups)):
    valid_years = list(groups[valid_index])
    valid_data = data[data[KEY_YEAR].isin(valid_years)]
    train_data = data[~data[KEY_YEAR].isin(valid_years)]

The code does seem to do random selection of years. So looks good to me.

aikepotze commented 1 month ago


if do_kfold:
    # Split the training years into k folds
    all_years = dataset.years
    list_all_years = list(all_years)
    random.shuffle(list_all_years)
    cv_folds = [list_all_years[i::kfolds] for i in range(kfolds)]
    # For each fold, create a new model and datasets, train and record the
    # validation loss.
    val_loss_fold = []
    for j, val_fold in enumerate(cv_folds):
        print(f"Running inner fold {j+1}/{kfolds} for hyperparameter setting {i+1}/{len(settings)}")
        val_years = val_fold
        train_years = [y for y in all_years if y not in val_years]
        train_dataset, val_dataset = dataset.split_on_years((train_years, val_years))
        new_model = self.__class__(**self._init_args)
        _, output = new_model.train_model(train_dataset=train_dataset, val_dataset=val_dataset, *args, **setting)
        val_loss_fold.append(output["val_loss"])
    # Average the validation loss once all folds have run (outside the loop)
    val_loss = np.mean(val_loss_fold)

Code above divides training set into k folds based on years.
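The strided slicing in the snippet (`list_all_years[i::kfolds]`) partitions the shuffled years into disjoint folds that together cover every year. A standalone sketch of the same construction, with a hypothetical year range:

```python
import random


def year_folds(all_years, kfolds, seed=0):
    """Partition years into k disjoint folds via strided slicing,
    mirroring the cv_folds construction in the snippet above."""
    years = list(all_years)
    random.Random(seed).shuffle(years)
    return [years[i::kfolds] for i in range(kfolds)]


folds = year_folds(range(2000, 2013), kfolds=5)
# Every year lands in exactly one fold and no fold is empty.
assert sorted(y for f in folds for y in f) == list(range(2000, 2013))
print([len(f) for f in folds])  # [3, 3, 3, 2, 2]
```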

aikepotze commented 1 month ago

Ready to merge. Currently implemented early stopping. It stores a copy of the best model under model.best_model and uses it in the .predict_batch function.