autogluon / autogluon

Fast and Accurate ML in 3 Lines of Code
https://auto.gluon.ai/
Apache License 2.0
7.82k stars · 913 forks

Support manual assignment of cross-validation folds #4014

Closed dluks closed 4 months ago

dluks commented 6 months ago

Description

This feature request pertains to the tabular module.

For some types of data, such as geospatial data, spatial autocorrelation can confound the evaluation of k-fold cross-validation. In cases like these, spatially aware fold assignment is especially useful. With this in mind, it would greatly enhance the confidence in TabularPredictor performance results if custom fold assignment IDs could be manually pre-assigned and then passed to TabularPredictor.fit().
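For context, one common way to build spatially aware folds (independent of AutoGluon) is spatial blocking: grid the coordinates into blocks and keep each block's points in the same fold, so that spatially adjacent points never straddle a train/validation split. The sketch below is illustrative only; the function name `assign_spatial_folds` and the degree-based `block_size` are assumptions, not part of any library API.

```python
import numpy as np

def assign_spatial_folds(lon, lat, n_folds=5, block_size=1.0):
    """Illustrative spatial-block fold assignment (not an AutoGluon API).

    Discretizes coordinates into square blocks of `block_size` degrees,
    then maps each block to a fold round-robin, so points in the same
    block always share a fold (this is what limits spatial leakage).
    """
    block_id = (np.floor(lon / block_size).astype(int) * 100003
                + np.floor(lat / block_size).astype(int))
    unique_blocks = np.unique(block_id)
    block_to_fold = {b: i % n_folds for i, b in enumerate(unique_blocks)}
    return np.array([block_to_fold[b] for b in block_id])

# Synthetic coordinates for demonstration
rng = np.random.default_rng(0)
lon = rng.uniform(-10, 10, 500)
lat = rng.uniform(40, 60, 500)
folds = assign_spatial_folds(lon, lat, n_folds=5)
```

A fold column produced this way is exactly the kind of pre-assigned ID the feature request asks to pass to `TabularPredictor.fit()`.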

References

Innixma commented 6 months ago

Thanks for creating an issue and providing a variety of links!

We do support manual group assignment. Refer to the groups parameter in https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html.

Can you try this logic and see if it satisfies your needs?

If not, could you provide a reproducible code example that uses a dataset where such manual assignment leads to an improved result? We can use this example to test potential solutions.

alberto-jj commented 5 months ago

Hi, following up on this.

Could you please confirm whether the mentioned "groups" parameter of the TabularPredictor class can be used to perform a typical k-fold cross-validation (or stratified CV), where the AutoGluon models are trained on all groups/folds except one (the held-out validation fold)?

Below is a random df with a "Fold" column that I want to use to define the training/validation splits used for the internal bagging of AutoGluon.

Is my interpretation correct, or am I misunderstanding the documentation? Thanks a lot for your reply!

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from autogluon.tabular import TabularDataset, TabularPredictor

# Generate random data
np.random.seed(42)
data_size = 300
feature1 = np.random.normal(loc=0, scale=1, size=data_size)
feature2 = np.random.normal(loc=5, scale=2, size=data_size)
group_labels = np.random.choice(['Group_A', 'Group_B', 'Group_C'], size=data_size)

# Create DataFrame
df = pd.DataFrame({
    'Feature1': feature1,
    'Feature2': feature2,
    'Group': group_labels
})

# Apply Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_column = np.zeros(data_size, dtype=int)

for fold, (train_index, test_index) in enumerate(skf.split(df, df['Group'])):
    fold_column[test_index] = fold

# Add the Fold column (to be used as "groups" in TabularPredictor)
df['Fold'] = fold_column

# AutoGluon using user-defined stratified k-fold CV
train_data = df
label = 'Group'
groups = 'Fold'
save_path = 'data/test_folds_automl/'

predictor = TabularPredictor(label=label, path=save_path, groups=groups, eval_metric='balanced_accuracy',
                             sample_weight='auto_weight').fit(train_data, time_limit=80, auto_stack=True)

dluks commented 4 months ago

Hi @Innixma, my apologies for the delay despite your quick response! Things got a bit side-tracked...

Anyway, I was able to assign my folds via a column that I then passed to groups, and it seems to do what I would expect after also setting num_bag_folds=10 (I have 10 folds). In that case, I assume my final score_val is a simple mean over the 10 models? If so, it would be great to also be able to retrieve the standard deviation, though I suppose this could be done manually by re-computing each fold's performance metrics with its respective model. So yes, I think this does exactly what I had hoped!
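The manual workaround described above can be sketched without AutoGluon at all: given the true labels, the out-of-fold predictions (however they were obtained from the bagged models), and the fold column, per-fold scores and their standard deviation follow from plain numpy and scikit-learn. The function name `per_fold_scores` and the tiny arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def per_fold_scores(y_true, y_pred, fold_ids):
    """Illustrative helper: score each fold's held-out predictions separately."""
    scores = []
    for fold in np.unique(fold_ids):
        mask = fold_ids == fold
        scores.append(balanced_accuracy_score(y_true[mask], y_pred[mask]))
    return np.array(scores)

# Toy example: 8 samples, 2 folds of 4
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0, 1, 1, 1])
folds  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

scores = per_fold_scores(y_true, y_pred, folds)
mean, std = scores.mean(), scores.std(ddof=1)  # mean matches the simple average; std is the missing piece
```

The mean here corresponds to the simple average over folds, and `std` is exactly the spread that the thread notes is not yet exposed by AutoGluon.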

Side notes that probably belong in other tickets:

One thing I noticed, though, is that it seems to exclude the neural network models from being trained. Is this expected, due to the experimental status of the groups functionality? It may also be that with my time_limit and data size those models are simply never reached due to time constraints. I'll explore further and let you know.

One last note: I also encountered what may be a bug when specifying num_gpus with groups and presets="best_quality", where the sub-fit just hangs and never actually does anything. I'll also test this further and report back if I can narrow down the settings needed to reproduce it.

Innixma commented 4 months ago

Thanks @dluks! We don't support retrieving the std dev across folds yet, but it is something we plan to add in an upcoming release. For the neural network issue, a reproducible example in a new ticket would be great if you find the time, so we can take a closer look. I wouldn't expect groups to cause this, and would be surprised if it did; it is probably the time limit.

Regarding the GPU hanging, this is fixed in pre-release / mainline and will be part of the next release.

@alberto-jj Yes, that should work.