autogluon / autogluon

Fast and Accurate ML in 3 Lines of Code
https://auto.gluon.ai/
Apache License 2.0
7.82k stars · 913 forks

Support manual assignment of cross-validation folds #4014

Closed dluks closed 4 months ago

dluks commented 6 months ago

Description

This feature request pertains to the tabular module.

For some types of data, such as geospatial data, spatial autocorrelation can confound the evaluation of k-fold cross-validation. In cases like these, spatially aware fold assignment is especially useful. With this in mind, it would greatly enhance the confidence in TabularPredictor performance results if custom fold assignment IDs could be manually pre-assigned and then passed to TabularPredictor.fit().
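For context, one common way to build spatially aware folds (independent of AutoGluon) is spatial blocking: grid the coordinates into blocks and keep each block's points in the same fold, so that spatially adjacent points never straddle a train/validation split. The sketch below is illustrative only; the function name `assign_spatial_folds` and the degree-based `block_size` are assumptions, not part of any library API.

```python
import numpy as np

def assign_spatial_folds(lon, lat, n_folds=5, block_size=1.0):
    """Illustrative spatial-block fold assignment (not an AutoGluon API).

    Discretizes coordinates into square blocks of `block_size` degrees,
    then maps each block to a fold round-robin, so points in the same
    block always share a fold (this is what limits spatial leakage).
    """
    block_id = (np.floor(lon / block_size).astype(int) * 100003
                + np.floor(lat / block_size).astype(int))
    unique_blocks = np.unique(block_id)
    block_to_fold = {b: i % n_folds for i, b in enumerate(unique_blocks)}
    return np.array([block_to_fold[b] for b in block_id])

# Synthetic coordinates for demonstration
rng = np.random.default_rng(0)
lon = rng.uniform(-10, 10, 500)
lat = rng.uniform(40, 60, 500)
folds = assign_spatial_folds(lon, lat, n_folds=5)
```

A fold column produced this way is exactly the kind of pre-assigned ID the feature request asks to pass to `TabularPredictor.fit()`.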

References

Innixma commented 6 months ago

Thanks for creating an issue and providing a variety of links!

We do support manual group assignment. Refer to the groups parameter in https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html.

Can you try this logic and see if it satisfies your needs?

If not, could you provide a reproducible code example that uses a dataset where such manual assignment leads to an improved result? We can use this example to test potential solutions.

alberto-jj commented 5 months ago

Hi, following up on this.

Could you please confirm whether the mentioned "groups" parameter of the TabularPredictor class can be used to perform a typical k-fold cross-validation (or stratified CV), where the AutoGluon models are trained on all groups/folds except one (the held-out validation fold)?

Below is a random df with a "Fold" column that I want to use to define the training/validation splits used for the internal bagging of AutoGluon.

Is my interpretation correct, or am I misunderstanding the documentation? Thanks a lot for your reply!

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from autogluon.tabular import TabularDataset, TabularPredictor

# Generate random data
np.random.seed(42)
data_size = 300
feature1 = np.random.normal(loc=0, scale=1, size=data_size)
feature2 = np.random.normal(loc=5, scale=2, size=data_size)
group_labels = np.random.choice(['Group_A', 'Group_B', 'Group_C'], size=data_size)

# Create DataFrame
df = pd.DataFrame({
    'Feature1': feature1,
    'Feature2': feature2,
    'Group': group_labels
})

# Apply Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_column = np.zeros(data_size, dtype=int)

for fold, (train_index, test_index) in enumerate(skf.split(df, df['Group'])):
    fold_column[test_index] = fold

# Add the Fold column (to be used as "groups" in TabularPredictor)
df['Fold'] = fold_column

# AutoGluon using user-defined stratified k-fold CV
train_data = df
label = 'Group'
groups = 'Fold'
save_path = 'data/test_folds_automl/'

predictor = TabularPredictor(label=label, path=save_path, groups=groups, eval_metric='balanced_accuracy',
                             sample_weight='auto_weight').fit(train_data, time_limit=80, auto_stack=True)

dluks commented 4 months ago

Hi @Innixma, my apologies for the delay despite your quick response! Things got a bit side-tracked...

Anyway, I was able to assign my folds via a column that I then passed to groups, and it seems to do what I would expect after also setting num_bag_folds=10 (I have 10 folds). In that case, I assume my final score_val is a simple mean over the 10 models? If so, it would be great to also be able to retrieve the standard deviation, though I suppose this could be done manually by re-computing each fold's performance metrics with its respective model. So yes, I think this does exactly what I had hoped!
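The manual workaround described above can be sketched without AutoGluon at all: given the true labels, the out-of-fold predictions (however they were obtained from the bagged models), and the fold column, per-fold scores and their standard deviation follow from plain numpy and scikit-learn. The function name `per_fold_scores` and the tiny arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def per_fold_scores(y_true, y_pred, fold_ids):
    """Illustrative helper: score each fold's held-out predictions separately."""
    scores = []
    for fold in np.unique(fold_ids):
        mask = fold_ids == fold
        scores.append(balanced_accuracy_score(y_true[mask], y_pred[mask]))
    return np.array(scores)

# Toy example: 8 samples, 2 folds of 4
y_true = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0, 1, 1, 1])
folds  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

scores = per_fold_scores(y_true, y_pred, folds)
mean, std = scores.mean(), scores.std(ddof=1)  # mean matches the simple average; std is the missing piece
```

The mean here corresponds to the simple average over folds, and `std` is exactly the spread that the thread notes is not yet exposed by AutoGluon.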

Side notes that probably belong in other tickets:

One thing I noticed, though, is that it seems to exclude the neural network models from being trained. Is this expected, due to the experimental status of the groups functionality? It may also be that with my time_limit and data size those models are simply never reached due to time constraints. I'll explore further and let you know.

One last note: I also encountered what may be a bug when specifying num_gpus with groups and presets="best_quality", where the sub-fit just hangs and never actually does anything. I'll also test this further and report back if I can narrow down the settings needed to reproduce it.

Innixma commented 4 months ago

Thanks @dluks! We don't support retrieving the std dev across folds yet, but it is something we plan to add in an upcoming release. For the neural network issue, a reproducible example in a new ticket would be great if you find the time, so we can take a closer look. I wouldn't expect groups to cause this, and would be surprised if it did; it is probably the time limit.

Regarding the GPU hanging, this is fixed in pre-release / mainline and will be part of the next release.

@alberto-jj Yes, that should work.