autogluon / autogluon

Fast and Accurate ML in 3 Lines of Code
https://auto.gluon.ai/
Apache License 2.0

Autogluon 0.7.0 vs Autogluon 1.1.0 performance degradation on simple regression task #4255

Closed: aditya1503 closed this issue 2 weeks ago

aditya1503 commented 3 weeks ago

Bug Report Checklist

Describe the bug When running the regression task multiple times with the function y = 2*x + 5, AutoGluon 1.1.0 consistently performs worse than AutoGluon 0.7.0.

Expected behavior I expected AutoGluon 1.1.0 to perform at least as well as AutoGluon 0.7.0 on this simple mathematical function regression task, i.e. to achieve a similarly high R2 score.

To Reproduce The code snippets below reproduce the issue: first generate the dataset, then fit and evaluate AutoGluon Tabular on it.

import numpy as np
import pandas as pd
num_samples = 20000
main_x_min = 0
main_x_max = 100
few_x_min = 2000
few_x_max = 10000
few_samples_ratio = 0.0002  # 0.02% of samples fall above 2000

# Generate most x values between 0 and 100
main_x_samples = int(num_samples * (1 - few_samples_ratio))
x_main = np.random.uniform(main_x_min, main_x_max, main_x_samples)
x_main[-1] = -1.0  # force a single negative x value into the bulk range
# Generate a few x values above 2000
few_x_samples = num_samples - main_x_samples
x_few = np.random.uniform(few_x_min, few_x_max, few_x_samples)

# Combine both ranges of x values
x = np.concatenate((x_main, x_few))

# Multiplicative noise factor for x (std is 0.0 here, so x is effectively noise-free)
x_noise = np.random.normal(1, 0.0, num_samples)
x_noisy = x * x_noise

# Generate y from the function y = 2x + 5 with a multiplicative noise factor
# (std is 0.0 here as well, so y is also effectively noise-free)
y_noise = np.random.normal(1, 0.0, num_samples)
y = (2 * x_noisy + 5) * y_noise
data = pd.DataFrame({'x': x_noisy, 'y': y})
data.to_csv('regression_dataset.csv', index=False)
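
As a quick sanity check of what the generator actually writes (this snippet is my own addition, not part of the original report; it assumes the regression_dataset.csv produced above):

import pandas as pd

df = pd.read_csv('regression_dataset.csv')
print(len(df))                # 20000 rows
print((df['x'] > 100).sum())  # 4 outlier rows, i.e. 0.02% of the data
print(df['x'].min())          # -1.0, the single forced negative value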

from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.tabular import __version__
print("Autogluon tabular version:", __version__)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
dataset = pd.read_csv('regression_dataset.csv')
# Split into training and held-out test sets (note: no fixed random_state)
training_dataset, non_training_dataset = train_test_split(dataset, test_size=0.3)
label = 'y'
train_data = TabularDataset(training_dataset)
test_data = TabularDataset(non_training_dataset)

# Fit the TabularPredictor on the training data
predictor = TabularPredictor(label=label, problem_type='regression').fit(train_data)

# Predictions on test data
predictions_test = predictor.predict(test_data)
test_labels = test_data[label]

# Calculate R2 score for test data
r2_test = r2_score(test_labels, predictions_test)
print("R2 Score test:", r2_test)

# Calculate MSE for test data
mse_test = mean_squared_error(test_labels, predictions_test)
print("MSE test:", mse_test)

# Predictions on train data
predictions_train = predictor.predict(train_data)
train_labels = train_data[label]

# Calculate R2 score for train data
r2_train = r2_score(train_labels, predictions_train)
print("R2 Score train:", r2_train)

# Calculate MSE for train data
mse_train = mean_squared_error(train_labels, predictions_train)
print("MSE train:", mse_train)
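
Not in the original report, but AutoGluon's built-in evaluation helpers can narrow down which models drive the score; this sketch assumes the predictor and test_data defined above:

# Metric summary and per-model leaderboard on the held-out data
print(predictor.evaluate(test_data))
print(predictor.leaderboard(test_data))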

Screenshots / Logs

AutoGluon 0.7.0's performance:

[screenshot of metric output, 2024-06-08]

AutoGluon 1.1.0's performance:

[screenshot of metric output, 2024-06-08]

Installed Versions

INSTALLED VERSIONS
------------------
date                   : 2024-06-08
time                   : 00:34:13.849737
python                 : 3.10.14.final.0
OS                     : Linux
OS-release             : 6.5.0-35-generic
Version                : #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2
machine                : x86_64
processor              : x86_64
num_cores              : 8
cpu_ram_mb             : 64198.94921875
cuda version           : 12.525.147.05
num_gpus               : 2
gpu_ram_mb             : [7790, 11163]
avail_disk_size_mb     : 34986

accelerate             : 0.21.0
autogluon              : None
autogluon.common       : 1.1.0
autogluon.core         : 1.1.0
autogluon.features     : 1.1.0
autogluon.multimodal   : 1.1.0
autogluon.tabular      : 1.1.0
boto3                  : 1.34.69
catboost               : 1.2.5
defusedxml             : 0.7.1
evaluate               : 0.4.2
fastai                 : 2.7.15
hyperopt               : 0.2.7
imodels                : None
jinja2                 : 3.1.3
jsonschema             : 4.21.1
lightgbm               : 4.1.0
lightning              : 2.1.4
matplotlib             : 3.7.1
networkx               : 3.3
nlpaug                 : 1.1.11
nltk                   : 3.8.1
nptyping               : 2.4.1
numpy                  : 1.24.4
nvidia-ml-py3          : 7.352.0
omegaconf              : 2.2.3
onnxruntime-gpu        : None
openmim                : 0.3.9
pandas                 : 2.0.0
pdf2image              : 1.17.0
Pillow                 : 10.2.0
psutil                 : 5.9.4
pytesseract            : 0.3.10
pytorch-metric-learning: 2.3.0
ray                    : 2.10.0
requests               : 2.28.2
scikit-image           : 0.20.0
scikit-learn           : 1.3.0
scikit-learn-intelex   : None
scipy                  : 1.9.1
seqeval                : 1.2.2
setuptools             : 60.2.0
skl2onnx               : None
tabpfn                 : None
tensorboard            : 2.16.2
text-unidecode         : 1.3
timm                   : 0.9.16
torch                  : 2.1.2+cpu
torchmetrics           : 1.2.1
torchvision            : 0.16.2+cpu
tqdm                   : 4.65.2
transformers           : 4.38.2
vowpalwabbit           : 8.10.1
xgboost                : 2.0.3

Innixma commented 2 weeks ago

Please provide 2 Colab notebook links so we can reproduce this.

To me, it looks like your 0.7 run did not use the same data generation logic and did not include the noise. With the noise actually applied, it should be impossible to achieve the score reported.
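
For illustration (my own sketch, not from the thread): even a model that recovers y = 2x + 5 exactly cannot reach a near-perfect R2 once a multiplicative noise term with nonzero std is actually applied; the 0.1 std below is an assumed value:

import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 20000)
y_observed = (2 * x + 5) * rng.normal(1, 0.1, x.size)  # assumed std = 0.1
y_perfect_model = 2 * x + 5                            # the true function

print(r2_score(y_observed, y_perfect_model))  # ~0.96, not ~1.0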

Additionally, the code provided does not fix seeds for either the data generation or the train/test split, so the scores are not comparable across runs.
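
A minimal sketch of what fixed seeding could look like (the seed values and the stand-in generation line are my own choices, not from the report):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)  # pins every subsequent np.random.* call in the generation script
x = np.random.uniform(0, 100, 20000)          # stands in for the generation code above
dataset = pd.DataFrame({'x': x, 'y': 2 * x + 5})

training_dataset, non_training_dataset = train_test_split(
    dataset, test_size=0.3, random_state=0    # pins the split as well
)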

Running with AutoGluon 1.1.0 I get good results:

R2 Score test: 0.9962127006040331
MSE test: 137.0034214286692
R2 Score train: 0.9916397365538735
MSE train: 493.6924675927423

Because you create a few outlier samples with much larger x/y values, those dominate the squared-error loss calculations; since the seed is not fixed, whether and where those outliers land in the train/test split varies, which explains the difference between runs.
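
To make that concrete (my own sketch; the outlier values are assumed): at the same 5% relative error, 4 outlier rows contribute more squared error than all 19,996 bulk rows combined:

import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 20000)
x[:4] = [2500, 4000, 7000, 9500]  # 4 illustrative outliers (~0.02%, as in the script)
y = 2 * x + 5

pred_bulk_err = y.copy()
pred_bulk_err[4:] *= 1.05         # 5% error on all 19,996 bulk rows
pred_outlier_err = y.copy()
pred_outlier_err[:4] *= 1.05      # 5% error on only the 4 outlier rows

print(mean_squared_error(y, pred_bulk_err))     # ~36
print(mean_squared_error(y, pred_outlier_err))  # ~81: four rows dominate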

Please re-open the issue if you still find major differences after resolving the above issues.