autogluon / autogluon-cloud

AutoGluon-Cloud aims to provide tools to train, fine-tune, and deploy AutoGluon-backed models on the cloud. With just a few lines of code, users can train a model and perform inference on the cloud without worrying about MLOps details such as resource management.
Apache License 2.0

Fix batch transform issue for tabular predictor with multiple partitions #138

Closed tonyhoo closed 2 months ago

tonyhoo commented 2 months ago

Description:

This PR fixes the issue where batch transform jobs fail due to column misalignment when the input CSV file is partitioned into multiple records. The problem arises because headers from different partitions are not handled properly, leading to misaligned columns and prediction failures during inference.

Changes:

Limitations:

Steps to Reproduce: The following script can be used to reproduce the issue:

from autogluon.cloud import TabularCloudPredictor
import pandas as pd

# Load datasets
train_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")
test_data.drop(columns=['class'], inplace=True)

# Cloud Predictor Arguments
predictor_init_args = {"label": "class"}  
predictor_fit_args = {"train_data": train_data, "time_limit": 60}  

# Initialize Cloud Predictor and Fit
cloud_predictor = TabularCloudPredictor(cloud_output_path='tonyhu-autogluon')
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)

# Batch Inference with small max_payload to force multiple partitions
result = cloud_predictor.predict(test_data, backend_kwargs={"transformer_kwargs": {"max_payload": 1}})

Expected Behavior: The batch transform job should handle multiple partitions correctly, aligning columns across the partitions and ignoring or managing headers if present in individual partitions.

Observed Behavior: The job fails with the following error logs:

Bad HTTP status received from algorithm: 500
invalid literal for int() with base 10: '0.1': Error while type casting for column 'capital-loss'

Logs show that the columns are misaligned for certain partitions:

test_columns: [' 11th', ' Machine-op-inspct', ' Male', ' Never-married', ' Other-relative', ' Private', ' United-States', ' White', '0', '0.1', '207443', '50', '62', '7']
2024-09-13T21:56:19,062 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - train_columns: ['age', 'capital-gain', 'capital-loss', 'education', 'education-num', 'fnlwgt', 'hours-per-week', 'marital-status', 'native-country', 'occupation', 'race', 'relationship', 'sex', 'workclass']
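The odd names in test_columns (including '0.1') are consistent with pandas reading a header-less partition and promoting its first data row to column names. The sketch below reproduces that locally. It is only an illustration: the row is rebuilt from the values in the log above, its column order is an assumption, and the actual parsing code inside the inference container is not shown in this PR description.

```python
import io

import pandas as pd

# One data row rebuilt from the values in the logged test_columns.
# The column order is assumed for illustration.
row = (
    "62, Private,207443, 11th,7, Never-married, Machine-op-inspct,"
    " Other-relative, White, Male,0,0,50, United-States"
)

# A partition that starts with data instead of a header line.
# The same row is repeated so that no extra data has to be invented.
partition = row + "\n" + row + "\n"

# With the default header inference, the first line becomes the column names.
# The second '0' is renamed to '0.1' because pandas de-duplicates column names.
misread = pd.read_csv(io.StringIO(partition))

print(sorted(misread.columns))
print(len(misread))  # one row of the partition was consumed as the header
```

Sorting the resulting names gives the same list as the logged test_columns, which suggests the failing partitions were parsed with header inference switched on.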

Environment:

Additional Information: The issue seems to be that AutoGluon Cloud does not handle headers properly when a batch transform job splits the input into partitioned records. In a multi-partition job, not every batch contains the header row. A batch without one has its first data row interpreted as column names, which causes the column misalignment seen in the logs.
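One way to make partition parsing independent of whether a header is present is to compare the first row against the known training feature names and always assign those names explicitly. The following is a minimal sketch of that idea, not the code from this PR: `read_partition` is a hypothetical helper, and the three-column feature list is a shortened stand-in for the real schema.

```python
import csv
import io
from typing import Sequence

import pandas as pd


def read_partition(payload: str, original_features: Sequence[str]) -> pd.DataFrame:
    """Parse one CSV partition, with or without a header row (hypothetical helper)."""
    names = list(original_features)
    # Inspect only the first row to decide whether it is a header.
    first_row = next(csv.reader(io.StringIO(payload)), [])
    has_header = [cell.strip() for cell in first_row] == names
    # header=0 discards the existing header row; header=None treats every row as data.
    # In both cases the known feature names are applied, so columns stay aligned.
    return pd.read_csv(
        io.StringIO(payload),
        header=0 if has_header else None,
        names=names,
    )


# Shortened, illustrative schema and payloads.
features = ["age", "workclass", "hours-per-week"]
with_header = "age,workclass,hours-per-week\n62, Private,50\n"
without_header = "62, Private,50\n"

a = read_partition(with_header, features)
b = read_partition(without_header, features)
print(a.equals(b))
```

This approach assumes the original feature names and their order are available at inference time, and that no genuine data row is identical to the header.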

Note: This fix currently only works for the tabular predictor. Support for multimodal and timeseries predictors depends on the implementation of original_features, which can be tracked in issue #4477.