Fix batch transform issue for tabular predictor with multiple partitions

Description:

This PR fixes batch transform jobs that fail with column misalignment when the input CSV file is split into multiple partitions. Headers are not handled consistently across partitions: a partition that arrives without the header row has its first data row interpreted as column names, so its columns no longer match the training columns and inference fails.

Changes:

Added logic to align columns across partitions by ensuring headers are managed correctly.
Introduced _read_with_fallback and _align_columns helper functions to handle column alignment.
Updated transform_fn in tabular_serve.py to use these helper functions.

Limitations:

This fix currently only works for the tabular predictor. Support for multimodal and timeseries predictors depends on the implementation of original_features, which can be tracked in issue #4477.

Steps to Reproduce: The following script can be used to reproduce the issue:

from autogluon.cloud import TabularCloudPredictor
import pandas as pd

# Load datasets
train_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test_data = pd.read_csv("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")
test_data.drop(columns=['class'], inplace=True)

# Cloud Predictor Arguments
predictor_init_args = {"label": "class"}
predictor_fit_args = {"train_data": train_data, "time_limit": 60}

# Initialize Cloud Predictor and Fit
cloud_predictor = TabularCloudPredictor(cloud_output_path='tonyhu-autogluon')
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)

# Batch Inference with small max_payload to force multiple partitions
result = cloud_predictor.predict(test_data, backend_kwargs={"transformer_kwargs": {"max_payload": 1}})

Expected Behavior: The batch transform job should handle multiple partitions correctly, aligning columns across the partitions and ignoring or managing headers if present in individual partitions.

Observed Behavior: The job fails with the following error logs:

Bad HTTP status received from algorithm: 500
invalid literal for int() with base 10: '0.1': Error while type casting for column 'capital-loss'

Logs show that the columns are misaligned for certain partitions:

test_columns: [' 11th', ' Machine-op-inspct', ' Male', ' Never-married', ' Other-relative', ' Private', ' United-States', ' White', '0', '0.1', '207443', '50', '62', '7']
2024-09-13T21:56:19,062 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - train_columns: ['age', 'capital-gain', 'capital-loss', 'education', 'education-num', 'fnlwgt', 'hours-per-week', 'marital-status', 'native-country', 'occupation', 'race', 'relationship', 'sex', 'workclass']
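
The data-like names in test_columns, including the '0' and '0.1' pair, are consistent with pandas promoting the first data row of a headerless partition to a header and renaming the repeated value. The short sketch below reproduces that effect on a made-up two-row partition. The sample values and the four column names are illustrative assumptions, not the actual batch contents.

```python
import io

import pandas as pd

# Hypothetical headerless partition: two data rows, no header line.
partition = "62, Private,0,0\n50, Private,0,40\n"

# Default parsing treats the first data row as the header, so one record is
# lost and the duplicated value '0' is renamed to '0.1'.
misread = pd.read_csv(io.StringIO(partition))
misread_columns = list(misread.columns)
print(misread_columns)  # ['62', ' Private', '0', '0.1']

# Parsing with header=None and explicit names keeps both rows as data.
names = ["age", "workclass", "capital-gain", "capital-loss"]
fixed = pd.read_csv(io.StringIO(partition), header=None, names=names)
print(fixed.shape)  # (2, 4)
```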

Environment:

autogluon==1.1.0
Running batch transform in SageMaker with MultiRecord strategy.
MaxPayloadInMB=1 is set to ensure multiple partitions.

Additional Information: AutoGluon Cloud does not appear to handle headers correctly when batch transform splits the input into partitioned records. In a multi-partition job, not every batch includes the header row, which causes the column misalignment.

autogluon / autogluon-cloud

Fix batch transform issue for tabular predictor with multiple partitions #138