Describe the bug

When finetuning chronos-t5-small on the ETTh1 and ETTh2 datasets (each separately), performance drops compared to the zeroshot performance. Could this be because prediction_length is recommended to be <= 64?

Expected behavior

If chronos-t5-small is finetuned on, say, the ETTh1 dataset only, the finetuned model should yield better (lower) MAE and MSE on that dataset than the zeroshot model.

How to reproduce

This example is focused on the ETTh1 dataset. For the ETTh2 dataset, the procedure is identical. Please note that the finetuning evaluation for my experiments is done individually for both datasets, so the model is not finetuned and evaluated on both datasets at once.

Standardize ETTh dataset and convert it into arrow format

from pathlib import Path
from typing import List, Union

import numpy as np
import pandas as pd
from gluonts.dataset.arrow import ArrowWriter
from sklearn.preprocessing import StandardScaler


def convert_to_arrow(
    path: Union[str, Path],
    time_series: Union[List[np.ndarray], np.ndarray],
    compression: str = "lz4",
):
    """
    Store a given set of series into Arrow format at the specified path.

    Input data can be either a list of 1D numpy arrays, or a single 2D
    numpy array of shape (num_series, time_length).
    """
    assert isinstance(time_series, list) or (
        isinstance(time_series, np.ndarray) and
        time_series.ndim == 2
    )

    # Set an arbitrary start time
    start = np.datetime64("2016-07-01 00:00:00", "s")

    dataset = [
        {"start": start, "target": ts} for ts in time_series
    ]

    ArrowWriter(compression=compression).write_to_file(
        dataset,
        path=path,
    )


if __name__ == "__main__":
    # Load and preprocess the dataset
    df = pd.read_csv('/path/to/dataset')

    # Keep only the first 12194 rows (copy to avoid modifying a slice view)
    df = df[0:12194].copy()

    # Ensure time column is in datetime format
    time_column = 'date'
    df[time_column] = pd.to_datetime(df[time_column])

    # Define feature columns
    feature_columns = ['HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT']

    # Standardize the feature columns
    scaler = StandardScaler()
    df[feature_columns] = scaler.fit_transform(df[feature_columns])
    df['id'] = 0

    # Create the structured DataFrame
    structured_df = df[['id', time_column] + feature_columns].rename(columns={time_column: 'timestamp'})

    # Extract the time series (one univariate series per feature column) and start times
    time_series = [structured_df[col].to_numpy() for col in feature_columns]
    # Note: start_times is not passed on; convert_to_arrow sets its own arbitrary start
    start_times = [np.datetime64(structured_df['timestamp'].iloc[0], 's')] * len(feature_columns)

    # Write the series to Arrow; output path assumed from the training config
    convert_to_arrow("path/to/ETTh1_train.arrow", time_series=time_series)
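The scaler above is fitted on the first 12194 rows only. A minimal sketch of what that implies, assuming the evaluation data should be scaled with those same statistics (the issue does not say how `ETTh1.arrow` was standardized). `fit_stats` and `apply_stats` are hypothetical helpers using only the standard library; they use the population standard deviation, as `StandardScaler` does.

```python
from statistics import fmean, pstdev


def fit_stats(train_values):
    """Return (mean, std) computed on the training rows only."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # population std (ddof=0)
    # Guard against constant columns, which would otherwise divide by zero
    return mean, (std if std > 0 else 1.0)


def apply_stats(values, mean, std):
    """Scale any part of the series with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Toy series: the first 4 points play the role of the training rows
series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
n_train = 4

mean, std = fit_stats(series[:n_train])
scaled = apply_stats(series, mean, std)  # later rows reuse the training statistics
```

If the evaluation file were standardized with different statistics (for example, fitted on the full series), the finetuned model would see a shifted scale at test time, so this may be worth ruling out.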

Finetune the model

Use the training pipeline implemented in chronos-forecasting/scripts/training/train.py, as shown in the tutorial, with the following config chronos-t5-small.yaml:


training_data_paths:
- "path/to/ETTh1_train.arrow"
probability:
- 1.0
context_length: 512
prediction_length: 96
min_past: 60
max_steps: 200_000
save_steps: 20_000
log_steps: 500
per_device_train_batch_size: 16
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 1
shuffle_buffer_length: 50_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/etth1/
tf32: true
torch_compile: false
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true
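One thing that may be worth double-checking in this config: going by the option names, `random_init: true` together with `model_id: google/t5-efficient-small` would start from randomly initialized weights rather than from the pretrained Chronos checkpoint. The sketch below is only a sanity check on a config dict. `init_mode` and `samples_per_device` are hypothetical helpers, not part of the Chronos scripts, and the alternative values named in the comments are an assumption about what a finetuning setup would use.

```python
def init_mode(config: dict) -> str:
    """Classify a training config by how the weights are initialized."""
    return "from_scratch" if bool(config["random_init"]) else "finetune"


def samples_per_device(config: dict) -> int:
    """Number of training windows one device sees over the whole run."""
    return (
        config["max_steps"]
        * config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
    )


# Values copied from the config above
config = {
    "model_id": "google/t5-efficient-small",
    "random_init": True,
    "max_steps": 200_000,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
}

mode = init_mode(config)
seen = samples_per_device(config)

# Assumed finetuning variant (not taken from the issue):
#   model_id: amazon/chronos-t5-small, random_init: false
finetune_variant = {**config, "model_id": "amazon/chronos-t5-small", "random_init": False}
```

With the values above, `mode` is `"from_scratch"` and `seen` is 3,200,000 windows per device, which is a lot of passes over a single dataset of 7 series with 12194 points each.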

Evaluate the finetuned model on the final checkpoint with the evaluation pipeline implemented in chronos-forecasting/scripts/evaluation/evaluate.py, with the following modifications to the load_and_split_dataset() function:

def load_and_split_dataset(backtest_config: dict):
    hf_repo = backtest_config["hf_repo"]
    dataset_name = backtest_config["name"]
    offset = backtest_config["offset"]
    prediction_length = backtest_config["prediction_length"]
    num_rolls = backtest_config["num_rolls"]

    # Load the local Arrow file instead of fetching the dataset from hf_repo
    ds = Dataset.from_file("/path/to/ETTh1.arrow")

    ds.set_format("numpy")

    gts_dataset = to_gluonts_univariate(ds)

    # Split dataset for evaluation
    _, test_template = split(gts_dataset, offset=offset)
    test_data = test_template.generate_instances(prediction_length, windows=num_rolls, distance=1)

    return test_data

and the following metrics:

metrics = (
    evaluate_forecasts(
        sample_forecasts,
        test_data=test_data,
        metrics=[
            MAE(),
            MSE(),
        ],
        batch_size=5000,
    )
    .reset_index(drop=True)
    .to_dict(orient="records")
)
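For reference, a minimal sketch of what the two reported columns measure, assuming `MAE[0.5]` is the absolute error of the per-step median of the sample paths and `MSE[mean]` is the squared error of their per-step mean (this is my reading of the metric names, not a reimplementation of the gluonts code). All function names here are hypothetical.

```python
from statistics import fmean, median


def per_step(samples, reducer):
    """Reduce sample paths (num_samples x horizon) to one value per time step."""
    horizon = len(samples[0])
    return [reducer([path[t] for path in samples]) for t in range(horizon)]


def mae_of_median(samples, target):
    """Mean absolute error between the median forecast and the target."""
    point = per_step(samples, median)
    return fmean(abs(p - y) for p, y in zip(point, target))


def mse_of_mean(samples, target):
    """Mean squared error between the mean forecast and the target."""
    point = per_step(samples, fmean)
    return fmean((p - y) ** 2 for p, y in zip(point, target))


# Three sample paths over a horizon of two steps
samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
target = [2.0, 5.0]

mae = mae_of_median(samples, target)
mse = mse_of_mean(samples, target)
```

Since the data are standardized, both numbers are in units of the standardized series, so they are only comparable across runs that use the same scaling.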

The evaluation is performed on the test section of the standardized ETTh1 dataset, hence the offset. For the evaluation pipeline, use the following config:

- name: ETTh
  hf_repo: autogluon/chronos_datasets_extra
  offset: -1742
  prediction_length: 96
  num_rolls: 1135
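To make the evaluation windows concrete, here is a small sketch of which positions get forecast, assuming (this is my reading of the gluonts split behaviour, not something verified here) that window i starts its forecast at offset + i * distance. `rolling_windows` is a hypothetical helper; positions are relative to the end of the series.

```python
def rolling_windows(offset, prediction_length, num_rolls, distance=1):
    """Return (label_start, label_end) per window, relative to the series end."""
    windows = []
    for i in range(num_rolls):
        start = offset + i * distance
        windows.append((start, start + prediction_length))
    return windows


# Values from the evaluation config above; distance=1 as in load_and_split_dataset()
windows = rolling_windows(offset=-1742, prediction_length=96, num_rolls=1135)
first, last = windows[0], windows[-1]
```

Under this assumption the first window forecasts positions -1742 to -1646 and the last one forecasts -608 to -512, so the final 512 points of the series would never be part of a forecast target.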

Compare the evaluation of the finetuned model with the default model chronos-t5-small (which has not been trained on the ETTh1 dataset).

I get the following results:

Zeroshot ETTh1
dataset,model,MAE[0.5],MSE[mean]
ETTh,amazon/chronos-t5-small,0.5081954018184918,0.560689815315581

Zeroshot ETTh2
dataset,model,MAE[0.5],MSE[mean]
ETTh,amazon/chronos-t5-small,0.2625630043626757,0.1391419442914831

Finetuned ETTh1
dataset,model,MAE[0.5],MSE[mean]
ETTh,/path/to/checkpoint-final,0.7746078180721628,1.1865953634689008

Finetuned ETTh2
dataset,model,MAE[0.5],MSE[mean]
ETTh,/path/to/checkpoint-final,0.35415831866543424,0.2516080298962922

As you can see, MAE and MSE are worse for the finetuned checkpoint than for the default model. That shouldn't be the case.
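To quantify the gap, the relative change of each metric can be computed from the numbers above. A short sketch; `relative_change` is just an illustrative helper.

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old` (positive means worse here)."""
    return (new - old) / old * 100.0


# (MAE[0.5], MSE[mean]) copied from the results above
results = {
    "ETTh1": {
        "zeroshot": (0.5081954018184918, 0.560689815315581),
        "finetuned": (0.7746078180721628, 1.1865953634689008),
    },
    "ETTh2": {
        "zeroshot": (0.2625630043626757, 0.1391419442914831),
        "finetuned": (0.35415831866543424, 0.2516080298962922),
    },
}

changes = {}
for name, runs in results.items():
    mae_change = relative_change(runs["finetuned"][0], runs["zeroshot"][0])
    mse_change = relative_change(runs["finetuned"][1], runs["zeroshot"][1])
    changes[name] = (mae_change, mse_change)
    print(f"{name}: MAE {mae_change:+.1f}%, MSE {mse_change:+.1f}%")
```

This works out to roughly +52% MAE and +112% MSE on ETTh1, and roughly +35% MAE and +81% MSE on ETTh2.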

Environment description

Operating system: Ubuntu 22.04.4 LTS
Python version: 3.10.14
CUDA version: 12.4
PyTorch version: 2.4.0
HuggingFace transformers version: 4.44.2
HuggingFace accelerate version: 0.33.0

amazon-science / chronos-forecasting

Finetuning Chronos on ETTh datasets yields poor performance #209
