AI4HealthUOL / SSSD

Repository for the paper: 'Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models'
MIT License

Potential issue with data splitting for Solar dataset #8

Closed: abdulfatir closed this issue 1 year ago

abdulfatir commented 1 year ago

Hi!

Thank you for releasing your code!

I have some questions regarding data splitting for the solar dataset as implemented in this notebook. Here's the relevant portion of the code with some comments of mine.

import numpy as np
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.multivariate_grouper import MultivariateGrouper

dataset = get_dataset("solar-energy", regenerate=False)

train_grouper = MultivariateGrouper(
    max_target_dim=min(
        2000, int(dataset.metadata.feat_static_cat[0].cardinality)
    )
)

test_grouper = MultivariateGrouper(
    num_test_dates=int(len(dataset.test) / len(dataset.train)),
    max_target_dim=min(
        2000, int(dataset.metadata.feat_static_cat[0].cardinality)
    ),
)

train = [i for i in train_grouper(dataset.train)]
test = [i for i in test_grouper(dataset.test)]
train = train[0]["target"].transpose(1, 0)
test = test[0]["target"].transpose(1, 0)
print(train.shape, test.shape) # <-- prints (7009, 137) (7033, 137)

# Note that the first 7009 timesteps in test are the same as train
print(np.allclose(train, test[:7009])) # <-- prints True
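
# Aside (not in the original notebook): the 24 extra timesteps in `test`
# are exactly one forecast window, i.e. `test` is `train` extended by
# the dataset's prediction horizon.
print(test.shape[0] - train.shape[0])        # <-- prints 24
print(dataset.metadata.prediction_length)    # <-- prints 24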

# Concatenate data along the "time" dimension.
data = np.concatenate([train, test], axis=0)
print(data.shape) # <-- prints (14042, 137)

# Truncate the data along the time and feature dimensions
# 14016 = 7009 + 7007
# Why ignore the first 9 dimensions?
data = data[0:14016, 9:]
print(data.shape) # <-- prints (14016, 128)

# Split into 73 chunks and take the first 65 for training
# Thus, the training data goes upto 65 * 192 = 12480
# However, these 12480 timesteps actually include the timesteps
# that are supposedly "reserved" for testing in "test"
data = np.split(data, 73, 0)
data = np.array(data)
train = data[0:65]
test = data[65:]

Based on my comments in the code above, I have the following questions:

1. Since the first 7009 timesteps of test are identical to train, doesn't concatenating the two along the time axis duplicate the training data inside test?
2. Why are the first 9 channels dropped when truncating (data[0:14016, 9:])?
3. The first 65 chunks used for training cover timesteps up to 65 * 192 = 12480, which runs well past the 7009 timesteps of train. Doesn't this mean the training set includes timesteps that are supposedly reserved for testing, i.e. data leakage?

Looking forward to your clarifications.

abdulfatir commented 1 year ago

The following code checks whether test is actually contained in train:

train = train.reshape(-1, 128)
test = test.reshape(-1, 128)

# 5471 = 65 * 192 - 7009
np.allclose(train[5471 : 5471 + len(test)], test) # <--- True
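# So the ENTIRE held-out split (8 chunks * 192 = 1536 timesteps)
# already appears verbatim inside the training data.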

juanlopezcode commented 1 year ago

Hello @abdulfatir, thank you for your observation. Unfortunately, you are right: I wasn't aware of the data leakage coming from the respective 'train' and 'test' groupers. I followed these GluonTS implementations, which led me to believe the splitting was correct: https://github.com/jeffrey82221/gluonts_fund_price_forecast/blob/fed7c484c4dba663201f9cf96aa86ca98119b54c/reference/pytorch_ts_examples/multivariate.py and https://github.com/yantijin/ScoreGradPred/tree/8fdce6349f0d34c651837b6b5e3c4024f223f360.

I retrained with a new data pre-processing pipeline that uses data from only one grouper, and also added the extra channels to match the channel split I had left out for evaluation, so all channels are now used for training (I will update the Jupyter notebook in this repo soon). We achieved 5.03e2 ± 1.06e1 after 200,000 iterations, which fortunately is still significant. Again, thank you for your observation and apologies for the inconvenience.
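
For reference, a minimal sketch of what a single-grouper split can look like (the 192-step window matches the notebook; the 90/10 split and variable names are illustrative assumptions, not the exact updated pre-processing):

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.multivariate_grouper import MultivariateGrouper

dataset = get_dataset("solar-energy", regenerate=False)
grouper = MultivariateGrouper(
    max_target_dim=int(dataset.metadata.feat_static_cat[0].cardinality)
)

# Build the full multivariate series from ONE grouper only.
full = list(grouper(dataset.train))[0]["target"].transpose(1, 0)  # (7009, 137)

# Chunk into non-overlapping 192-step windows and split chronologically,
# so no test timestep can appear in the training set.
window = 192
n_windows = full.shape[0] // window                 # 36 windows
chunks = full[: n_windows * window].reshape(n_windows, window, -1)
split = int(0.9 * n_windows)                        # illustrative 90/10 split
train_clean, test_clean = chunks[:split], chunks[split:]
print(train_clean.shape, test_clean.shape)          # (32, 192, 137) (4, 192, 137)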

abdulfatir commented 1 year ago

Thanks for the update, @juanlopezcode! Just to clarify, the repos you referred to appear to be doing things correctly. Anyway, it's good to see the new results; thank you for sharing them. I would also suggest updating the paper accordingly and verifying whether similar data-splitting issues exist in the other experiments. Cheers!
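
On verifying the other experiments, a generic check along these lines can flag similar leakage (a sketch; the function name and tolerance are assumptions, not code from this repo):

import numpy as np

def test_leaks_into_train(train, test, window=192, tol=1e-8):
    # Return True if the first `window` timesteps of `test` appear
    # verbatim anywhere inside `train`; both arrays are shaped (T, C).
    probe = test[:window]
    for start in range(train.shape[0] - window + 1):
        if np.allclose(train[start:start + window], probe, atol=tol):
            return True
    return False

# e.g., on the notebook's original split (arrays reshaped to (T, C)):
# test_leaks_into_train(train.reshape(-1, 128), test.reshape(-1, 128))  # True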