AI4HealthUOL / SSSD

Repository for the paper: 'Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models'
MIT License

Potential issue with data splitting for Solar dataset #8

Closed: abdulfatir closed this issue 1 year ago

abdulfatir commented 1 year ago

Hi!

Thank you for releasing your code!

I have some questions regarding data splitting for the solar dataset as implemented in this notebook. Here's the relevant portion of the code with some comments of mine.

import numpy as np
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.multivariate_grouper import MultivariateGrouper

dataset = get_dataset("solar-energy", regenerate=False)

train_grouper = MultivariateGrouper(
    max_target_dim=min(
        2000, int(dataset.metadata.feat_static_cat[0].cardinality)
    )
)

test_grouper = MultivariateGrouper(
    num_test_dates=int(len(dataset.test) / len(dataset.train)),
    max_target_dim=min(
        2000, int(dataset.metadata.feat_static_cat[0].cardinality)
    ),
)

train = [i for i in train_grouper(dataset.train)]
test = [i for i in test_grouper(dataset.test)]
train = train[0]["target"].transpose(1, 0)
test = test[0]["target"].transpose(1, 0)
print(train.shape, test.shape) # <-- prints (7009, 137) (7033, 137)

# Note that the first 7009 timesteps in test are the same as train
print(np.allclose(train, test[:7009])) # <-- prints True
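
# Aside (not in the original notebook): the 24 extra timesteps in `test`
# are exactly one forecast window, i.e. `test` is `train` extended by
# the dataset's prediction horizon.
print(test.shape[0] - train.shape[0])        # <-- prints 24
print(dataset.metadata.prediction_length)    # <-- prints 24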

# Concatenate data along the "time" dimension.
data = np.concatenate([train, test], axis=0)
print(data.shape) # <-- prints (14042, 137)

# Truncate the data along the time and feature dimensions
# 14016 = 7009 + 7007
# Why ignore the first 9 dimensions?
data = data[0:14016, 9:]
print(data.shape) # <-- prints (14016, 128)

# Split into 73 chunks and take the first 65 for training
# Thus, the training data goes upto 65 * 192 = 12480
# However, these 12480 timesteps actually include the timesteps
# that are supposedly "reserved" for testing in "test"
data = np.split(data, 73, 0)
data = np.array(data)
train = data[0:65]
test = data[65:]

Based on my comments in the code above, I have the following questions:

1. Since the first 7009 timesteps of test are identical to train, doesn't concatenating the two along the time axis duplicate the training data inside test?
2. Why are the first 9 channels dropped when truncating (data[0:14016, 9:])?
3. The first 65 chunks used for training cover timesteps up to 65 * 192 = 12480, which runs well past the 7009 timesteps of train. Doesn't this mean the training set includes timesteps that are supposedly reserved for testing, i.e. data leakage?

Looking forward to your clarifications.

abdulfatir commented 1 year ago

The following code checks whether test is actually contained in train:

train = train.reshape(-1, 128)
test = test.reshape(-1, 128)

# 5471 = 65 * 192 - 7009
np.allclose(train[5471 : 5471 + len(test)], test) # <--- True
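# So the ENTIRE held-out split (8 chunks * 192 = 1536 timesteps)
# already appears verbatim inside the training data.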

juanlopezcode commented 1 year ago

Hello @abdulfatir, thank you for your observation. Unfortunately, you are right: I wasn't aware of the data leakage coming from the respective 'train' and 'test' groupers. I followed these GluonTS implementations, which led me to believe the splitting was correct: https://github.com/jeffrey82221/gluonts_fund_price_forecast/blob/fed7c484c4dba663201f9cf96aa86ca98119b54c/reference/pytorch_ts_examples/multivariate.py and https://github.com/yantijin/ScoreGradPred/tree/8fdce6349f0d34c651837b6b5e3c4024f223f360.

I retrained with a new data pre-processing pipeline that uses data from only one grouper, and also added the extra channels to match the channel split I had left out for evaluation, so all channels are now used for training (I will update the Jupyter notebook in this repo soon). We achieved 5.03e2 ± 1.06e1 after 200,000 iterations, which fortunately is still significant. Again, thank you for your observation and apologies for the inconvenience.
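
For reference, a minimal sketch of what a single-grouper split can look like (the 192-step window matches the notebook; the 90/10 split and variable names are illustrative assumptions, not the exact updated pre-processing):

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.multivariate_grouper import MultivariateGrouper

dataset = get_dataset("solar-energy", regenerate=False)
grouper = MultivariateGrouper(
    max_target_dim=int(dataset.metadata.feat_static_cat[0].cardinality)
)

# Build the full multivariate series from ONE grouper only.
full = list(grouper(dataset.train))[0]["target"].transpose(1, 0)  # (7009, 137)

# Chunk into non-overlapping 192-step windows and split chronologically,
# so no test timestep can appear in the training set.
window = 192
n_windows = full.shape[0] // window                 # 36 windows
chunks = full[: n_windows * window].reshape(n_windows, window, -1)
split = int(0.9 * n_windows)                        # illustrative 90/10 split
train_clean, test_clean = chunks[:split], chunks[split:]
print(train_clean.shape, test_clean.shape)          # (32, 192, 137) (4, 192, 137)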

abdulfatir commented 1 year ago

Thanks for the update, @juanlopezcode! Just to clarify, the repos you referred to appear to be doing things correctly. Anyway, it's good to see the new results; thank you for sharing them. I would also suggest updating the paper accordingly and verifying whether similar data-splitting issues exist in the other experiments. Cheers!
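
On verifying the other experiments, a generic check along these lines can flag similar leakage (a sketch; the function name and tolerance are assumptions, not code from this repo):

import numpy as np

def test_leaks_into_train(train, test, window=192, tol=1e-8):
    # Return True if the first `window` timesteps of `test` appear
    # verbatim anywhere inside `train`; both arrays are shaped (T, C).
    probe = test[:window]
    for start in range(train.shape[0] - window + 1):
        if np.allclose(train[start:start + window], probe, atol=tol):
            return True
    return False

# e.g., on the notebook's original split (arrays reshaped to (T, C)):
# test_leaks_into_train(train.reshape(-1, 128), test.reshape(-1, 128))  # True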