Y-debug-sys / Diffusion-TS

[ICLR 2024] Official Implementation of "Diffusion-TS: Interpretable Diffusion for General Time Series Generation"

Data leakage train test split #67

Closed mspils closed 3 weeks ago

mspils commented 3 weeks ago

Hi, I think you have data leakage in your divide function in real_datasets.py. You first create sliding windows and then shuffle them, but because the windows overlap, your model has information about all the samples in the test set. Example:

data = [0,1,2,3,4,5,6,7,8,9,10,11]
window_size = 3
ratio = 0.7

train_data = [
  [0,1,2],
  [6,7,8],
  [4,5,6],
  [5,6,7],
  [9,10,11],
  [1,2,3],
  [8,9,10],
]
test_data = [
  [7,8,9],
  [3,4,5],
  [2,3,4],
]

As you can see, each sample in test_data already appears, at least partially, in train_data.

In a time-series setting you would usually split at a specific point in time and train only on data from before that point, e.g. something like the sketch below.
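For concreteness, a minimal sketch of such a chronological split (the function name is illustrative, not the actual code in real_datasets.py):

import numpy as np

def chronological_split(data, window_size, ratio):
    # Split the raw series at one point in time, then build sliding
    # windows inside each segment, so no window straddles the boundary.
    cut = int(ratio * len(data))
    train_raw, test_raw = data[:cut], data[cut:]

    def make_windows(x):
        return np.array([x[i:i + window_size]
                         for i in range(len(x) - window_size + 1)])

    return make_windows(train_raw), make_windows(test_raw)

With the toy data above (ratio = 0.7), the training windows would cover values 0..7 and the test windows 8..11, with no overlap between the two sets.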

Is there a reason not to do that here?

Y-debug-sys commented 3 weeks ago

You are correct, but the data in our paper was not shuffled (the first 90% was used for training, the last 10% for testing), and all comparison methods used the same preprocessing. Of course, you can choose to disable shuffling (we provide that option, though it was not the default). Additionally, this was only done for certain datasets (Energy and ETTh); for MUJOCO and Solar, please refer to the Jupyter notebook file.
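If I read this correctly, the split then behaves roughly like the following (a hypothetical sketch with an explicit shuffle flag, not the actual divide in real_datasets.py):

import numpy as np

def divide_sketch(windows, ratio=0.9, shuffle=False, seed=123):
    # windows: np.ndarray of shape (num_windows, window_size, ...).
    # Unshuffled, the first 90% of windows train and the last 10% test;
    # shuffle=True reproduces the leaky behaviour discussed above.
    idx = np.arange(len(windows))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    cut = int(ratio * len(idx))
    return windows[idx[:cut]], windows[idx[cut:]]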

Y-debug-sys commented 3 weeks ago

Now I have changed the default setting, so it's fixed.

mspils commented 3 weeks ago

Thanks!

lixus7 commented 3 weeks ago

Hello, thank you for your continued response and support.

I have similar questions. Your training stage uses the entire dataset: for example, ETTh yields 17397 sliding-window samples (the raw ETTh file has 17420 rows), and the ratio is set to 1 in etth.yaml, which means all of the data is used for training. In the time-series forecasting domain (e.g. STGCN for traffic forecasting, Informer for long-term forecasting), even with unsupervised learning, overlap between training and testing data is usually avoided. But here it looks like the training data includes all the samples from the dataset. Is this the same setting used in TimeGAN? And are all time series generation tasks set up like this?

Y-debug-sys commented 3 weeks ago

Hi, unconditional generation does not need a train-test split, but for conditional generation we use train:test = 9:1.

lixus7 commented 2 weeks ago

Thank you for your reply.

I am reproducing your unconditional generation results on ETTh. Except for the Predictive Score, which is better than in the paper, the other three metrics are worse. For example, for the Context-FID Score I ran multiple seeds and ran metric_pytorch multiple times (each run gave a different result), and the best result was 0.135± (worse than the 0.116 in the paper). Would you mind providing the best parameters for ETTh? I want to reproduce results close to those in the paper, and I would really appreciate it if you could share the best params.

Y-debug-sys commented 2 weeks ago

Hi, lixus7. Thank you for reaching out and for your effort in reproducing the results. For the ETTh dataset (and the others), the parameters you're currently using are indeed the ones we used to obtain the results reported in the paper. We observed similar fluctuations in our own experiments, which could explain the differences. To minimize variability, we ran all baselines on a fixed data split (divide) and with a consistent set of evaluation models (TS2Vec, Classifier, and Predictor), which helped achieve stable comparisons in our experiments.
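Given the run-to-run fluctuation both of you describe, one way to make numbers comparable is to report mean ± std over several seeds instead of a single run; a minimal sketch (metric_fn and sample_fake are placeholders, not names from this repo):

import numpy as np
import torch

def metric_over_seeds(metric_fn, real, sample_fake, seeds=(0, 1, 2, 3, 4)):
    # Evaluate a (possibly stochastic) metric under several seeds and
    # summarize the spread rather than reporting one lucky run.
    scores = []
    for seed in seeds:
        np.random.seed(seed)
        torch.manual_seed(seed)
        scores.append(float(metric_fn(real, sample_fake())))
    scores = np.asarray(scores)
    return scores.mean(), scores.std()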