ML4ITS / mtad-gat-pytorch

PyTorch implementation of MTAD-GAT (Multivariate Time-Series Anomaly Detection via Graph Attention Networks) by Zhao et al. (2020, https://arxiv.org/abs/2009.02040).
MIT License

Why use shuffle with time-series data? #26

Closed cloudhs7 closed 1 year ago

cloudhs7 commented 1 year ago

Hi. Thanks for your wonderful work!

I'm curious why 'shuffle = True' is the default option in the implementation below, since the data is time-series data.

def create_data_loaders(train_dataset, batch_size, val_split=0.1, shuffle=True, test_dataset=None):

Is there any reason to shuffle the time-series data? (And can the GAT still learn time-oriented features from shuffled data?)

srigas commented 1 year ago

It is true that when dealing with time-series data you rarely shuffle, so that the temporal order can be learned by the model. However, this is not always the case when handling time series with a sliding-window approach. In that regime, you don't treat a single timestamp as a data point; rather, you treat w consecutive timestamps as a single data point and use them for your prediction (in this case, forecasting of the next value and reconstruction of the measured values). For this reason, shuffling your data is fine, because you have essentially split a single time series into many smaller ones, each of which preserves its internal temporal order.

On one hand, this feels like introducing some data leakage into your training set (for example, two data points sharing 80% of their timestamps could end up one in the training set and one in the validation set); on the other hand, your model may train faster and sometimes more effectively.
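To make this concrete, here is a minimal sketch (not the repo's actual code; variable names and the toy series are my own) showing that shuffling in the sliding-window regime permutes whole windows while leaving the temporal order inside each window intact, and also showing the overlap between neighbouring windows that underlies the leakage concern:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

w = 5  # sliding-window length (hypothetical choice for illustration)
series = np.arange(20, dtype=np.float32).reshape(-1, 1)  # toy univariate series

# Build overlapping windows: window i covers timestamps [i, i + w)
windows = np.stack([series[i : i + w] for i in range(len(series) - w)])
targets = series[w:]  # forecast target: the value right after each window

# Consecutive windows share w - 1 of their w timestamps (80% here),
# which is the potential train/validation leakage mentioned above.
overlap = np.intersect1d(windows[0], windows[1]).size  # 4 shared timestamps

dataset = TensorDataset(torch.from_numpy(windows), torch.from_numpy(targets))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Shuffling permutes whole windows; the order *inside* each window is
# untouched, so every sample the model sees is still an ordered subsequence.
for batch_windows, batch_targets in loader:
    assert torch.all(batch_windows[:, 1:] >= batch_windows[:, :-1])
```

Each iteration of the loader yields windows in a random order, but the monotonicity check inside the loop passes because shuffling never reorders timestamps within a window.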