Sohrabbeig opened 9 months ago
The paper says that "we use the min/max value of training sets to normalize the validation/test sets". However, when I look at the code in data_loader.py, the full data is being used for the normalization:
I guess it should have been like the following instead:
training_end = int(len(data) * self.train_ratio)
mms.fit(data[:training_end])
data = mms.transform(data)
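For reference, a minimal self-contained version of that fix (the synthetic data, feature count, and train_ratio value here are placeholders for illustration, not taken from the repo):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(1000, 8)   # placeholder for the loaded dataset
train_ratio = 0.7                # assumed split ratio

training_end = int(len(data) * train_ratio)
mms = MinMaxScaler(feature_range=(0, 1))
mms.fit(data[:training_end])     # fit the scaler on the training slice only
data = mms.transform(data)       # scale train/val/test with training statistics
```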
Thanks for the correction, we've revised it.
When I look at data_loader.py, the Dataset_Wiki and Dataset_Solar classes should use self.data instead of data, so that the fix matches the original code.

Original:

    self.data = mms.fit_transform(self.data)

Fixed:

    if type == '1':
        mms = MinMaxScaler(feature_range=(0, 1))
        training_end = int(len(self.data) * self.train_ratio)
        mms.fit(self.data[:training_end])
        self.data = mms.transform(self.data)
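For illustration, a minimal sketch of where that fixed block would live. The class below is a hypothetical stand-in for Dataset_Wiki / Dataset_Solar, with the type argument renamed to scale_type to avoid shadowing the builtin; everything outside the normalization lines is assumed:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for Dataset_Wiki / Dataset_Solar
class DatasetSketch:
    def __init__(self, data, train_ratio=0.7, scale_type='1'):
        self.data = data
        self.train_ratio = train_ratio
        if scale_type == '1':
            mms = MinMaxScaler(feature_range=(0, 1))
            training_end = int(len(self.data) * self.train_ratio)
            mms.fit(self.data[:training_end])     # fit on training rows only
            self.data = mms.transform(self.data)  # transform the full series

ds = DatasetSketch(np.random.rand(500, 4))
```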
I've been reviewing the data preprocessing steps in data/data_loader.py and noticed that the entire dataset undergoes fitting and transformation before being split into training, validation, and test sets. This might lead to data leakage, where information from the test and validation sets inadvertently influences the training process. Is this approach an intentional part of the model's design for a specific reason that I might have missed, or could it be an oversight?
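To make the leakage concern concrete, here is a small illustrative demo (synthetic numbers, not from this repo): when the scaler is fit on the full series, extremes that only appear in the test set shift the scaling of the training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.uniform(0, 10, size=(700, 1))   # training portion
test = rng.uniform(0, 20, size=(300, 1))    # test portion with larger extremes
full = np.vstack([train, test])

leaky = MinMaxScaler().fit(full)    # sees the test-set maximum (~20)
clean = MinMaxScaler().fit(train)   # sees only the training maximum (~10)

print(leaky.data_max_, clean.data_max_)
print(leaky.transform(train).max(), clean.transform(train).max())
# Under the leaky scaler the training data no longer spans [0, 1]:
# the test-set maximum has leaked into the normalization statistics.
```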