Sohrabbeig opened 9 months ago
The paper says that "we use the min/max value of training sets to normalize the validation/test sets". However, when I look at the code in data_loader.py, the full data is being used for the normalization:
I guess it should have been like the following instead:
training_end = int(len(data) * self.train_ratio)
mms.fit(data[:training_end])
data = mms.transform(data)
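For reference, a minimal self-contained version of that fix (the synthetic data, feature count, and train_ratio value here are placeholders for illustration, not taken from the repo):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(1000, 8)   # placeholder for the loaded dataset
train_ratio = 0.7                # assumed split ratio

training_end = int(len(data) * train_ratio)
mms = MinMaxScaler(feature_range=(0, 1))
mms.fit(data[:training_end])     # fit the scaler on the training slice only
data = mms.transform(data)       # scale train/val/test with training statistics
```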
Thanks for the correction, we've revised it.
When I look at data_loader.py, the Dataset_Wiki and Dataset_Solar classes should use self.data instead of data, so that the fix matches the original code.

Original:

    self.data = mms.fit_transform(self.data)

Fixed:

    if type == '1':
        mms = MinMaxScaler(feature_range=(0, 1))
        training_end = int(len(self.data) * self.train_ratio)
        mms.fit(self.data[:training_end])
        self.data = mms.transform(self.data)
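For illustration, a minimal sketch of where that fixed block would live. The class below is a hypothetical stand-in for Dataset_Wiki / Dataset_Solar, with the type argument renamed to scale_type to avoid shadowing the builtin; everything outside the normalization lines is assumed:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for Dataset_Wiki / Dataset_Solar
class DatasetSketch:
    def __init__(self, data, train_ratio=0.7, scale_type='1'):
        self.data = data
        self.train_ratio = train_ratio
        if scale_type == '1':
            mms = MinMaxScaler(feature_range=(0, 1))
            training_end = int(len(self.data) * self.train_ratio)
            mms.fit(self.data[:training_end])     # fit on training rows only
            self.data = mms.transform(self.data)  # transform the full series

ds = DatasetSketch(np.random.rand(500, 4))
```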
I've been reviewing the data preprocessing steps in data/data_loader.py and noticed that the entire dataset undergoes fitting and transformation before being split into training, validation, and test sets. This might lead to data leakage, where information from the test and validation sets inadvertently influences the training process. Is this approach an intentional part of the model's design for a specific reason that I might have missed, or could it be an oversight?
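To make the leakage concern concrete, here is a small illustrative demo (synthetic numbers, not from this repo): when the scaler is fit on the full series, extremes that only appear in the test set shift the scaling of the training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.uniform(0, 10, size=(700, 1))   # training portion
test = rng.uniform(0, 20, size=(300, 1))    # test portion with larger extremes
full = np.vstack([train, test])

leaky = MinMaxScaler().fit(full)    # sees the test-set maximum (~20)
clean = MinMaxScaler().fit(train)   # sees only the training maximum (~10)

print(leaky.data_max_, clean.data_max_)
print(leaky.transform(train).max(), clean.transform(train).max())
# Under the leaky scaler the training data no longer spans [0, 1]:
# the test-set maximum has leaked into the normalization statistics.
```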