Seanny123 / da-rnn

Dual-Stage Attention-Based Recurrent Neural Net for Time Series Prediction

Regarding scaling of data #5

Open KarthikaKP opened 5 years ago

KarthikaKP commented 5 years ago

I have seen that `standardscaler.fit(X)` is being used, which scales the entire dataset. But the usual practice is to fit on the training data and apply the same mean (and variance) to the test and validation sets. I am new to this field and don't know how to preprocess time series data. Kindly reply.

Seanny123 commented 5 years ago

You are absolutely correct and this is an embarrassing mistake which should be corrected.
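For reference, the standard pattern is to fit the scaler on the training rows only and then apply those same statistics to every split. A minimal sketch with scikit-learn's `StandardScaler` (the toy data and split index here are illustrative, not from the repo):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy feature matrix

train_size = 70  # illustrative split index
scaler = StandardScaler()
scaler.fit(X[:train_size])                  # learn mean/std from train rows only
X_train = scaler.transform(X[:train_size])
X_test = scaler.transform(X[train_size:])   # reuse the train statistics; no leakage
```

After this, `X_train` has (population) mean 0 and std 1 per column, while `X_test` is standardized with the *training* statistics, which is what the model will see at inference time.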

mikel-brostrom commented 4 years ago

I'll leave this piece of code here in case somebody needs to solve this issue and wants to reuse the output scaler to inverse-transform the predictions:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def preprocess_data(dat, col_names, train_percentage):
    # read dataset. Shape: (40560, 82)
    proc_dat = dat.to_numpy()

    # create one dedicated scaler for the input data
    # and one for the output data
    in_data_scaler = MinMaxScaler()
    out_data_scaler = MinMaxScaler()

    # separate target from features: (40560, 1) | (40560, 81)
    mask = np.ones(proc_dat.shape[1], dtype=bool)
    dat_cols = list(dat.columns)
    for col_name in col_names:
        mask[dat_cols.index(col_name)] = False

    feats = proc_dat[:, mask]
    targs = proc_dat[:, ~mask]

    # fit the scalers on the train set only
    train_size = int(train_percentage * len(dat))
    in_data_scaler.fit(feats[:train_size])
    out_data_scaler.fit(targs[:train_size])

    # transform features and targets for model training
    feats = in_data_scaler.transform(feats)
    targs = out_data_scaler.transform(targs)

    return feats, targs, in_data_scaler, out_data_scaler
```
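To show how the returned output scaler is reused, here is a small self-contained example of mapping scaled predictions back to original units with `MinMaxScaler.inverse_transform` (the toy target values and the "predictions" are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

targs = np.array([[10.0], [20.0], [30.0], [40.0]])  # toy target column

out_data_scaler = MinMaxScaler()
out_data_scaler.fit(targs[:3])            # fit on the "train" rows only
scaled = out_data_scaler.transform(targs)

# a model would predict in scaled space; these rows stand in for its output
preds_scaled = scaled[:3]
preds = out_data_scaler.inverse_transform(preds_scaled)
print(preds.ravel())  # → [10. 20. 30.]
```

Note that because the scaler was fit on the training rows only, values outside the training range (like the last row here) transform to values above 1, which is expected and harmless for inverse-transforming predictions.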