cavalab / srbench

A living benchmark framework for symbolic regression
https://cavalab.org/srbench/
GNU General Public License v3.0
216 stars 77 forks

scaling of input data #26

Closed mkommend closed 3 years ago

mkommend commented 3 years ago

I had a quick look at the benchmarking pipeline to better understand how the comparison is performed. During that review I noticed that scaling is always applied while reading the data files, using a RobustScaler from sklearn.

https://github.com/EpistasisLab/srbench/blob/1ad633974c9126a8eb6ce936873d2e9b3d40294c/experiment/read_file.py#L32

The actual model is generated in the evaluate model script, which additionally has the parameters scale_x and scale_y that determine whether the input data X and the target y should be scaled.

https://github.com/EpistasisLab/srbench/blob/1ad633974c9126a8eb6ce936873d2e9b3d40294c/experiment/evaluate_model.py#L46-L61

This means that if scale_x is set to True, the input data is scaled twice when using the benchmarking pipeline. I don't know whether this behavior is intended, but I suspect the RobustScaler is an artifact from previous experimentation and should be removed. As it stands, even if I set scale_x to False, scaling is still performed while reading the data.
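A minimal sketch of the second problem, assuming the unconditional RobustScaler call in read_file.py described above (the toy data here is made up for illustration): even when the caller opts out of scaling, the data the experiment sees is no longer the raw data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(100, 2))  # toy "raw" dataset

# read_file.py applies a RobustScaler unconditionally at load time,
# so even with scale_x=False the evaluation never sees the raw values:
X_loaded = RobustScaler().fit_transform(X_raw)

print(np.allclose(X_loaded, X_raw))  # False: the "unscaled" path is already scaled
```

Since RobustScaler centers on the median and divides by the IQR, any dataset whose median is nonzero or whose IQR differs from 1 is silently transformed, regardless of the scale_x flag.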

lacava commented 3 years ago

that is indeed an artifact, thanks for catching it!