cavalab / srbench

A living benchmark framework for symbolic regression
https://cavalab.org/srbench/
GNU General Public License v3.0
216 stars 77 forks

scaling of input data #26

Closed mkommend closed 3 years ago

mkommend commented 3 years ago

I had a quick look at the benchmarking pipeline to better understand how the comparison is performed. During that review I noticed that scaling is always applied while reading the data files, using a RobustScaler from sklearn.

https://github.com/EpistasisLab/srbench/blob/1ad633974c9126a8eb6ce936873d2e9b3d40294c/experiment/read_file.py#L32

The actual model is generated in the evaluate model script, which additionally has the parameters scale_x and scale_y that determine whether the input data X and the target y should be scaled.

https://github.com/EpistasisLab/srbench/blob/1ad633974c9126a8eb6ce936873d2e9b3d40294c/experiment/evaluate_model.py#L46-L61

This means that if scale_x is set to True, the input data is scaled twice when using the benchmarking pipeline. I don't know whether this behavior is intended, but I suspect the RobustScaler is an artifact from previous experimentation and should be removed. As it stands, even if I set scale_x to False, scaling is still performed while reading the data.
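A minimal sketch of the second problem, assuming the unconditional RobustScaler call in read_file.py described above (the toy data here is made up for illustration): even when the caller opts out of scaling, the data the experiment sees is no longer the raw data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(100, 2))  # toy "raw" dataset

# read_file.py applies a RobustScaler unconditionally at load time,
# so even with scale_x=False the evaluation never sees the raw values:
X_loaded = RobustScaler().fit_transform(X_raw)

print(np.allclose(X_loaded, X_raw))  # False: the "unscaled" path is already scaled
```

Since RobustScaler centers on the median and divides by the IQR, any dataset whose median is nonzero or whose IQR differs from 1 is silently transformed, regardless of the scale_x flag.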

lacava commented 3 years ago

that is indeed an artifact, thanks for catching it!