crflynn / skranger

scikit-learn compatible Python bindings for ranger C++ random forest library
https://skranger.readthedocs.io/en/stable/
GNU General Public License v3.0

Strange behaviour in `predict_quantiles` #54

Closed r3v1 closed 3 years ago

r3v1 commented 3 years ago

Hi, I've been testing the Ranger Forest Regressor, and I noticed strange behaviour when predicting quantiles: it outputs integer-like values, or values rounded to .5 (19., 2., 3.5), rather than the expected floats (19.65165, etc.).

To reproduce this:

import numpy as np
from skranger.ensemble import RangerForestRegressor

x_train = np.array([[1.88433, 1.68713, 1.64588, 1.97248, 2.58555],
       [1.66566, 1.61085, 1.58335, 1.83385, 2.39287],
       [1.44698, 1.53457, 1.52082, 1.69523, 2.20019],
       [1.22831, 1.45828, 1.45828, 1.5566 , 2.00751],
       [1.28143, 1.53984, 1.58109, 1.68091, 2.05723]])

y_train = np.array([11.868,  9.312, 19.222, 28.563, 25.402])

params = {
'n_estimators': 2500,
 'n_jobs': -1,
 'quantiles': True,
 'alpha': 1,
 'always_split_features': None,
 'categorical_features': None,
 'holdout': False,
 'importance': 'none',
 'inbag': None,
 'keep_inbag': False,
 'local_importance': False,
 'max_depth': 0,
 'min_node_size': 10,
 'minprop': 0.1,
 'mtry': 0,
 'num_random_splits': 1,
 'oob_error': False,
 'regularization_factor': [1],
 'regularization_usedepth': False,
 'replace': True,
 'respect_categorical_features': None,
 'sample_fraction': None,
 'save_memory': False,
 'scale_permutation_importance': False,
 'seed': 42,
 'split_rule': 'variance',
 'split_select_weights': None,
 'verbose': False
}

rfr = RangerForestRegressor(**params)
rfr.fit(x_train, y_train)

Then, I try to predict with quantiles:

rfr.predict_quantiles(x_pred)
# array([[ 9.,  9.,  9.,  9.,  9.],
#        [19., 19., 19., 19., 19.],
#        [28., 28., 28., 28., 28.]])

Without quantiles:

rfr.predict(x_pred)
# array([18.78459336, 18.78459336, 18.78459336, 18.78459336, 18.78459336])

The real dataset contains thousands of instances, so this is not caused by the small size of the example above.

Thanks!

crflynn commented 3 years ago

I ported the code from R here: https://github.com/imbs-hl/ranger/blob/e8b05f47892bb4968c4e6057f68b35bcd0b52225/R/ranger.R#L972, and I think it's just a mistaken cast to int in Python here: https://github.com/crflynn/skranger/blob/aa2b5540b0b386321610ba10a449d11281a60e2e/skranger/ensemble/ranger_forest_regressor.py#L229

I noticed this too originally but for some reason didn't think twice about it. I think we can just remove the astype and it should be corrected.

r3v1 commented 3 years ago

I also tried removing `astype` as you suggested, but it raises `IndexError: arrays used as indices must be of integer (or boolean) type`
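
For anyone following along, the error above is what NumPy raises whenever an array of floats is used as an index. The sketch below reproduces it in isolation; the array names are hypothetical and are not taken from the skranger source. It only assumes that the `astype` call in question produces arrays that are later used for indexing, which is what the error message suggests.

```python
import numpy as np

# Hypothetical stand-ins: a value array and index positions stored as floats.
values = np.array([10.0, 20.0, 30.0])
float_idx = np.array([0.0, 2.0])

# Indexing with a float array fails, even when every entry is a whole number.
try:
    values[float_idx]
    message = None
except IndexError as exc:
    message = str(exc)

# Casting the index array to int is therefore required at the indexing step.
picked = values[float_idx.astype(int)]

print(message)
print(picked)
```

So a cast to int on an index array is legitimate, and the loss of precision has to come from somewhere else.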

crflynn commented 3 years ago

Right, that's not it, actually. I need to take a closer look.

crflynn commented 3 years ago

It's actually here where it creates the array: https://github.com/crflynn/skranger/blob/aa2b5540b0b386321610ba10a449d11281a60e2e/skranger/ensemble/ranger_forest_regressor.py#L311

It's being created as an integer array, so the subsequent steps are doing int coercion leading to the strange quantile results.

Try changing this to:

        node_values = 0.0 * terminal_nodes
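
A minimal sketch of why the dtype matters, assuming the array was previously built with an integer-preserving expression such as `0 * terminal_nodes` (an assumption; the original line is not quoted in this thread). The node IDs and target values below are made up for illustration.

```python
import numpy as np

# Hypothetical integer terminal-node IDs, shape (n_samples, n_trees).
terminal_nodes = np.array([[3, 5], [4, 6]])

int_values = 0 * terminal_nodes      # keeps the integer dtype
float_values = 0.0 * terminal_nodes  # promotes to float64

# Writing a float into an integer array silently truncates it.
int_values[0, 0] = 19.65165
float_values[0, 0] = 19.65165

print(int_values.dtype, int_values[0, 0])      # integer dtype, value 19
print(float_values.dtype, float_values[0, 0])  # float64, value 19.65165

# Quantiles of truncated values can still land on .5 through interpolation,
# which matches the 3.5-style outputs described in the report.
midpoint = np.quantile(np.array([19, 20]), 0.5)
print(midpoint)  # 19.5
```

Multiplying by `0.0` is a compact way to get a float array of the same shape; `np.zeros(terminal_nodes.shape, dtype=float)` would be an equivalent, more explicit spelling.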

r3v1 commented 3 years ago

That's it! Thanks