crflynn / skgrf

scikit-learn-compatible Python bindings for the grf (generalized random forests) C++ random forest library
https://skgrf.readthedocs.io/en/stable/
GNU General Public License v3.0

Allow missing values in X? #77

Closed (erikcs closed 2 years ago)

erikcs commented 3 years ago

All grf forests (except local linear) support splitting with missing X values (IEEE NaNs); see https://grf-labs.github.io/grf/REFERENCE.html#missing-values.

Ideally this should require only a light change in the wrappers when doing fit/predict: calling sklearn.utils.check_X_y/sklearn.utils.check_array instead of _validate_input and passing force_all_finite='allow-nan' for X (though we'd have to make sure the other wrapper logic still works). I can send a PR later if there's interest.
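A rough sketch of that validation change, in case it helps the discussion. The helper names validate_fit_inputs and validate_predict_input are hypothetical and not part of skgrf. The sketch also assumes that newer scikit-learn releases (1.6 and later) spell the parameter ensure_all_finite instead of force_all_finite, so it picks whichever name the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.utils import check_array, check_X_y

# Assumption: scikit-learn 1.6+ names the parameter ensure_all_finite,
# older releases name it force_all_finite. Use whichever is available.
_params = inspect.signature(check_array).parameters
NAN_KW = "ensure_all_finite" if "ensure_all_finite" in _params else "force_all_finite"


def validate_fit_inputs(X, y):
    """Hypothetical fit-time check: NaN is allowed in X, inf is not."""
    return check_X_y(X, y, **{NAN_KW: "allow-nan"})


def validate_predict_input(X):
    """Hypothetical predict-time check with the same NaN policy for X."""
    return check_array(X, **{NAN_KW: "allow-nan"})


# Feature matrix with two missing entries encoded as IEEE NaN.
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
y = np.array([0.5, 1.5, 2.5])

X_checked, y_checked = validate_fit_inputs(X, y)
X_new_checked = validate_predict_input(X)
```

With 'allow-nan', the validated arrays keep their NaN entries, so they could be handed to the grf bindings unchanged.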

crflynn commented 3 years ago

I believe we can achieve this by passing that same param into _validate_input, which calls check_array or check_X_y implicitly and passes **check_params on down. Working on a PR now.

erikcs commented 3 years ago

Yes, but isn't the issue with that that only X should allow NaN, not the response y?

erikcs commented 3 years ago

Never mind, I didn't read the docstring carefully:

Whether to raise an error on np.inf, np.nan, pd.NA in X. This parameter does not influence whether y can have np.inf, np.nan, pd.NA values
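A quick way to check the behaviour described in that docstring, as a sketch only. The helper y_nan_rejected is hypothetical, and the parameter-name lookup assumes scikit-learn 1.6+ uses ensure_all_finite where older releases use force_all_finite.

```python
import inspect

import numpy as np
from sklearn.utils import check_X_y

# Assumption: pick ensure_all_finite on newer scikit-learn,
# force_all_finite on older releases.
_params = inspect.signature(check_X_y).parameters
NAN_KW = "ensure_all_finite" if "ensure_all_finite" in _params else "force_all_finite"

X = np.array([[1.0, np.nan], [2.0, 3.0]])  # NaN in X is tolerated
y_ok = np.array([0.5, 1.5])
y_bad = np.array([0.5, np.nan])  # NaN in the response


def y_nan_rejected(X, y):
    """Return True if check_X_y raises ValueError for this (X, y) pair."""
    try:
        check_X_y(X, y, **{NAN_KW: "allow-nan"})
    except ValueError:
        return True
    return False


rejected = y_nan_rejected(X, y_bad)     # NaN in y should still raise
accepted = not y_nan_rejected(X, y_ok)  # NaN only in X should pass
```

So the 'allow-nan' setting relaxes the check for X only, and a NaN in y is still reported as an error.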

crflynn commented 3 years ago

This should handle it: https://github.com/crflynn/skgrf/pull/79