Ekeany / Boruta-Shap

A tree-based feature selection tool that combines the Boruta feature selection algorithm with Shapley values.

[BUG] Histogram Gradient Boosted Trees produce errors with missing values in the dataset #122

Open · cvraut opened this issue 8 months ago

cvraut commented 8 months ago

Describe the bug

BorutaShap.fit raises a ValueError when the dataset contains missing values, even though scikit-learn's Histogram Gradient Boosted Trees handle missing values natively.

To Reproduce

Steps to reproduce the behavior:

Python 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from BorutaShap import BorutaShap, load_data
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> import numpy as np
>>> 
>>> # load the test data
>>> X,y = load_data(data_type='regression')
>>> 
>>> # verify that the data was loaded correctly
>>> print(X.head())
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031988 -0.046641
>>> 
>>> # set the first value of bmi to np.nan to simulate missing data
>>> X.loc[0,'bmi'] = np.nan
>>> 
>>> # verify that the missingness was applied correctly
>>> print(X.head())
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6
0  0.038076  0.050680       NaN  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031988 -0.046641
>>> 
>>> # demonstrate HistGradientBoostingRegressor can fit the data
>>> model = HistGradientBoostingRegressor()
>>> model.fit(X,y)
HistGradientBoostingRegressor()
>>> print(model.score(X,y))
0.9319421452721947
>>> 
>>> # try to perform feature selection with missing data
>>> Feature_Selector = BorutaShap(model=model,importance_measure='shap',classification=False)
>>> 
>>> Feature_Selector.fit(X=X, y=y, n_trials=10, sample=False,train_or_test = 'test', normalize=True,verbose=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/apps/rhel8/jupyter/2023-04/lib/python3.10/site-packages/BorutaShap-1.0.14-py3.10.egg/BorutaShap.py", line 442, in fit
  File "/app/apps/rhel8/jupyter/2023-04/lib/python3.10/site-packages/BorutaShap-1.0.14-py3.10.egg/BorutaShap.py", line 283, in check_missing_values
ValueError: There are missing values in your Data

Expected behavior

I expect BorutaShap to accept missing values in the X matrix if the model is the HistGradientBoostingRegressor or HistGradientBoostingClassifier from scikit-learn.
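
One possible shape for such a fix is sketched below: make the existing check model-aware instead of unconditional. Only the method name check_missing_values and the error message come from the traceback above; the attribute names self.model and self.X and the _NAN_TOLERANT_MODELS tuple are illustrative assumptions, not BorutaShap's actual internals.

from sklearn.ensemble import (HistGradientBoostingClassifier,
                              HistGradientBoostingRegressor)

# Estimators known to accept NaN in X natively (illustrative list).
_NAN_TOLERANT_MODELS = (HistGradientBoostingClassifier,
                        HistGradientBoostingRegressor)

def check_missing_values(self):
    # Skip the up-front check when the wrapped model handles NaN itself.
    if isinstance(self.model, _NAN_TOLERANT_MODELS):
        return
    if self.X.isnull().any().any():
        raise ValueError('There are missing values in your Data')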

Additional context

Histogram Gradient Boosted Trees were added to scikit-learn as experimental estimators in version 0.21 and became stable in 1.0, and they have been officially supported by the shap package since version 0.35.0. Like XGBoost, catboost, and lightgbm trees, they handle missing values natively, albeit with their own technique: at each split, samples with missing values are routed down a learned default branch.
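
A minimal, self-contained REPL illustration of that technique (the names x, yd, demo, and pred exist only for this sketch): missingness itself can carry signal, and the model recovers it through the learned default branches.

>>> import numpy as np
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> rng = np.random.RandomState(0)
>>> x = rng.uniform(size=(200, 1))
>>> x[::4] = np.nan                    # every 4th value is missing
>>> # target is ~100 exactly when the feature is missing
>>> yd = np.where(np.isnan(x.ravel()), 100.0, x.ravel())
>>> demo = HistGradientBoostingRegressor().fit(x, yd)
>>> pred = demo.predict([[np.nan]])    # ~100: NaN is routed down its own branch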

cvraut commented 8 months ago

Hi maintainers,

I also opened a pull request that should handle this issue. Please let me know if those changes seem reasonable.
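
In the meantime, a temporary workaround for anyone blocked by this is to disable the check at runtime. This is a blunt monkey-patch sketch: it assumes check_missing_values takes no arguments besides self (as the traceback suggests), and it removes the safety net for models that cannot cope with NaN, so only use it with NaN-tolerant estimators.

>>> from BorutaShap import BorutaShap
>>> # No-op the up-front NaN check so fit() can proceed.
>>> BorutaShap.check_missing_values = lambda self: None
>>> Feature_Selector = BorutaShap(model=model, importance_measure='shap', classification=False)
>>> Feature_Selector.fit(X=X, y=y, n_trials=10, sample=False, train_or_test='test', normalize=True, verbose=True)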

Cheers, Chinmay