Ekeany / Boruta-Shap

A tree-based feature selection tool that combines the Boruta feature selection algorithm with Shapley values.
MIT License

[BUG] BorutaShap import error. [Main cause: Boston dataset] #118

Closed: rodrigopasqualucci closed this issue 9 months ago

rodrigopasqualucci commented 1 year ago

Describe the bug

When I import the BorutaShap library in a Jupyter notebook and run the cell, an error message appears below the code cell reporting an import error.

To Reproduce

Steps to reproduce the behavior:

  1. In a Jupyter code cell, type: from BorutaShap import BorutaShap
  2. Run the cell
  3. An import error message appears below the code cell
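
The steps above can also be reproduced outside a notebook. The following is only a minimal sketch: the helper name try_import is hypothetical (it is not part of BorutaShap or scikit-learn), and it simply reports whether a module imports cleanly and, if not, the error text.

```python
import importlib


def try_import(module_name):
    """Attempt to import a module; return (succeeded, error message or None).

    Hypothetical helper for illustration only.
    """
    try:
        importlib.import_module(module_name)
        return True, None
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        return False, str(exc)


if __name__ == "__main__":
    # Reports the failure text instead of stopping at the traceback.
    ok, error = try_import("BorutaShap")
    print("import succeeded" if ok else f"import failed: {error}")
```

If the environment matches the one in this report, the printed message should begin with the load_boston text shown under Screenshots; in an environment without BorutaShap installed it will instead report a missing module.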

Expected behavior

Importing the lib BorutaShap without any failure.

Screenshots


ImportError                               Traceback (most recent call last)
in <cell line: 21>()
     19 from sklearn.metrics import roc_auc_score
     20 from boruta import BorutaPy
---> 21 from BorutaShap import BorutaShap
     22 from matplotlib import pyplot as plt
     23 from preprocessing.tratamentos_categoricas import RankCountVectorizer

1 frames
/usr/local/lib/python3.10/dist-packages/sklearn/datasets/__init__.py in __getattr__(name)
    154             """
    155         )
--> 156         raise ImportError(msg)
    157     try:
    158         return globals()[name]

ImportError: load_boston has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle. "Racist data destruction?" https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8

[2] Harrison Jr, David, and Daniel L. Rubinfeld. "Hedonic housing prices and the demand for clean air." Journal of environmental economics and management 5.1 (1978): 81-102. https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air

Additional context

The ImportError message says the problem is the Boston dataset, which was removed from scikit-learn in version 1.2. However, my code does not load the Boston dataset; it only imports the BorutaShap library.
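
One reading consistent with the traceback is that the load_boston lookup happens inside a dependency at import time rather than in the notebook code, so the installed scikit-learn version decides whether the import succeeds. The sketch below checks a version string against the 1.2 cutoff stated in the error message. The function name boston_removed is hypothetical and used only for illustration.

```python
import re


def boston_removed(sklearn_version):
    """Return True if the version is 1.2 or later, i.e. load_boston is gone.

    Hypothetical helper; the 1.2 cutoff is taken from the error message.
    """
    match = re.match(r"(\d+)\.(\d+)", sklearn_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {sklearn_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 2)


if __name__ == "__main__":
    try:
        import sklearn
    except ImportError:
        print("scikit-learn is not installed")
    else:
        print(sklearn.__version__, "load_boston removed:",
              boston_removed(sklearn.__version__))
```

A True result means any package that still imports load_boston at module level will fail to import in that environment, even if your own code never touches the dataset.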

dandavies99 commented 1 year ago

FYI @rodrigopasqualucci see discussion in #111

rodrigopasqualucci commented 1 year ago

@dandavies99 Thank you very much. I will definitely take a look!