csinva / imodels

Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
https://csinva.io/imodels
MIT License
1.36k stars 121 forks

ExtraBasicDiscretizer not working with scikit-learn 1.4 #211

Open jose-matos opened 4 weeks ago

jose-matos commented 4 weeks ago

First, thank you for your work. I appreciate it. :-)

I ran the tutorial, and exactly one example fails: the one that uses ExtraBasicDiscretizer:

disc = ExtraBasicDiscretizer(feat_names[:3], n_bins=3, strategy='uniform')
X_train_brl_df = disc.fit_transform(pd.DataFrame(X_train[:, :3], columns=feat_names[:3]))
X_test_brl_df = disc.transform(pd.DataFrame(X_test[:, :3], columns=feat_names[:3]))

The problem occurs in the second and third lines:

When calling X_train_brl_df = disc.fit_transform(pd.DataFrame(X_train[:, :3], columns=feat_names[:3])) I get:

/usr/lib64/python3.13/site-packages/sklearn/preprocessing/_discretization.py:248: FutureWarning: In version 1.5 onwards, subsample=200_000 will be used by default. Set subsample explicitly to silence this warning in the mean time. Set subsample=None to disable subsampling explicitly.
  warnings.warn(

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[17], line 1
----> 1 X_train_brl_df = disc.fit_transform(pd.DataFrame(X_train[:, :3], columns=feat_names[:3]))

File /usr/lib64/python3.13/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    293 @wraps(f)
    294 def wrapped(self, X, *args, **kwargs):
--> 295     data_to_wrap = f(self, X, *args, **kwargs)
    296     if isinstance(data_to_wrap, tuple):
    297         # only wrap the first output for cross decomposition
    298         return_tuple = (
    299             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    300             *data_to_wrap[1:],
    301         )

File /usr/lib64/python3.13/site-packages/sklearn/base.py:1098, in TransformerMixin.fit_transform(self, X, y, **fit_params)
   1083         warnings.warn(
   1084             (
   1085                 f"This object ({self.__class__.__name__}) has a `transform`"
   (...)
   1093             UserWarning,
   1094         )
   1096 if y is None:
   1097     # fit method of arity 1 (unsupervised transformation)
-> 1098     return self.fit(X, **fit_params).transform(X)
   1099 else:
   1100     # fit method of arity 2 (supervised transformation)
   1101     return self.fit(X, y, **fit_params).transform(X)

File /usr/lib64/python3.13/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    293 @wraps(f)
    294 def wrapped(self, X, *args, **kwargs):
--> 295     data_to_wrap = f(self, X, *args, **kwargs)
    296     if isinstance(data_to_wrap, tuple):
    297         # only wrap the first output for cross decomposition
    298         return_tuple = (
    299             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    300             *data_to_wrap[1:],
    301         )

File /usr/lib/python3.13/site-packages/imodels/discretization/discretizer.py:391, in ExtraBasicDiscretizer.transform(self, X)
    389 # One-hot encode the ordinal DF
    390 disc_onehot_np = self.encoder_.transform(disc_ordinal_df_str)
--> 391 disc_onehot = pd.DataFrame(
    392     disc_onehot_np, columns=self.encoder_.get_feature_names_out())
    394 # Name columns after the interval they represent (e.g. 0.1_to_0.5)
    395 for col, bin_edges in zip(self.dcols, self.discretizer_.bin_edges_):

File /usr/lib64/python3.13/site-packages/pandas/core/frame.py:856, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    848         mgr = arrays_to_mgr(
    849             arrays,
    850             columns,
   (...)
    853             typ=manager,
    854         )
    855     else:
--> 856         mgr = ndarray_to_mgr(
    857             data,
    858             index,
    859             columns,
    860             dtype=dtype,
    861             copy=copy,
    862             typ=manager,
    863         )
    864 else:
    865     mgr = dict_to_mgr(
    866         {},
    867         index,
   (...)
    870         typ=manager,
    871     )

File /usr/lib64/python3.13/site-packages/pandas/core/internals/construction.py:336, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    331 # _prep_ndarraylike ensures that values.ndim == 2 at this point
    332 index, columns = _get_axes(
    333     values.shape[0], values.shape[1], index=index, columns=columns
    334 )
--> 336 _check_values_indices_shape_match(values, index, columns)
    338 if typ == "array":
    339     if issubclass(values.dtype.type, str):

File /usr/lib64/python3.13/site-packages/pandas/core/internals/construction.py:420, in _check_values_indices_shape_match(values, index, columns)
    418 passed = values.shape
    419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (192, 1), indices imply (192, 9)

If I run the third line before the second, X_test_brl_df = disc.transform(pd.DataFrame(X_test[:, :3], columns=feat_names[:3])), I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[16], line 2
      1 disc = ExtraBasicDiscretizer(feat_names[:3], n_bins=3, strategy='uniform')
----> 2 X_test_brl_df = disc.transform(pd.DataFrame(X_test[:, :3], columns=feat_names[:3]))

File /usr/lib64/python3.13/site-packages/sklearn/utils/_set_output.py:295, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    293 @wraps(f)
    294 def wrapped(self, X, *args, **kwargs):
--> 295     data_to_wrap = f(self, X, *args, **kwargs)
    296     if isinstance(data_to_wrap, tuple):
    297         # only wrap the first output for cross decomposition
    298         return_tuple = (
    299             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    300             *data_to_wrap[1:],
    301         )

File /usr/lib/python3.13/site-packages/imodels/discretization/discretizer.py:385, in ExtraBasicDiscretizer.transform(self, X)
    369 """
    370 Discretize the data.
    371 
   (...)
    381     binned space. All other features remain unchanged.
    382 """
    384 # Apply discretizer transform to get ordinally coded DF
--> 385 disc_ordinal_np = self.discretizer_.transform(X[self.dcols])
    386 disc_ordinal_df = pd.DataFrame(disc_ordinal_np, columns=self.dcols)
    387 disc_ordinal_df_str = disc_ordinal_df.astype(int).astype(str)

AttributeError: 'ExtraBasicDiscretizer' object has no attribute 'discretizer_'

OK, in hindsight I understand why this fails: the discretizer has not been fitted (there is no fit call before transform). If I run the third line after the second, I get an error similar to the one from the second line.
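
The fit-before-transform requirement can be illustrated with scikit-learn's own KBinsDiscretizer, used here as a stand-in for ExtraBasicDiscretizer. This is a small sketch on synthetic data; note that scikit-learn estimators raise NotFittedError in this situation, whereas the traceback above shows a plain AttributeError:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic train/test data; shapes are arbitrary.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_test = rng.normal(size=(20, 3))

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")

# Calling transform on an unfitted estimator fails.
try:
    binner.transform(X_test)
    raised = False
except NotFittedError:
    raised = True
print("transform before fit raised NotFittedError:", raised)

# After fitting on the training data, transform works on the test data.
binner.fit(X_train)
X_test_binned = binner.transform(X_test)
print(X_test_binned.shape)
```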