AlexRodis / bayesian-models

A small library built on top of `pymc` that implements many common models
Apache License 2.0

[BUG]: `BEST` raises `KeyError` #60

Open AlexRodis opened 1 year ago

AlexRodis commented 1 year ago

The following error is raised in a JupyterLab environment. `data_df` is an unremarkable `pandas.DataFrame` whose first column is the grouping variable (as a string). Strangely, the error is only raised some of the time. Providing other arrays with the same structure does not trigger it.

data_df = tidy_multiindex(data_df)
obj = BEST()(data_df, 'origin.cultivation.altitude')
obj.fit()
obj.predict(ropes=[(-.5,.5)], hdis=[.95])

/media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/bayesian_models.py:425: UserWarning: Warning! The input DataFrame contains missing or invalid values. Set the value of `nan_handling` to control how these values are handled. Current flag: "exclude"
  warn(('Warning! The input DataFrame contains missing or '

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
   3802 try:
-> 3803     return self._engine.get_loc(casted_key)
   3804 except KeyError as err:

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'origin.cultivation.altitude'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[163], line 2
      1 data_df = tidy_multiindex(data_df)
----> 2 obj = BEST()(data_df, 'origin.cultivation.altitude')
      3 obj.fit()
      4 obj.predict(ropes=[(-.5,.5)], hdis=[.95])

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/bayesian_models.py:952, in BEST.__call__(self, data, group_var)
    929 '''
    930     Initialized the full probability model
    931 
   (...)
    948         - obj:BEST := The object
    949 '''
    950 data =self.tidify_data(data) if self.tidify_data is not None \
    951     else data
--> 952 self._preprocessing_(data, group_var)
    953 if self.scaler is not None:
    954     data = self.scaler(data.loc[:,self.features])

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/bayesian_models.py:784, in BEST._preprocessing_(self, data, group_var)
    763 '''
    764     Handled data preprocessing steps by 1. checking and 
    765     handling missing values, 2. collapsing multiindices
   (...)
    780         - None
    781 '''
    782 self.nan_present_flag = BEST.check_missing_nan(
    783     data, self.nan_handling)
--> 784 self.levels = data.loc[:,group_var].dropna().unique()
    785 self.num_levels=len(self.levels)
    786 self.features = data.columns.difference([group_var])

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexing.py:1067, in _LocationIndexer.__getitem__(self, key)
   1065     if self._is_scalar_access(key):
   1066         return self.obj._get_value(*key, takeable=self._takeable)
-> 1067     return self._getitem_tuple(key)
   1068 else:
   1069     # we by definition only have the 0th axis
   1070     axis = self.axis or 0

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexing.py:1247, in _LocIndexer._getitem_tuple(self, tup)
   1245 with suppress(IndexingError):
   1246     tup = self._expand_ellipsis(tup)
-> 1247     return self._getitem_lowerdim(tup)
   1249 # no multi-index, so validate all of the indexers
   1250 tup = self._validate_tuple_indexer(tup)

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexing.py:967, in _LocationIndexer._getitem_lowerdim(self, tup)
    963 for i, key in enumerate(tup):
    964     if is_label_like(key):
    965         # We don't need to check for tuples here because those are
    966         #  caught by the _is_nested_tuple_indexer check above.
--> 967         section = self._getitem_axis(key, axis=i)
    969         # We should never have a scalar section here, because
    970         #  _getitem_lowerdim is only called after a check for
    971         #  is_scalar_access, which that would be.
    972         if section.ndim == self.ndim:
    973             # we're in the middle of slicing through a MultiIndex
    974             # revise the key wrt to `section` by inserting an _NS

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexing.py:1312, in _LocIndexer._getitem_axis(self, key, axis)
   1310 # fall thru to straight lookup
   1311 self._validate_key(key, axis)
-> 1312 return self._get_label(key, axis=axis)

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexing.py:1260, in _LocIndexer._get_label(self, label, axis)
   1258 def _get_label(self, label, axis: int):
   1259     # GH#5567 this will fail if the label is not present in the axis.
-> 1260     return self.obj.xs(label, axis=axis)

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/generic.py:4041, in NDFrame.xs(self, key, axis, level, drop_level)
   4039 if axis == 1:
   4040     if drop_level:
-> 4041         return self[key]
   4042     index = self.columns
   4043 else:

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File /media/alexander-fyrogenis/Elements/Διδακτορικό/Olive Oil/notebooks/olives_env/lib/python3.10/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key, method, tolerance)
   3803     return self._engine.get_loc(casted_key)
   3804 except KeyError as err:
-> 3805     raise KeyError(key) from err
   3806 except TypeError:
   3807     # If we have a listlike key, _check_indexing_error will raise
   3808     #  InvalidIndexError. Otherwise we fall through and re-raise
   3809     #  the TypeError.
   3810     self._check_indexing_error(key)

KeyError: 'origin.cultivation.altitude'

AlexRodis commented 1 year ago

Unable to reproduce the issue with the pickled structure. May be related to #57.
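Since the traceback shows the lookup failing at `data.loc[:,group_var]` because the label is absent from `data.columns`, one way to narrow this down the next time it occurs is to inspect the column labels right before the lookup. The helper below is only a diagnostic sketch: `require_column` is a hypothetical name and not part of `bayesian_models`, and it assumes the columns are expected to be a flat index of strings at that point.

```python
import difflib

import pandas as pd


def require_column(data: pd.DataFrame, group_var: str) -> None:
    """Raise a descriptive KeyError if `group_var` is not a column label.

    Hypothetical diagnostic helper, not part of bayesian_models.
    """
    # A MultiIndex here would mean the columns were never collapsed.
    if isinstance(data.columns, pd.MultiIndex):
        raise KeyError(
            f"{group_var!r} looked up on MultiIndex columns with "
            f"{data.columns.nlevels} levels"
        )
    if group_var in data.columns:
        return
    # Report what is actually present, plus near misses such as
    # stray whitespace or a differently joined label.
    labels = [str(c) for c in data.columns]
    close = difflib.get_close_matches(group_var, labels, n=3)
    raise KeyError(
        f"{group_var!r} not in columns; closest labels: {close}; "
        f"all labels: {labels}"
    )
```

Calling `require_column(data_df, 'origin.cultivation.altitude')` immediately before `BEST()(data_df, ...)` would show which labels the frame actually carries in the failing runs.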

AlexRodis commented 1 year ago

Partially fixed. Due to `pymc` limitations, the categorical has to be encoded as an int and the resulting array cast to floats.
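For reference, the encoding described above can be sketched roughly as follows. This is an illustration using `pandas.factorize`, not the code actually committed to the library, and `encode_groups` is a hypothetical name. Missing group labels are mapped to `NaN` here, which is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd


def encode_groups(groups: pd.Series):
    """Encode string group labels as integer codes, then cast to float.

    Hypothetical helper illustrating the workaround, not library code.
    """
    # factorize returns integer codes plus the unique levels;
    # missing values receive the sentinel code -1.
    codes, levels = pd.factorize(groups, sort=True)
    encoded = codes.astype(float)
    # Assumption: missing labels should stay missing after encoding.
    encoded[codes == -1] = np.nan
    return encoded, levels
```

The returned `levels` index maps each code back to its original label, so `levels[int(code)]` recovers the group name for any non-missing entry.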

AlexRodis commented 1 year ago

Issue reappeared