david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License
186 stars 38 forks source link

Models without `categ_cols` fail to export #30

Closed zds0 closed 3 years ago

zds0 commented 3 years ago

Models without categ_cols (i.e. models fit on dataframes with numeric columns only) fail to export.

To reproduce, run the python example in the README

import numpy as np
from isotree import IsolationForest

### Random data from a standard normal distribution
np.random.seed(1)
n = 100
m = 2
X = np.random.normal(size = (n, m))

### Will now add obvious outlier point (3, 3) to the data
X = np.r_[X, np.array([3, 3]).reshape((1, m))]

### Fit a small isolation forest model
iso = IsolationForest(ntrees = 10, ndim = 2, nthreads = 1)
iso.fit(X)

### Check which row has the highest outlier score
pred = iso.predict(X)
print("Point with highest outlier score: ",
      X[np.argsort(-pred)[0], ])

and then try to call export_model

iso.export_model('test_isolationforest', use_cpp=True)

the following error is thrown

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-8ede0bd56e06> in <module>
----> 1 iso.export_model('test_isolationforest', use_cpp=True)

isotree/__init__.py in export_model(self, file, use_cpp)
   2128         """
   2129         assert self.is_fitted_
-> 2130         metadata = self._export_metadata()
   2131         with open(file + ".metadata", "w") as of:
   2132             json.dump(metadata, of, indent=4)

isotree/__init__.py in _export_metadata(self)
   2346             "cols_categ" : list(self.cols_categ_),
   2347             "cat_levels" : [list(m) for m in self._cat_mapping],
-> 2348             "categ_cols" : list(self.categ_cols),
   2349             "categ_max" : list(self._cat_max_lev)
   2350         }

TypeError: 'NoneType' object is not iterable

I was able to resolve this by replacing line 2348 (in the data_info dict within export_metadata()) with the following:

            "categ_cols": [] if self.categ_cols is None else list(self.categ_cols),

However, I'm unsure if this if the complete/proper solution. Thanks.

david-cortes commented 3 years ago

Thanks for the bug report. Yes, that small change would solve the issue.

zds0 commented 3 years ago

This wasn't quite the complete solution. There's a related (and perhaps uncommon) scenario that encounters a similar bug.

When you import a saved model (extending on the example above)

iso_imp = IsolationForest.import_model('test_isolationforest', use_cpp=True)

and then re-export that same model object

iso_imp.export_model('test_isolationforest2', use_cpp=True)

the following error is thrown

TypeError                                 Traceback (most recent call last)
<ipython-input-14-a78373200393> in <module>
----> 1 iso_imp.export_model('test_isolationforest2', use_cpp=True)

isotree/__init__.py in export_model(self, file, use_cpp)
   2128         """
   2129         assert self.is_fitted_
-> 2130         metadata = self._export_metadata()
   2131         with open(file + ".metadata", "w") as of:
   2132             json.dump(metadata, of, indent=4)

isotree/__init__.py in _export_metadata(self)
   2363             "cat_levels" : [list(m) for m in self._cat_mapping],
   2364             "categ_cols" : [] if self.categ_cols is None else list(self.categ_cols),
-> 2365             "categ_max" : list(self._cat_max_lev)
   2366         }
   2367 

TypeError: 'NoneType' object is not iterable

I was able to resolve by replacing like 2365 with

"categ_max" : [] if self._cat_max_lev is None else list(self._cat_max_lev)

But again I'll leave it to you to decide if this is functionality you wish to support and to determine if this is the correct solution (I think there are a few different ways of handling).

Thanks for the great work on this research and this library!

david-cortes commented 3 years ago

Thanks again. That should solve it, but I've added a couple extra checks just in case.