david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Problem with saving trained isolation forest when `categ_cols` is not None. #32

Closed vsahil closed 3 years ago

vsahil commented 3 years ago

Hello David,

I am facing another issue when trying to save a trained isolation forest with `categ_cols` set to something other than None. When I do not provide any categorical column numbers (i.e., when it is None), the model is saved, but otherwise I get this error:


```
  File "/scratch/vsahil/data-drift-explanation/GOAD/goad-pyenv/lib/python3.6/site-packages/isotree/__init__.py", line 2132, in export_model
    json.dump(metadata, of, indent=4)
  File "/usr/lib64/python3.6/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib64/python3.6/json/encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/usr/lib64/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'int32' is not JSON serializable
```

Do you have any idea why this is happening and how I can work around it? I verified that all the values in the columns marked as categorical are integers starting at 0 (I am passing a NumPy array to the `fit` function).
david-cortes commented 3 years ago

This is solved in the latest version (0.2.7-2): `pip install -U --no-cache isotree`.

david-cortes commented 3 years ago

Actually, this is a different issue. I will fix it, but in the meantime it should work if you pass it as a list of plain (non-NumPy) integers.

david-cortes commented 3 years ago

It actually doesn't work. It seems to be an issue with NumPy>=1.18: https://github.com/numpy/numpy/issues/19069

Works in older NumPy versions though.
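
For reference, the `TypeError` at the bottom of the traceback can be reproduced without isotree, since the standard `json` module does not know how to encode NumPy integer scalars. The snippet below is a minimal sketch, assuming NumPy is installed. The `metadata` dictionary and the `to_builtin` helper are hypothetical illustrations, not the structure that `export_model` writes or the fix that was applied in the library.

```python
import json

import numpy as np

# Hypothetical metadata, NOT the real structure isotree exports.
# list() over an int32 array yields NumPy int32 scalars, not Python ints.
metadata = {"model_info": {"categ_cols": list(np.array([0, 2, 5], dtype=np.int32))}}

# Without help, json refuses to encode the NumPy scalars.
try:
    json.dumps(metadata)
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def to_builtin(obj):
    """Fallback used by json for objects it cannot encode natively."""
    if isinstance(obj, np.generic):   # NumPy scalar such as int32 or float64
        return obj.item()             # converts to the matching Python type
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# With the fallback, the same dictionary serializes cleanly.
serialized = json.dumps(metadata, default=to_builtin)
```

The `default=` hook is only called for objects the encoder cannot handle itself, so plain Python values in the same dictionary are unaffected.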

david-cortes commented 3 years ago

Fixed version now uploaded to PyPI.

vsahil commented 3 years ago

Thank you, David. I have a quick question: how does the treatment of categorical features differ from that of numerical features in isolation forests?

david-cortes commented 3 years ago

Same as in other decision tree software. You can check the docs and the reference papers for more details.
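
As a rough illustration of the usual decision-tree convention (and not isotree's actual implementation, for which the docs are the authority): a numerical column is typically split by comparing against a threshold, while a categorical column is split by sending some subset of its categories to one branch. The function names below are hypothetical, and the random choices are only meant to mimic the spirit of an isolation tree.

```python
import random


def random_numeric_split(values, rng):
    """Split numeric values with a random threshold between min and max."""
    threshold = rng.uniform(min(values), max(values))
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    return threshold, left, right


def random_categorical_split(values, rng):
    """Split categorical values by sending a random subset of categories left.

    Assumes at least two distinct categories are present.
    """
    categories = sorted(set(values))
    # Choose a non-empty, proper subset so both branches receive data.
    subset_size = rng.randint(1, len(categories) - 1)
    left_categories = set(rng.sample(categories, subset_size))
    left = [v for v in values if v in left_categories]
    right = [v for v in values if v not in left_categories]
    return left_categories, left, right


rng = random.Random(0)
numeric_column = [0.3, 1.7, 2.2, 9.5, 4.1]
categorical_column = [0, 1, 2, 1, 0, 3]  # integer-encoded categories

threshold, num_left, num_right = random_numeric_split(numeric_column, rng)
left_cats, cat_left, cat_right = random_categorical_split(categorical_column, rng)
```

The practical consequence is that the integer codes of a categorical column carry no ordering: category 3 is not "greater than" category 1, it is simply a different label.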