EvgeniDubov / hellinger-distance-criterion

Random Forest model using Hellinger Distance as split criterion
BSD 3-Clause "New" or "Revised" License

Error installing hellinger-distance criterion #2

Closed wptmdoorn closed 6 years ago

wptmdoorn commented 6 years ago

Hi Evgeni,

First and foremost, thank you for making this publicly available! Also, good luck with your remaining work on getting this into the imblearn package; that is a great effort!

I have been trying to install your package, but so far without success. Could you please look into my issue? The steps I took:

  1. Downloaded the _criterion.pxd file from the scikit-learn GitHub repository and put it into the tree/ folder of my sklearn installation.
  2. Cloned your repository and tried to install it via "python setup.py build_ext --inplace".

I am receiving the following error (only the first part is shown; what follows is mostly declaration errors, which are a logical consequence of this one):

hellinger_distance_criterion.pyx: cannot find cimported module 'sklearn.tree._criterion'
Compiling hellinger_distance_criterion.pyx because it changed.
[1/1] Cythonizing hellinger_distance_criterion.pyx
/home/wptmdoorn/anaconda3/envs/2018_SEH/lib/python3.6/site-packages/Cython/Compiler/Main.py:367: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /home/wptmdoorn/hellinger-distance-criterion/hellinger_distance_criterion.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)

Error compiling Cython file:
------------------------------------------------------------
...

# Author: Evgeni Dubov <evgeni.dubov@gmail.com>
#
# License: MIT

from sklearn.tree._criterion cimport ClassificationCriterion
^
------------------------------------------------------------

hellinger_distance_criterion.pyx:6:0: 'sklearn/tree/_criterion.pxd' not found

Error compiling Cython file:
------------------------------------------------------------

It seems it cannot find the files supplied. I therefore re-checked, and also added the _tree.pxd and _tree.pyx files from the original scikit-learn repository, but this did not fix the problem. An overview of my sklearn/tree/ directory:

_criterion.cpython-37m-x86_64-linux-gnu.so  __pycache__                                _tree.pxd
_criterion.pxd                              setup.py                                   tree.py
_criterion.pyx                              _splitter.cpython-37m-x86_64-linux-gnu.so  _tree.pyx
export.py                                   tests                                      _utils.cpython-37m-x86_64-linux-gnu.so
__init__.py                                 _tree.cpython-37m-x86_64-linux-gnu.so

Would you have any idea what is going on here?

Thanks a lot in advance!

EvgeniDubov commented 6 years ago

Thanks for the feedback :)

Looking at the info you provided, my assumption is that you copied _criterion.pxd into the sklearn folder of your main Python environment, while you are actually working in a virtual environment.
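One way to check this assumption is to ask the interpreter of the active environment where its own sklearn lives, and whether the .pxd file is really there. The snippet below is only a minimal sketch: the helper names find_package_dir and has_file are made up for illustration, and it assumes sklearn is installed as a regular on-disk package.

```python
import importlib.util
import sys
from pathlib import Path


def find_package_dir(package_name):
    """Return the on-disk directory of a package as seen by THIS interpreter, or None."""
    try:
        spec = importlib.util.find_spec(package_name)
    except ImportError:
        # Raised when a parent package (e.g. sklearn) is missing in this environment.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def has_file(package_name, file_name):
    """Check whether file_name sits directly inside the package directory."""
    package_dir = find_package_dir(package_name)
    return package_dir is not None and (package_dir / file_name).is_file()


if __name__ == "__main__":
    # Run this with the same interpreter that runs "python setup.py build_ext --inplace".
    print("Interpreter:       ", sys.executable)
    print("sklearn.tree dir:  ", find_package_dir("sklearn.tree"))
    print("_criterion.pxd ok: ", has_file("sklearn.tree", "_criterion.pxd"))
```

If the printed directory is not the one the file was copied into, the copy went to a different environment than the one doing the build.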

Please let me know if it works for you

wptmdoorn commented 6 years ago

It's one of those stupid things you always overlook when you have looked into basically everything else. It is working now, so I will close this issue - thank you!

Please let me know if you ever need any help with (additional) testing; I would be more than willing to help you out!

EvgeniDubov commented 6 years ago

Sure, it happens to me all the time :)

Thanks a lot for the offer!

Actually it would be very interesting to know if this algorithm brings better results in your modeling use case. But in case you're working on something that can't be publicly shared I totally understand.

wptmdoorn commented 6 years ago

For sure I would be willing to share some results! I am not sure I can share everything, but certainly enough for it to make sense ;)

I am dealing with a dataset of ~70k rows containing ~220 variables (depending on how many we select). The most interesting part is that about 80% of the variables are missing for each row (very heterogeneous). The output is binary and the class imbalance is only about 1:16 (not that extreme). I strongly believe that the Hellinger distance would bring better results. I will try to deliver you some data soon :)
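For context on why the Hellinger distance is attractive at a 1:16 imbalance, here is a small sketch of the split score as it is commonly defined for two-class splits. It is an illustration only and is not taken from the Cython code in this repository; the function name hellinger_split_score and the example counts are made up.

```python
import math


def hellinger_split_score(left_pos, left_neg, right_pos, right_neg):
    """Hellinger distance between the positive and negative class
    distributions over the two branches of a binary split."""
    total_pos = left_pos + right_pos
    total_neg = left_neg + right_neg
    if total_pos == 0 or total_neg == 0:
        return 0.0  # only one class present, nothing to separate
    score = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        # Each class is normalised by its OWN total, so class priors cancel out.
        score += (math.sqrt(pos / total_pos) - math.sqrt(neg / total_neg)) ** 2
    return math.sqrt(score)


if __name__ == "__main__":
    # Same per-class branch fractions, 10x more negatives: the score is unchanged.
    print(hellinger_split_score(6, 20, 4, 140))
    print(hellinger_split_score(6, 200, 4, 1400))
```

Because each class count is divided by that class's own total, multiplying the number of negatives by a constant leaves the score unchanged, which is the skew insensitivity usually cited for this criterion.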

EvgeniDubov commented 6 years ago

Interesting use case, hope Hellinger will bring you some added value. Thanks for sharing!

BaharZoghi commented 5 years ago

Hello,

I have Anaconda3 on my Windows system. Sklearn is installed both in Lib in Anaconda's main location and in 'envs'->'[My envs name]'->'Lib'->'site-packages'. I followed the instructions on https://github.com/EvgeniDubov/hellinger-distance-criterion. First, I cloned 'hellinger-distance-criterion' somewhere on my system. Second, I got _criterion.pxd and replaced my original _criterion.pxd with it in both the main and the virtual environment. Then I opened the 'hellinger-distance-criterion' folder in Anaconda's console and built the module using "python setup.py build_ext --inplace". Everything looked normal and no error was shown. But when I open Spyder or PyCharm and run "from hellinger_distance_criterion import HellingerDistanceCriterion", it shows "ModuleNotFoundError: No module named 'hellinger_distance_criterion'". Would you please let me know what the problem might be and how to solve it? Thank you
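A possible cause, offered as a guess rather than a confirmed diagnosis: "build_ext --inplace" leaves the compiled extension inside the cloned folder, so an IDE that starts elsewhere, or that uses a different interpreter, will not find it. The sketch below checks both points; the helper names and the placeholder path are hypothetical.

```python
import importlib.machinery
import sys
from pathlib import Path


def find_built_extension(build_dir, module_name):
    """List compiled extension files for module_name that THIS interpreter could load."""
    build_dir = Path(build_dir)
    candidates = (
        build_dir / (module_name + suffix)
        for suffix in importlib.machinery.EXTENSION_SUFFIXES  # e.g. ".pyd" or ".so" variants
    )
    return [path for path in candidates if path.is_file()]


def make_importable(build_dir):
    """Put build_dir at the front of sys.path so modules in it can be imported."""
    resolved = str(Path(build_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


if __name__ == "__main__":
    repo_dir = "path/to/hellinger-distance-criterion"  # placeholder: your clone location
    print("Interpreter:", sys.executable)
    print("Built files:", find_built_extension(repo_dir, "hellinger_distance_criterion"))
```

An empty list suggests the extension was built by a different Python version or environment than the one the IDE runs. If a file is listed, calling make_importable(repo_dir) before the import, or starting the script from inside the cloned folder, should let the import resolve.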