AxeldeRomblay / MLBox

MLBox is a powerful Automated Machine Learning python library.
https://mlbox.readthedocs.io/en/latest/

should python 3.7 be compatible? #72

Closed jlarrieux closed 4 years ago

jlarrieux commented 4 years ago

I just installed the latest version of python 3.7.4. Is there any particular reason why MLBox is not compatible with 3.7?

AxeldeRomblay commented 4 years ago

Hello @jlarrieux ! It should work, but we haven't verified that the setup and the tests pass on py37. Anyway, it will be officially supported very soon ;)
Also, if something fails on py37, please let us know! Thanks

Revo2407 commented 4 years ago

Hey @AxeldeRomblay ! Thanks for the great package. But yeah the same problem as mentioned above.

jimthompson5802 commented 4 years ago

Just to add specificity to this thread.

I am building an mlbox Docker image using continuumio/miniconda3:latest as the base, which contains Python 3.7. (NOTE: No problems are encountered when using the older continuumio/miniconda3:4.3.27, which contains Py3.6.)

When I attempt to install mlbox with Py3.7, I encounter this error:

#5 86.39     building 'sklearn.cluster._dbscan_inner' extension
#5 86.39     compiling C++ sources
#5 86.39     C compiler: g++ -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC
#5 86.39     
#5 86.39     creating build/temp.linux-x86_64-3.7/sklearn/cluster
#5 86.39     compile options: '-I/opt/conda/lib/python3.7/site-packages/numpy/core/include -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -I/opt/conda/include/python3.7m -c'
#5 86.39     g++: sklearn/cluster/_dbscan_inner.cpp
#5 86.39     In file included from /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1830:0,
#5 86.39                      from /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
#5 86.39                      from /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
#5 86.39                      from sklearn/cluster/_dbscan_inner.cpp:470:
#5 86.39     /opt/conda/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#5 86.39      #warning "Using deprecated NumPy API, disable it with " \
#5 86.39       ^~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp: In function ‘void __Pyx__ExceptionSave(PyThreadState*, PyObject**, PyObject**, PyObject**)’:
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5899:21: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
#5 86.39          *type = tstate->exc_type;
#5 86.39                          ^~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5900:22: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_value’; did you mean ‘curexc_value’?
#5 86.39          *value = tstate->exc_value;
#5 86.39                           ^~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5901:19: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
#5 86.39          *tb = tstate->exc_traceback;
#5 86.39                        ^~~~~~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp: In function ‘void __Pyx__ExceptionReset(PyThreadState*, PyObject*, PyObject*, PyObject*)’:
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5908:24: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
#5 86.39          tmp_type = tstate->exc_type;
#5 86.39                             ^~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5909:25: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_value’; did you mean ‘curexc_value’?
#5 86.39          tmp_value = tstate->exc_value;
#5 86.39                              ^~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5910:22: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
#5 86.39          tmp_tb = tstate->exc_traceback;
#5 86.39                           ^~~~~~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5911:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
#5 86.39          tstate->exc_type = type;
#5 86.39                  ^~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5912:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_value’; did you mean ‘curexc_value’?
#5 86.39          tstate->exc_value = value;
#5 86.39                  ^~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5913:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
#5 86.39          tstate->exc_traceback = tb;
#5 86.39                  ^~~~~~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp: In function ‘int __Pyx__GetException(PyThreadState*, PyObject**, PyObject**, PyObject**)’:
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5968:24: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
#5 86.39          tmp_type = tstate->exc_type;
#5 86.39                             ^~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5969:25: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_value’; did you mean ‘curexc_value’?
#5 86.39          tmp_value = tstate->exc_value;
#5 86.39                              ^~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5970:22: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
#5 86.39          tmp_tb = tstate->exc_traceback;
#5 86.39                           ^~~~~~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5971:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
#5 86.39          tstate->exc_type = local_type;
#5 86.39                  ^~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5972:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_value’; did you mean ‘curexc_value’?
#5 86.39          tstate->exc_value = local_value;
#5 86.39                  ^~~~~~~~~
#5 86.39     sklearn/cluster/_dbscan_inner.cpp:5973:13: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_traceback’; did you mean ‘curexc_traceback’?
#5 86.39          tstate->exc_traceback = local_tb;
#5 86.39                  ^~~~~~~~~~~~~
#5 86.39     error: Command "g++ -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -I/opt/conda/include/python3.7m -c sklearn/cluster/_dbscan_inner.cpp -o build/temp.linux-x86_64-3.7/sklearn/cluster/_dbscan_inner.o -MMD -MF build/temp.linux-x86_64-3.7/sklearn/cluster/_dbscan_inner.o.d" failed with exit status 1
#5 86.39     
#5 86.39     ----------------------------------------
#5 87.00 Command "/opt/conda/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-nhyfkhn9/scikit-learn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-vu615qry/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-nhyfkhn9/scikit-learn/
#5 ERROR: executor failed running [/bin/sh -c apt-get update &&     apt-get install -y build-essential &&     pip install mlflow mlbox]: exit code: 1
------
 > [2/3] RUN apt-get update &&     apt-get install -y build-essential &&     pip install mlflow mlbox:
------
executor failed running [/bin/sh -c apt-get update &&     apt-get install -y build-essential &&     pip install mlflow mlbox]: exit code: 1

If needed, the full build log is attached: docker_build.txt

And this is my Dockerfile that encounters the error

FROM continuumio/miniconda3:latest

# 
# install additional packages
#
RUN apt-get update && \
    apt-get install -y build-essential && \
    pip install mlflow mlbox

WORKDIR /opt/project
ENV MLFLOW_TRACKING_URI /opt/project/tracking

jimthompson5802 commented 4 years ago

@AxeldeRomblay In researching this, I found this issue in sklearn and this issue. The cited problem symptoms are the same. It appears the problem is the version of sklearn called out by the dependency.

manugarri commented 4 years ago

Overriding the scikit-learn 0.19.0 pin with 0.20.0 (the first py37-compatible version) fixes that installation error; however, a similar error happens afterwards with pandas.

manugarri commented 4 years ago

One brute-force method that made the installation work for me on Python 3.7 was to overwrite the repo's requirements.txt with this:

numpy
scipy
matplotlib
hyperopt
Keras
pandas
joblib
scikit-learn
tensorflow
lightgbm
networkx
tables
xlrd

I still need to test the installation to make sure it actually works, since this is a very bad way to set up a package.

jimthompson5802 commented 4 years ago

@manugarri Like you, I've been fiddling with the requirements.txt file, though not as aggressively. I was able to get mlbox installed on a Python 3.7 Docker image, but the unit tests failed because one of the packages I left alone was not compatible with Py3.7. Right now I'm looking at ways of enhancing the requirements.txt file to support 3.5, 3.6 and 3.7 environments. Once I work out the specifications, I'll submit a PR.

manugarri commented 4 years ago

great @jimthompson5802 , would you mind sharing what you have so far?

jimthompson5802 commented 4 years ago

@manugarri Some good news and bad news. Good news: This requirements.txt is able to build mlbox on both Py3.6 and Py3.7.

numpy==1.16.3
scipy==1.2.1
matplotlib==2.2.4

hyperopt==0.1; python_version<'3.7'
hyperopt; python_version=='3.7'

Keras==2.1.2

pandas==0.21.0; python_version<'3.7'
pandas; python_version=='3.7'

joblib==0.11

scikit-learn==0.19.0; python_version<'3.7'
scikit-learn; python_version=='3.7'

tensorflow==1.13.1
lightgbm==2.2.3
networkx==1.11
tables==3.5.2
xlrd==1.2.
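For readers unfamiliar with the `; python_version...` suffixes above: they are environment markers, which tell pip to apply a line only on matching interpreters. The following is a minimal sketch of that selection logic, not pip's actual implementation. It handles only simple `python_version` comparisons, and the names `select_requirements` and `REQUIREMENTS` are made up for illustration.

```python
import operator
import re

# A subset of the requirement lines from the file above.
REQUIREMENTS = """\
numpy==1.16.3
hyperopt==0.1; python_version<'3.7'
hyperopt; python_version=='3.7'
pandas==0.21.0; python_version<'3.7'
pandas; python_version=='3.7'
scikit-learn==0.19.0; python_version<'3.7'
scikit-learn; python_version=='3.7'
"""

# Simplified marker grammar: only `python_version <op> 'X.Y'` is supported here.
_MARKER = re.compile(r"python_version\s*(==|<=|>=|<|>)\s*'(\d+)\.(\d+)'")
_OPS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def select_requirements(text, python_version):
    """Return the specifiers that apply to python_version, a (major, minor) tuple."""
    selected = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        spec, _, marker = line.partition(";")
        marker = marker.strip()
        if marker:
            match = _MARKER.fullmatch(marker)
            if match is None:
                raise ValueError(f"unsupported marker: {marker!r}")
            op, major, minor = match.groups()
            if not _OPS[op](python_version, (int(major), int(minor))):
                continue  # marker does not match this interpreter
        selected.append(spec.strip())
    return selected


print(select_requirements(REQUIREMENTS, (3, 6)))
print(select_requirements(REQUIREMENTS, (3, 7)))
```

Under this reading, Py3.6 gets the pinned versions and Py3.7 gets the unpinned ones. An interpreter newer than 3.7 would match neither marker, so those packages would not be listed for it at all.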

On Py3.6 all 90 unit tests pass.

Now the bad news. Under py3.7, 1 of the 90 unit tests fails. This is the subset of the unit tests that encounters a problem. The optimizer returns -inf when strategy='LightGBM'.

(base) root@c7cb9f51ddba:/opt/project/tests# pytest -v test_optimiser.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.7.3, pytest-5.0.1, py-1.8.0, pluggy-0.12.0 -- /opt/conda/bin/python
cachedir: .pytest_cache
rootdir: /opt/project
collected 6 items

test_optimiser.py::test_init_optimiser PASSED                                                                                    [ 16%]
test_optimiser.py::test_get_params_optimiser PASSED                                                                              [ 33%]
test_optimiser.py::test_set_params_optimiser PASSED                                                                              [ 50%]
test_optimiser.py::test_evaluate_classification_optimiser FAILED                                                                 [ 66%]
test_optimiser.py::test_evaluate_regression_optimiser PASSED                                                                     [ 83%]
test_optimiser.py::test_evaluate_and_optimise_classification PASSED                                                              [100%]

=============================================================== FAILURES ===============================================================
________________________________________________ test_evaluate_classification_optimiser ________________________________________________

    def test_evaluate_classification_optimiser():
        """Test evaluate method of Optimiser class for classication."""
        reader = Reader(sep=",")
        dict = reader.train_test_split(Lpath=["data_for_tests/train.csv",
                                              "data_for_tests/test.csv"],
                                       target_name="Survived")
        drift_thresholder = Drift_thresholder()
        drift_thresholder = drift_thresholder.fit_transform(dict)

        with pytest.warns(UserWarning) as record:
            opt = Optimiser(scoring=None, n_folds=3)
        assert len(record) == 1
        score = opt.evaluate(None, dict)
        assert -np.Inf <= score

        with pytest.warns(UserWarning) as record:
            opt = Optimiser(scoring="roc_auc", n_folds=3)
        assert len(record) == 1
        score = opt.evaluate(None, dict)
>       assert 0. <= score <= 1.
E       assert 0.0 <= -inf

test_optimiser.py:80: AssertionError
--------------------------------------------------------- Captured stdout call ---------------------------------------------------------

reading csv : train.csv ...
cleaning data ...
CPU time: 0.4243955612182617 seconds

reading csv : test.csv ...
cleaning data ...
CPU time: 0.43333911895751953 seconds

> Number of common features : 11

gathering and crunching for train and test datasets ...
reindexing for train and test datasets ...
dropping training duplicates ...
dropping constant variables on training set ...

> Number of categorical features: 5
> Number of numerical features: 6
> Number of training samples : 891
> Number of test samples : 418

> Top sparse features (% missing values on train set):
Cabin       77.1
Age         19.9
Embarked     0.2
dtype: float64

> Task : classification
0.0    549
1.0    342
Name: Survived, dtype: int64

encoding target ...

computing drifts ...
CPU time: 0.34601926803588867 seconds

> Top 10 drifts

('PassengerId', 0.9976076555023923)
('Name', 0.9896048912939972)
('Ticket', 0.6639851080864305)
('Cabin', 0.17448513424346968)
('Embarked', 0.07253166146860801)
('Pclass', 0.07092171378991874)
('Age', 0.04932010984509971)
('Fare', 0.04106444202455006)
('Parch', 0.03875259370548334)
('SibSp', 0.038127922627237076)

> Deleted variables : ['Name', 'PassengerId', 'Ticket']
> Drift coefficients dumped into directory : save
No parameters set. Default configuration is tested

##################################################### testing hyper-parameters... #####################################################

>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'label_encoding'}

>>> ESTIMATOR :{'strategy': 'LightGBM', 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 0.9, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': -1, 'seed': 0}

MEAN SCORE : log_loss = -inf
VARIANCE : nan (fold 1 = -inf, fold 2 = -inf, fold 3 = -inf)
CPU time: 0.0018298625946044922 seconds

No parameters set. Default configuration is tested

##################################################### testing hyper-parameters... #####################################################

>>> NA ENCODER :{'numerical_strategy': 'mean', 'categorical_strategy': '<NULL>'}

>>> CA ENCODER :{'strategy': 'label_encoding'}

>>> ESTIMATOR :{'strategy': 'LightGBM', 'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'split', 'learning_rate': 0.05, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 500, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': True, 'subsample': 0.9, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'nthread': -1, 'seed': 0}

[LightGBM] [Warning] num_threads is set with nthread=-1, will be overridden by n_jobs=-1. Current value: num_threads=-1

MEAN SCORE : roc_auc = -inf
VARIANCE : nan (fold 1 = -inf, fold 2 = -inf, fold 3 = -inf)
CPU time: 0.5437860488891602 seconds

I'm digging around to see if I can find root cause. So far, no luck. Any suggestions will be appreciated.
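As a small aid for reading the log above, this sketch reproduces the arithmetic only, not MLBox's code. It assumes the reported mean and variance are a plain average and a population variance over the three fold scores; MLBox may compute them differently, but any variance over identical `-inf` values comes out as `nan`.

```python
import math

# Fold scores as reported in the log above: every fold came back as -inf.
fold_scores = [-math.inf, -math.inf, -math.inf]

mean_score = sum(fold_scores) / len(fold_scores)

# (-inf) - (-inf) is undefined, so every squared deviation is nan.
variance = sum((s - mean_score) ** 2 for s in fold_scores) / len(fold_scores)

# Mirrors `assert -np.Inf <= score` from the test (the log_loss case).
first_assert_holds = -math.inf <= mean_score

# Mirrors `assert 0. <= score <= 1.` from the test (the roc_auc case).
second_assert_holds = 0.0 <= mean_score <= 1.0

print(mean_score, variance, first_assert_holds, second_assert_holds)
# prints: -inf nan True False
```

This is why only the second assertion fails: `-inf` satisfies the first check, so the `log_loss` run returning `-inf` goes unnoticed, and the failure only surfaces on the `roc_auc` range check.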

manugarri commented 4 years ago

Awesome, that is much more robust! I am going to test that locally.

BTW, I've noticed a lot of NaNs and infs in the evaluation results as well. Is that expected behaviour, or is it due to NaNs that are not cleaned up by the train_test_split function?

AxeldeRomblay commented 4 years ago

Hello, sorry for the late answer...

Thanks for opening the issue and @jimthompson5802 for investigating it :) Actually I was already working on the requirements to upgrade all the packages, see: https://github.com/AxeldeRomblay/MLBox/blob/0.8.1/requirements.txt

The test that fails is due to an invalid metric ("roc_auc" here in the binary classification), so I have modified the code (https://github.com/AxeldeRomblay/MLBox/blob/0.8.1/mlbox/optimisation/optimiser.py#L220). It will be pushed soon !

manugarri commented 4 years ago

neat!, thanks @AxeldeRomblay

AxeldeRomblay commented 4 years ago

Hello everyone, Good news : MLBox 0.8.1 has just been released on PyPI and the issues are fixed (it is also compatible with python 3.7). Enjoy !