ClimbsRocks / auto_ml

[UNMAINTAINED] Automated machine learning for analytics & production
http://auto-ml.readthedocs.io
MIT License
1.64k stars 310 forks

error during LGBM predict_proba #336

Closed — vkocaman closed this issue 7 years ago

vkocaman commented 7 years ago

Hi all..

After long hours of training my model with LightGBM, I ran predict_proba and first hit the data rate limit in Jupyter.. I raised that limit and had to train the model again.. but this time I ran into another error:

Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

can someone help me please? thanks

ClimbsRocks commented 7 years ago

Hi @vkocaman

It's nearly impossible to debug without a stack trace. Can you please copy/paste the error message, along with all the other output that could help us debug?

As a general best practice, I like training on a small sample of the dataset (say, 1%) to make sure that things work before training on the entire dataset.
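that smoke-test can be sketched with plain pandas (the DataFrame and column names below are made up for illustration; substitute your real training data):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the real training data; column names are hypothetical
df_train = pd.DataFrame({
    "feature": np.random.rand(10_000),
    "label": np.random.randint(0, 2, size=10_000),
})

# smoke-test the pipeline on ~1% of the rows before a full training run
df_small = df_train.sample(frac=0.01, random_state=42)
print(len(df_small))  # 100
```

once training and predict_proba work end-to-end on the small sample, rerun on the full frame.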

You might also fix this issue by upgrading all of your libraries with pip install --upgrade auto_ml and pip install --upgrade lightgbm.

If that doesn't fix it, could you please also include the output of pip freeze?

vkocaman commented 7 years ago

here is the complete error message


TypeError                                 Traceback (most recent call last)
/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     56     try:
---> 57         return getattr(obj, method)(*args, **kwds)
     58

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-…> in <module>()
----> 1 probs = ml_predictor.predict_proba(df_test)
      2 print("probabilities:", probs)

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/auto_ml/predictor.py in predict_proba(self, prediction_data)
   1638         prediction_data = prediction_data.copy()
   1639
-> 1640         return self.trained_pipeline.predict_proba(prediction_data)
   1641
   1642

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    113
    114     # lambda, but not partial, allows help() to work with update_wrapper
--> 115     out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    116     # update the docstring of the returned function
    117     update_wrapper(out, self.fn)

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/pipeline.py in predict_proba(self, X)
    363         for name, transform in self.steps[:-1]:
    364             if transform is not None:
--> 365                 Xt = transform.transform(Xt)
    366         return self.steps[-1][-1].predict_proba(Xt)
    367

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/auto_ml/DataFrameVectorizer.py in transform(self, X, y)
    175
    176     def transform(self, X, y=None):
--> 177         return self._transform(X)
    178
    179     def get_feature_names(self):

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/auto_ml/DataFrameVectorizer.py in _transform(self, X)
    145                 val = '_None'
    146
--> 147             val = self.get('label_encoders')[f].transform([val])
    148
    149             # Only include this in our output if it was part of our training data. Silently ignore it otherwise.

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/auto_ml/utils.py in transform(self, y)
    182         diff = np.setdiff1d(classes, self.classes_)
    183         self.classes_ = np.hstack((self.classes_, diff))
--> 184         return np.searchsorted(self.classes_, y)
    185
    186 class ExtendedPipeline(Pipeline):

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py in searchsorted(a, v, side, sorter)
   1073
   1074     """
-> 1075     return _wrapfunc(a, 'searchsorted', v, side=side, sorter=sorter)
   1076
   1077

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
     65     # a downstream library like 'pandas'.
     66     except (AttributeError, TypeError):
---> 67         return _wrapit(obj, method, *args, **kwds)
     68
     69

/Users/vkocaman/anaconda/envs/py36/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
     45     except AttributeError:
     46         wrap = None
---> 47     result = getattr(asarray(obj), method)(*args, **kwds)
     48     if wrap:
     49         if not isinstance(result, mu.ndarray):

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
vkocaman commented 7 years ago

I already made sure that all packages, including auto_ml, are up to date.. and here is the pip freeze output:

alabaster==0.7.10 anaconda-client==1.6.3 anaconda-navigator==1.6.2 anaconda-project==0.6.0 appnope==0.1.0 appscript==1.0.1 asn1crypto==0.22.0 astroid==1.4.9 astropy==1.3.2 auto-ml==2.7.6 auto-sklearn==0.2.1 Babel==2.4.0 backports.shutil-get-terminal-size==1.0.0 beautifulsoup4==4.6.0 bitarray==0.8.1 blaze==0.10.1 bleach==1.5.0 bokeh==0.12.5 boto==2.46.1 Bottleneck==1.2.1 cffi==1.10.0 chardet==3.0.3 click==6.7 cloudpickle==0.2.2 clyent==1.2.2 colorama==0.3.9 ConfigSpace==0.3.10 contextlib2==0.5.5 cryptography==1.8.1 cycler==0.10.0 Cython==0.27.1 cytoolz==0.8.2 dask==0.14.3 datashape==0.5.4 deap==1.0.2 decorator==4.0.11 dill==0.2.7.1 distributed==1.16.3 docutils==0.13.1 entrypoints==0.2.2 et-xmlfile==1.0.1 fastcache==1.0.2 Flask==0.12.2 Flask-Cors==3.0.2 gevent==1.2.1 greenlet==0.4.12 h5py==2.7.1 HeapDict==1.0.0 html5lib==0.9999999 idna==2.5 imagesize==0.7.1 ipykernel==4.6.1 ipython==5.3.0 ipython-genutils==0.2.0 ipywidgets==6.0.0 isort==4.2.5 itsdangerous==0.24 jdcal==1.3 jedi==0.10.2 Jinja2==2.9.6 joblib==0.11 jsonschema==2.6.0 jupyter==1.0.0 jupyter-client==5.0.1 jupyter-console==5.1.0 jupyter-core==4.3.0 Keras==2.0.8 lazy-object-proxy==1.2.2 liac-arff==2.1.1 lightgbm==2.0.7 llvmlite==0.18.0 locket==0.2.0 lockfile==0.12.2 lxml==3.7.3 Markdown==2.6.9 MarkupSafe==0.23 matplotlib==2.0.2 mistune==0.7.4 mpmath==0.19 msgpack-python==0.4.8 multipledispatch==0.4.9 multiprocess==0.70.5 navigator-updater==0.1.0 nbconvert==5.1.1 nbformat==4.3.0 networkx==1.11 nltk==3.2.3 nose==1.3.7 notebook==5.0.0 numba==0.33.0 numexpr==2.6.2 numpy==1.13.3 numpydoc==0.6.0 odo==0.5.0 olefile==0.44 openpyxl==2.4.7 packaging==16.8 pandas==0.20.3 pandocfilters==1.4.1 partd==0.3.8 pathlib2==2.2.1 pathos==0.2.1 patsy==0.4.1 pep8==1.7.0 pexpect==4.2.1 pickleshare==0.7.4 Pillow==4.1.1 ply==3.10 pox==0.2.3 ppft==1.6.4.7.1 prompt-toolkit==1.0.14 protobuf==3.4.0 psutil==5.3.1 ptyprocess==0.5.1 py==1.4.33 pycosat==0.6.2 pycparser==2.17 pycrypto==2.6.1 pycurl==7.43.0 pyflakes==1.5.0 Pygments==2.2.0 
pylint==1.6.4 pynisher==0.4.2 pyodbc==4.0.16 pyOpenSSL==17.0.0 pyparsing==2.1.4 pytest==3.0.7 python-dateutil==2.6.1 pytz==2017.2 PyWavelets==0.5.2 PyYAML==3.12 pyzmq==16.0.2 QtAwesome==0.4.4 qtconsole==4.3.0 QtPy==1.2.1 requests==2.14.2 rope-py3k==0.9.4.post1 scikit-image==0.13.0 scikit-learn==0.19.0 scikit-MDR==0.4.4 scipy==0.19.1 seaborn==0.7.1 simplegeneric==0.8.1 singledispatch==3.4.0.3 six==1.11.0 sklearn==0.0 sklearn-deap2==0.2.1 skrebate==0.3.4 smac==0.6.0 snowballstemmer==1.2.1 sortedcollections==0.5.3 sortedcontainers==1.5.7 Sphinx==1.5.6 sphinx-rtd-theme==0.2.4 spyder==3.1.4 SQLAlchemy==1.1.9 statsmodels==0.8.0 stopit==1.1.1 sympy==1.0 tables==3.3.0 tabulate==0.8.1 tblib==1.3.2 tensorflow==1.3.0 tensorflow-tensorboard==0.1.8 terminado==0.6 testpath==0.3 toolz==0.8.2 tornado==4.5.1 TPOT==0.9.0 tqdm==4.19.1.post1 traitlets==4.3.2 typing==3.6.2 unicodecsv==0.14.1 update-checker==0.16 wcwidth==0.1.7 Werkzeug==0.12.2 widgetsnbextension==2.0.0 wrapt==1.10.10 xlrd==1.0.0 XlsxWriter==0.9.6 xlwings==0.10.4 xlwt==1.2.0 zict==0.1.2

vkocaman commented 7 years ago

Dear Preston,

To avoid training the whole model again, is there any way to pass the best LightGBM parameters into the model directly, so that I don't need to re-run the final optimization?

vkocaman commented 7 years ago

btw, after making sure that all packages are up to date, I trained on just 1% of the training set.. but no change..

ClimbsRocks commented 7 years ago

the stack trace helps a lot, thanks!

it looks like you're probably feeding in a column of dtype float64 as a categorical column. when we try to cast those floats to the label encoder's 32-character string dtype ('<U32'), numpy refuses, because converting floats to fixed-width strings isn't a 'safe' cast.

my guess is you're probably feeding in some column like user_id or order_id as a categorical column. that would also explain why it takes a while to train. these columns should almost always be ignored, not used as categorical values.

i'll release a patch to handle this later tonight probably, but in the meantime, you can probably handle this yourself by just ignoring any categorical columns that are of dtype float64. or, convert those to a string yourself beforehand, and see if that handles it.
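a rough sketch of that workaround (the frame and column names here are invented for illustration; "city_id" stands in for a categorical column that ended up as float64, which a join or missing values will often cause):

```python
import pandas as pd

# invented example frame with a float64 categorical column
df = pd.DataFrame({"city_id": [7.0, 12.0, 7.0], "price": [10.5, 20.0, 15.2]})

categorical_columns = ["city_id"]  # whatever you marked in column_descriptions
for col in categorical_columns:
    if df[col].dtype == "float64":
        df[col] = df[col].astype(str)  # stringify before training/predicting

print(df["city_id"].tolist())  # ['7.0', '12.0', '7.0']
```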

vkocaman commented 7 years ago

Thanks.. Even though I specified the categorical columns in the column descriptions, the error still occurred.. but I changed the float64 columns to int and predict_proba runs now.. the problem is it only returns ones and zeros, which is not what I need.. anyway, thank you again..

ClimbsRocks commented 7 years ago

yeah, that's because lightgbm released a breaking update, without any deprecation warnings. you can use the previous version of lightgbm (v2.0.6), or i'll have a new release ready later tonight that fixes it too.
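for reference, pinning the older lightgbm looks like this (version number taken from the comment above):

```shell
pip install lightgbm==2.0.6
```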

thanks for filing all the issues!

ClimbsRocks commented 7 years ago

alright, should be all handled in the latest release (v2.7.7).

if you run into lightgbm issues, and you're on the latest version, let me know. it's running fine for a couple projects i'm working on, and in the test suite, but i'm always open to learning how other people use things.

sharpe5 commented 7 years ago

Thanks, nice work!