ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License

Task failing to unserialize #229

Closed jonlee112 closed 2 years ago

jonlee112 commented 2 years ago

Wondering if anyone can help with this issue:

When I run code using top2vec (deep-learn) on a dataset containing ~40,000 tweets (all duplicate tweets removed), it works fine.

However, when I use that same code on a dataset containing ~140,000 tweets (duplicates not removed), I keep getting this error:

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
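For reference, this is roughly the code I'm running (a minimal sketch; the placeholder list just stands in for my actual tweet data):

    from top2vec import Top2Vec

    # Placeholder data: in the failing run, list_of_tweets holds ~140,000 tweet strings
    # (the ~40,000-tweet deduplicated dataset trains fine with the same code).
    list_of_tweets = ["example tweet text", "another example tweet"]  # far too few to actually train

    model = Top2Vec(documents=list_of_tweets, speed="deep-learn")
    model.save("model_donttrust_cdc_deep_woduplicates")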

The output is as follows:

2021-12-13 06:53:50,024 - top2vec - INFO - Pre-processing documents for training
2021-12-13 06:54:06,047 - top2vec - INFO - Creating joint document/word embedding
2021-12-13 08:54:26,751 - top2vec - INFO - Creating lower dimension embedding of documents
/usr/local/lib/python3.7/dist-packages/numba/np/ufunc/parallel.py:363: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 9107. The TBB threading layer is disabled.
  warnings.warn(problem)
2021-12-13 08:58:33,953 - top2vec - INFO - Finding dense areas of documents

_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

BrokenProcessPool Traceback (most recent call last)

in ()
----> 1 model = Top2Vec(documents=list_of_tweets, speed="deep-learn")
      2 model.save("model_donttrust_cdc_deep_woduplicates")

9 frames
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
/usr/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
ddangelov commented 2 years ago

What environment is this in? This looks like hdbscan failing.

jonlee112 commented 2 years ago

Thanks so much for your response!

This occurred on Python 3.6 and 3.8 with the following packages.

Package Version
---------------
alembic 1.7.5
anyio 2.2.0
argon2-cffi 20.1.0
async-generator 1.10
attrs 21.2.0
autovizwidget 0.18.0
Babel 2.9.1
backcall 0.2.0
backports.entry-points-selectable 1.1.1
bleach 4.0.0
blinker 1.4
Bottleneck 1.3.2
brotlipy 0.7.0
certifi 2021.10.8
certipy 0.1.3
cffi 1.15.0
charset-normalizer 2.0.4
click 8.0.3
click-config-file 0.6.0
click-plugins 1.1.1
cloudpickle 2.0.0
colorama 0.4.4
configobj 5.0.6
cryptography 3.4.8
cycler 0.11.0
Cython 0.29.14
debugpy 1.5.1
decorator 5.1.0
defusedxml 0.7.1
deprecation 2.1.0
distlib 0.3.4
entrypoints 0.3
filelock 3.4.0
fonttools 4.25.0
gensim 3.8.3
greenlet 1.1.1
hdbscan 0.8.27
hdijupyterutils 0.19.0
humanize 3.13.1
idna 3.3
importlib-metadata 4.8.2
importlib-resources 5.2.0
ipykernel 6.4.1
ipympl 0.7.0
ipyparallel 6.3.0
ipython 7.29.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
jedi 0.18.0
Jinja2 2.11.3
joblib 1.1.0
json5 0.9.6
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 7.0.6
jupyter-console 6.4.0
jupyter-core 4.9.1
jupyter-dashboards-bundlers 0.9.1
jupyter-kernel-gateway 2.5.0
jupyter-packaging 0.10.4
jupyter-server 1.4.1
jupyter-telemetry 0.1.0
jupyterhub 1.4.2
jupyterhub-ldapauthenticator 1.3.2
jupyterlab 3.2.1
jupyterlab-launcher 0.13.1
jupyterlab-pygments 0.1.2
jupyterlab-server 2.8.2
jupyterlab-widgets 1.0.0
kiwisolver 1.3.1
ldap3 2.9.1
llvmlite 0.37.0
Mako 1.1.4
MarkupSafe 2.0.1
matplotlib 3.5.0
matplotlib-inline 0.1.2
metakernel 0.27.5
mistune 0.8.4
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mock 4.0.3
more-itertools 8.12.0
munkres 1.1.4
nb-conda 2.2.1
nb-conda-kernels 2.3.1
nbclassic 0.2.6
nbclient 0.5.3
nbconvert 6.1.0
nbformat 5.1.3
nbserverproxy 0.8.8
nest-asyncio 1.5.1
nose 1.3.7
notebook 6.4.6
numba 0.54.1
numexpr 2.7.3
numpy 1.20.0
oauthlib 3.1.1
olefile 0.46
packaging 21.3
pandas 1.3.4
pandocfilters 1.4.3
parso 0.8.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.4.0
pip 21.2.2
pivottablejs 0.9.0
platformdirs 2.4.0
plotly 4.14.3
portalocker 2.3.0
prometheus-client 0.12.0
prompt-toolkit 3.0.20
psutil 5.8.0
ptyprocess 0.7.0
pyasn1 0.4.8
pycparser 2.21
pycurl 7.44.1
Pygments 2.10.0
PyJWT 2.1.0
pynndescent 0.5.5
pyOpenSSL 21.0.0
pyparsing 3.0.4
pyrsistent 0.18.0
PySocks 1.7.1
python-dateutil 2.8.2
python-json-logger 2.0.1
pytz 2021.3
pywin32 228
pywinpty 0.5.7
pyzmq 22.3.0
qgrid 1.3.1
qtconsole 5.1.1
QtPy 1.10.0
requests 2.26.0
requests-kerberos 0.12.0
requests-oauthlib 1.3.0
retrying 1.3.3
rise 5.7.1
ruamel.yaml 0.16.12
ruamel.yaml.clib 0.2.2
scikit-learn 1.0.1
scipy 1.7.3
Send2Trash 1.8.0
setuptools 58.0.4
sip 4.19.13
six 1.16.0
smart-open 5.2.1
sniffio 1.2.0
sparkmagic 0.19.0
spyder-kernels 2.0.5
SQLAlchemy 1.4.27
terminado 0.9.4
testpath 0.5.0
threadpoolctl 3.0.0
tomlkit 0.7.2
top2vec 1.0.26
tornado 6.1
tqdm 4.62.3
traitlets 5.1.1
twarc 2.8.2
twarc-csv 0.5.1
umap-learn 0.5.2
urllib3 1.26.7
virtualenv 20.10.0
wcwidth 0.2.5
webencodings 0.5.1
wheel 0.37.0
widgetsnbextension 3.5.1
win-inet-pton 1.1.0
wincertstore 0.2
winkerberos 0.7.0
wordcloud 1.8.1
zipp 3.6.0

ddangelov commented 2 years ago

Are you on Windows? If so, have a look at this: https://github.com/scikit-learn-contrib/hdbscan/issues/22
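If you are running this from a plain script on Windows, one common pattern worth trying (a sketch, not necessarily the exact fix discussed in that thread) is to keep the training call behind the main-module guard:

    from top2vec import Top2Vec

    def main():
        # placeholder data; in practice this would be the full tweet list
        list_of_tweets = ["example tweet text"] * 1000
        model = Top2Vec(documents=list_of_tweets, speed="deep-learn")
        model.save("model_donttrust_cdc_deep_woduplicates")

    if __name__ == "__main__":
        # On Windows, worker processes start by re-importing this module, so keeping
        # the training call behind this guard prevents it from re-running in each worker.
        main()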

jonlee112 commented 2 years ago

I've tried on my own computer (Windows) as well as on Google Colab (I'm not sure what environment that uses).

As a relatively inexperienced coder, I also think this part of the joblib package may be related/helpful: the "Serialization of un-picklable objects" section of their docs, which was last updated in Nov 2021.

That section seems to suggest that joblib's default behavior can sometimes run into trouble with large Python objects (e.g., a large list or dict). As a solution it offers some additional tools, like the wrap_non_picklable_objects() wrapper function, which might solve this issue (toy usage sketched below), though I'm not sure how I would go about digging into the code to try it.

Perhaps that section gives you something simple for adjusting how your code uses joblib when un-serialization fails?
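For illustration, here's roughly the toy usage their docs show for that wrapper (a standalone sketch, nothing to do with Top2Vec's internals, and I haven't tried wiring it in anywhere):

    from joblib import Parallel, delayed, wrap_non_picklable_objects

    @delayed
    @wrap_non_picklable_objects
    def double(i):
        # trivial work, just to show where the wrapper goes
        return 2 * i

    # run a few tasks through joblib's process-based backend
    print(Parallel(n_jobs=2)(double(i) for i in range(4)))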

jonlee112 commented 2 years ago

Also found this: "Serialization error when using parallelism in cross_val_score with GridSearchCV and a custom estimator", where I found the following comments:

" I also have this problem using: Python 3.7.1 scikit-learn 0.20.1

but n_jobs=1 helps "

"using n_jobs = 1 , solved the error on windows 10 ,python 3.5"

"you need to install the joblib package. If that error still happens, set n_jobs = None."

" Having same issues with sklearn v0.20.2 Resolved this by updating to v0.22.2

Hope it helps. "

I tried using an older version of scikit-learn and it actually did work: pip install scikit-learn==0.24.0

So for now at least, it appears this fixed it.
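In case it helps anyone else, here's a quick way to double-check which versions end up installed after the downgrade (standard library only, Python 3.8+; just a verification snippet, not part of the fix):

    # Quick check of the installed versions relevant to this issue (Python 3.8+).
    from importlib.metadata import version

    for pkg in ("scikit-learn", "hdbscan", "joblib", "top2vec"):
        print(pkg, version(pkg))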