MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.87k stars, 729 forks

Can't run BERTopic with Python 3.9.5 - 64 Bits #206

Closed doubianimehdi closed 2 years ago

doubianimehdi commented 2 years ago

Hi,

Whenever I try to run this code:

from bertopic import BERTopic

topic_model = BERTopic(verbose=True, embedding_model="paraphrase-TinyBERT-L6-v2", min_topic_size=25)
topics, _ = topic_model.fit_transform(df_reduced['abstract']); len(topic_model.get_topic_info())

I get the following error: KeyError: 71947

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 71947

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5924/2689151670.py in <module>
      2
      3 topic_model = BERTopic(verbose=True, embedding_model="paraphrase-TinyBERT-L6-v2", min_topic_size=25)
----> 4 topics, _ = topic_model.fit_transform(df_reduced['abstract']); len(topic_model.get_topic_info())

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\_bertopic.py in fit_transform(self, documents, embeddings, y)
    274             self.embedding_model = select_backend(self.embedding_model,
    275                                                   language=self.language)
--> 276             embeddings = self._extract_embeddings(documents.Document,
    277                                                   method="document",
    278                                                   verbose=self.verbose)

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\_bertopic.py in _extract_embeddings(self, documents, method, verbose)
   1322             embeddings = self.embedding_model.embed_words(documents, verbose)
   1323         elif method == "document":
-> 1324             embeddings = self.embedding_model.embed_documents(documents, verbose)
   1325         else:
   1326             raise ValueError("Wrong method for extracting document/word embeddings. "

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\backend\_base.py in embed_documents(self, document, verbose)
     67             that each have an embeddings size of m
     68         """
---> 69         return self.embed(document, verbose)

~\AppData\Local\Programs\Python\Python39\lib\site-packages\bertopic\backend\_sentencetransformers.py in embed(self, documents, verbose)
     61             that each have an embeddings size of m
     62         """
---> 63         embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
     64         return embeddings

~\AppData\Local\Programs\Python\Python39\lib\site-packages\sentence_transformers\SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, device, normalize_embeddings)
    150         all_embeddings = []
    151         length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
--> 152         sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
    153
    154         for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\sentence_transformers\SentenceTransformer.py in <listcomp>(.0)
    150         all_embeddings = []
    151         length_sorted_idx = np.argsort([-self._text_length(sen) for sen in sentences])
--> 152         sentences_sorted = [sentences[idx] for idx in length_sorted_idx]
    153
    154         for start_index in trange(0, len(sentences), batch_size, desc="Batches", disable=not show_progress_bar):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    940
    941         elif key_is_scalar:
--> 942             return self._get_value(key)
    943
    944         if is_hashable(key):

~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
   1049
   1050         # Similar to Index.get_value, but we do not fall back to positional
-> 1051         loc = self.index.get_loc(label)
   1052         return self.index._get_values_for_loc(self, loc, label)
   1053

~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 71947

I'm running 64-bit Python 3.9.5, and my pip list is below:

WARNING: Ignoring invalid distribution -n-core-web-sm (c:\users\doub2420\appdata\local\programs\python\python39\lib\site-packages) Package Version


absl-py 0.13.0 aiohttp 3.7.4.post0
alembic 1.4.1 altair 4.1.0 anyascii 0.2.0 anyio 3.2.1 appdirs 1.4.4 argon2-cffi 20.1.0 astor 0.8.1 astunparse 1.6.3 async-generator 1.10 async-timeout 3.0.1 attrs 21.2.0 Babel 2.9.1 backcall 0.2.0 base58 2.1.0 beautifulsoup4 4.9.3 bertopic 0.9.0 bleach 3.3.0 blinker 1.4 blis 0.7.4 blosc 1.10.4 bokeh 2.3.2 Boruta 0.3 Bottleneck 1.3.2 CacheControl 0.12.6 cachetools 4.2.2 cachy 0.3.0 catalogue 2.0.5 catboost 0.26 certifi 2021.5.30 cffi 1.14.5 chardet 3.0.4 cleo 0.8.1 click 7.1.2 clikit 0.6.2 clock 0.1 cloudpickle 1.6.0 colorama 0.4.4 colorlover 0.3.0 contractions 0.0.52 cramjam 2.3.2 crashtest 0.3.1 cryptography 3.4.7 cufflinks 0.17.3 curses 2.2.1+utf8 cycler 0.10.0 cymem 2.0.5 Cython 0.29.24 dask 2021.6.2 dask-labextension 5.0.2 data-dashboard 0.1.1 databricks-cli 0.14.3 dataclasses 0.6 debugpy 1.4.1 decorator 4.4.2 defusedxml 0.7.1 distlib 0.3.2 distributed 2021.6.2 docker 5.0.0 docopt 0.6.2 docx2txt 0.8 en-core-web-sm 3.1.0 entrypoints 0.3 evidently 0.1.22.dev0 faiss-cpu 1.7.1.post2 filelock 3.0.12 flake8 3.9.2 FLAML 0.5.9 Flask 2.0.1 flatbuffers 1.12 fsspec 2021.6.1 funcy 1.16 future 0.18.2 gast 0.4.0 gensim 3.8.3 gitdb 4.0.7 GitPython 3.1.18 google-auth 1.32.0 google-auth-oauthlib 0.4.4 google-pasta 0.2.0 graphviz 0.16 greenlet 1.1.0 grpcio 1.34.1 h11 0.9.0 h2 3.2.0 h5py 3.1.0 hdbscan 0.8.27 HeapDict 1.0.1 hpack 3.0.0 hstspreload 2021.7.5 html5lib 1.1 htmlmin 0.1.12 httpcore 0.9.1 huggingface-hub 0.0.12 hyperframe 5.2.0 idna 2.10 ImageHash 4.2.1 imbalanced-learn 0.7.0 importlib-resources 5.2.0 inflect 5.3.0 ipykernel 6.0.3 ipython 7.24.1 ipython-genutils 0.2.0 ipywidgets 7.6.3 itsdangerous 2.0.1 jedi 0.18.0 jellyfish 0.8.2 Jinja2 3.0.1 joblib 1.0.1 Js2Py 0.71 json5 0.9.6 jsonschema 3.2.0 jupyter-client 6.1.12 jupyter-core 4.7.1 jupyter-server 1.9.0 jupyter-server-proxy 3.1.0 jupyterlab 3.0.16 jupyterlab-pygments 0.1.2 jupyterlab-server 2.6.0 jupyterlab-widgets 1.0.0 keras-nightly 2.5.0.dev2021032900 Keras-Preprocessing 1.1.2 keyring 21.8.0 kiwisolver 1.3.1 kmodes 
0.11.0 lckr-jupyterlab-variableinspector 3.0.9 lightgbm 3.2.1 littleutils 0.2.2 llvmlite 0.36.0 locket 0.2.1 lockfile 0.12.2 lxml 4.6.3 Mako 1.1.4 Markdown 3.3.4 MarkupSafe 2.0.1 matplotlib 3.4.2 matplotlib-inline 0.1.2 mccabe 0.6.1 missingno 0.5.0 mistune 0.8.4 mlflow 1.19.0 mlxtend 0.18.0 msgpack 1.0.2 multidict 5.1.0 multimethod 1.4 murmurhash 1.0.5 mysql 0.0.3 mysql-connector-python 8.0.25 mysqlclient 2.0.3 nbclassic 0.3.1 nbclient 0.5.3 nbconvert 6.1.0 nbformat 5.1.3 nest-asyncio 1.5.1 networkx 2.5.1 nltk 3.6.2 notebook 6.4.0 numba 0.53.1 numexpr 2.7.3 numpy 1.20.3 oauthlib 3.1.1 opt-einsum 3.3.0 outdated 0.2.1 packaging 20.9 pandas 1.3.1 pandas-flavor 0.2.0 pandas-profiling 3.0.0 pandas-read-xml 0.3.1 pandasql 0.7.3 pandocfilters 1.4.3 parso 0.8.2 partd 1.2.0 pastel 0.2.1 pathy 0.6.0 patsy 0.5.1 pdfminer.six 20201018 pexpect 4.8.0 phik 0.11.2 pickleshare 0.7.5 Pillow 8.2.0 pingouin 0.4.0 pip 21.2.4 pipenv 2021.5.29 pipenv-pipes 0.7.1 pipwin 0.5.1 pkginfo 1.7.0 plac 1.1.3 plotly 4.14.2 poetry 1.1.7 poetry-core 1.0.3 preshed 3.0.5 prometheus-client 0.11.0 prometheus-flask-exporter 0.18.2 prompt-toolkit 3.0.19 protobuf 3.17.3 psutil 5.8.0 ptyprocess 0.7.0 pyahocorasick 1.4.2 pyarrow 4.0.1 pyasn1 0.4.8 pyasn1-modules 0.2.8 pybind11 2.6.1 pycaret 2.3.2 pycodestyle 2.7.0 pycparser 2.20 pycryptodome 3.10.1 pydantic 1.8.2 pydash 5.0.2 pydeck 0.6.2 pyflakes 2.3.1 Pygments 2.9.0 pyjsparser 2.7.1 pyLDAvis 3.2.2 pylev 1.4.0 pynndescent 0.5.4 pyod 0.9.0 pyparsing 2.4.7 PyPrind 2.11.3 pyresparser 1.0.6 pyrsistent 0.17.3 pysbd 0.3.4 pySmartDL 1.3.4 python-dateutil 2.8.1 python-editor 1.0.4 pytz 2021.1 PyWavelets 1.1.1 pywedge 0.5.1.8 pywin32 227 pywin32-ctypes 0.2.0 pywinpty 1.1.3 PyYAML 5.4.1 pyzmq 22.1.0 querystring-parser 1.2.4 regex 2021.7.6 requests 2.25.1 requests-oauthlib 1.3.0 requests-toolbelt 0.9.1 requests-unixsocket 0.2.0 retrying 1.3.3 rfc3986 1.5.0 rsa 4.7.2 sacremoses 0.0.45 scikit-learn 0.23.2 scikit-plot 0.3.7 scipy 1.7.0 seaborn 0.11.1 segtok 1.5.10 
Send2Trash 1.7.1 sentence-transformers 2.0.0 sentencepiece 0.1.96 setuptools 57.4.0 sgmllib3k 1.0.0 shellingham 1.4.0 simpervisor 0.4 simplified-scrapy 1.5.164 six 1.15.0 sklearn 0.0 smart-open 5.1.0 smmap 4.0.0 sniffio 1.2.0 sortedcontainers 2.4.0 soupsieve 2.2.1 spacy 3.1.1 spacy-legacy 3.0.8 SQLAlchemy 1.4.20 sqlparse 0.4.1 srsly 2.4.1 statsmodels 0.12.2 streamlit 0.83.0 sweetviz 2.1.2 tabulate 0.8.9 tangled-up-in-unicode 0.1.0 tblib 1.7.0 tenacity 7.0.0 tensorboard 2.5.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.0 tensorflow 2.5.0 tensorflow-estimator 2.5.0 termcolor 1.1.0 terminado 0.10.1 testpath 0.5.0 textblob 0.15.3 textsearch 0.0.21 thinc 8.0.8 threadpoolctl 2.1.0 thrift 0.13.0 tika 1.24 tokenizers 0.10.3 toml 0.10.2 tomlkit 0.7.2 toolz 0.11.1 torch 1.9.0 torchvision 0.10.0 tornado 6.1 tqdm 4.61.0 traitlets 5.0.5 transformers 4.9.2 typed-ast 1.4.3 typer 0.3.2 typing-extensions 3.7.4.3 tzlocal 2.1 umap-learn 0.5.1 urllib3 1.26.6 validators 0.18.2 virtualenv 20.4.7 virtualenv-clone 0.5.6 visions 0.7.1 waitress 2.0.0 wasabi 0.8.2 watchdog 2.1.2 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 1.1.0 Werkzeug 2.0.1 wheel 0.37.0 widgetsnbextension 3.5.1 wordcloud 1.8.1 wos-parser 0.1.dev0 wrapt 1.12.1 xarray 0.19.0 xmltodict 0.12.0 yake 0.4.8 yarl 1.6.3 yellowbrick 1.3.post1 zict 2.0.0 zipfile36 0.1.3 zipp 3.5.0 WARNING: Ignoring invalid distribution -n-core-web-sm (c:\users\doub2420\appdata\local\programs\python\python39\lib\site-packages) WARNING: Ignoring invalid distribution -n-core-web-sm (c:\users\doub2420\appdata\local\programs\python\python39\lib\site-packages)

I must be missing something... Thanks for your help!

MaartenGr commented 2 years ago

Ah, I think you should pass df_reduced['abstract'] as df_reduced['abstract'].tolist() since BERTopic only accepts a list of strings. Let me know if it works out!
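
For readers wondering why the conversion matters: the traceback shows sentence-transformers looking documents up by position (`sentences[idx]`), while a pandas Series with an integer index resolves `series[idx]` by label. If rows have been dropped from the DataFrame (the name `df_reduced` hints at this, but that is an assumption), some positions no longer exist as labels and the lookup raises KeyError. Below is a minimal stdlib-only sketch of that mismatch. `LabelIndexed` is a hypothetical stand-in for a Series, not pandas code, and `positional_lookup` only imitates the access pattern from the traceback.

```python
class LabelIndexed:
    """Hypothetical stand-in for a pandas Series with a non-contiguous integer index."""

    def __init__(self, labels, values):
        self._data = dict(zip(labels, values))

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        # Iteration yields the values, as it does for a Series.
        return iter(self._data.values())

    def __getitem__(self, label):
        # Lookup is by label, not by position, so a missing label raises KeyError.
        return self._data[label]


def positional_lookup(sentences):
    # Imitates the access pattern in the traceback: sentences[idx] for positions 0..n-1.
    return [sentences[idx] for idx in range(len(sentences))]


# Labels 0 and 2 are missing, as if those rows had been filtered out.
docs = LabelIndexed(
    labels=[1, 3, 4],
    values=["first abstract", "second abstract", "third abstract"],
)

try:
    positional_lookup(docs)
    failed_with = None
except KeyError as err:
    failed_with = err.args[0]  # the position that has no matching label

# Converting to a plain list (analogous to df_reduced['abstract'].tolist())
# makes positions and indices agree again.
as_list = list(docs)
recovered = positional_lookup(as_list)
print(failed_with, recovered)
```

With a real Series, `df_reduced['abstract'].tolist()` plays the role of `list(docs)` here.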

doubianimehdi commented 2 years ago

Indeed, that was it; I also figured it out this morning. Thank you! Maybe the error message could be made more explicit in that case? Thanks again!
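
As an illustration of what a more explicit error could look like, here is a small sketch of an input check that could be run before calling `fit_transform`. `ensure_list_of_strings` is a hypothetical helper written for this thread, not part of BERTopic, and the wording of the messages is only a suggestion.

```python
def ensure_list_of_strings(documents):
    """Hypothetical pre-check: return documents unchanged if it is a list of str, else raise."""
    if not isinstance(documents, list):
        raise TypeError(
            f"Expected a list of strings, got {type(documents).__name__}. "
            "If this is a pandas Series, pass series.tolist() instead."
        )
    for position, doc in enumerate(documents):
        if not isinstance(doc, str):
            raise TypeError(
                f"Document at position {position} is {type(doc).__name__}, not str."
            )
    return documents


# A valid input passes through untouched.
checked = ensure_list_of_strings(["a short abstract", "another abstract"])

# Anything that is not a list is rejected with a message naming the actual type.
try:
    ensure_list_of_strings(("a tuple", "is not a list"))
    message = ""
except TypeError as err:
    message = str(err)
print(message)
```

Calling the helper on the Series from the original report would fail immediately with a readable TypeError instead of a KeyError deep inside pandas.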