MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Guided Modeling: Problem with seed_topic_list #1991

Open HeinzJS opened 1 month ago

HeinzJS commented 1 month ago

Hello,

I've been having problems with the guided topic modeling approach. The error I receive is:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
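
For reference, NumPy raises this kind of error when it is asked to build a regular numeric array from nested sequences of unequal length. A minimal sketch that is independent of BERTopic (the `ragged` and `regular` values are made up for illustration, and this thread does not establish where such input arises in the guided-modeling path):

```python
import numpy as np

# Hypothetical data: two "rows" with different lengths.
ragged = [[0.1, 0.2, 0.3], [0.4, 0.5]]

message = None
try:
    # A float array needs every row to have the same length.
    np.array(ragged, dtype=float)
except ValueError as err:
    message = str(err)
    print(message)

# Rows of equal length convert without complaint.
regular = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=float)
print(regular.shape)
```

The exact wording of the message can differ between NumPy releases, so comparing the printed text with the traceback is only a rough check.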

The relevant part of my code is as follows:

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, PartOfSpeech, MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer

# Guided model: each inner list holds the seed words for one topic
seed_topic_list = [['hacking', 'hackers', 'hacked', 'lost', 'account'],
                   ['data', 'leak', 'permissions', 'unauthorised', 'privacy'],
                   ['bugs', 'crash', 'ddos', 'server', 'virus'],
                   ['username', 'password', 'name', 'credit', 'email'],
                   ['oculus', 'htc', 'windows', 'mac', 'meta']]

# One main representation plus two additional aspects
g_main_representation_model = KeyBERTInspired()
g_aspect_representation_model1 = PartOfSpeech("en_core_web_sm")
g_aspect_representation_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]

g_representation_model = {
    "Main": g_main_representation_model,
    "Aspect1": g_aspect_representation_model1,
    "Aspect2": g_aspect_representation_model2,
}

g_vectorizer_model = CountVectorizer(min_df=5, stop_words='english')
g_topic_mdl_rec = BERTopic(nr_topics='auto', vectorizer_model=g_vectorizer_model,
                           representation_model=g_representation_model,
                           seed_topic_list=seed_topic_list)
# rec_room_reviews is the list of review texts, defined elsewhere
g_topics_rec, g_ini_probs_rec = g_topic_mdl_rec.fit_transform(rec_room_reviews)

The solutions I have tried:

I can't find any other resources online about this, so I thought I would open a new issue.

Here is a list of my current packages:

Package                   Version
------------------------- -----------
aiohttp                   3.9.3
aiosignal                 1.3.1
annotated-types           0.6.0
anyio                     4.3.0
appdirs                   1.4.4
asn1crypto                1.5.1
asttokens                 2.4.1
async-timeout             4.0.3
attrs                     23.2.0
Automat                   22.10.0
bcrypt                    4.1.2
beautifulsoup4            4.12.3
bertopic                  0.16.1
blis                      0.7.11
botocore                  1.34.54
catalogue                 2.0.10
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
cloudpathlib              0.16.0
comm                      0.2.2
confection                0.1.4
constantly                23.10.4
contourpy                 1.2.1
cryptography              42.0.5
cssselect                 1.2.0
cycler                    0.12.1
cymem                     2.0.8
Cython                    0.29.37
debugpy                   1.8.1
decorator                 5.1.1
deep-translator           1.11.4
distro                    1.9.0
docutils                  0.20.1
en-core-web-sm            3.7.1
et-xmlfile                1.1.0
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.19.1
filelock                  3.13.1
fonttools                 4.51.0
frozenlist                1.4.1
fsspec                    2024.3.1
gensim                    4.3.2
h11                       0.14.0
hdbscan                   0.8.33
httpcore                  1.0.4
httpx                     0.27.0
huggingface-hub           0.23.0
hyperlink                 21.0.0
idna                      3.6
incremental               22.10.0
ipykernel                 6.29.4
ipympl                    0.9.4
ipython                   8.23.0
ipython-genutils          0.2.0
ipywidgets                8.1.2
itemadapter               0.8.0
itemloaders               1.1.0
jedi                      0.19.1
Jinja2                    3.1.3
jmespath                  1.0.1
joblib                    1.4.2
jsonschema                4.22.0
jsonschema-specifications 2023.12.1
jupyter_client            8.6.1
jupyter_core              5.7.2
jupyterlab_widgets        3.0.10
kiwisolver                1.4.5
langcodes                 3.4.0
langdetect                1.0.9
language_data             1.2.0
llvmlite                  0.39.1
lxml                      5.1.0
marisa-trie               1.1.1
MarkupSafe                2.1.5
matplotlib                3.8.4
matplotlib-inline         0.1.7
mpmath                    1.3.0
multidict                 6.0.5
murmurhash                1.0.10
nbformat                  5.10.4
nest-asyncio              1.6.0
networkx                  3.3
numba                     0.56.0
numpy                     1.22.4
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.4.127
nvidia-nvtx-cu12          12.1.105
openai                    1.14.1
openpyxl                  3.1.2
packaging                 23.2
pandas                    2.2.1
paramiko                  3.4.0
parsel                    1.8.1
parso                     0.8.4
pexpect                   4.9.0
pillow                    10.3.0
pip                       24.0
platformdirs              4.2.1
plotly                    5.22.0
preshed                   3.0.9
prompt-toolkit            3.0.43
Protego                   0.3.0
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
pyasn1                    0.5.1
pyasn1-modules            0.3.0
pycparser                 2.21
pydantic                  2.6.4
pydantic_core             2.16.3
PyDispatcher              2.0.7
Pygments                  2.17.2
PyNaCl                    1.5.0
pynndescent               0.5.12
pyOpenSSL                 24.0.0
pyparsing                 3.1.1
PySocks                   1.7.1
python-dateutil           2.9.0.post0
python-docx               1.1.0
python-dotenv             1.0.1
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     26.0.2
queuelib                  1.6.2
referencing               0.35.1
regex                     2024.4.28
requests                  2.31.0
requests-file             2.0.0
rpds-py                   0.18.0
safetensors               0.4.3
scikit-learn              1.4.2
scipy                     1.10.1
Scrapy                    2.11.1
seaborn                   0.13.2
sentence-transformers     2.7.0
service-identity          24.1.0
setuptools                69.5.1
six                       1.16.0
smart-getenv              1.1.0
smart-open                6.4.0
sniffio                   1.3.1
soupsieve                 2.5
spacy                     3.7.4
spacy-legacy              3.0.12
spacy-loggers             1.0.5
srsly                     2.4.8
stack-data                0.6.3
sympy                     1.12
tenacity                  8.2.3
thinc                     8.2.3
threadpoolctl             3.5.0
tldextract                5.1.1
tokenizers                0.19.1
torch                     2.3.0
tornado                   6.4
tqdm                      4.66.2
traitlets                 5.14.3
transformers              4.40.1
triton                    2.3.0
Twisted                   24.3.0
typer                     0.9.4
typing_extensions         4.10.0
tzdata                    2024.1
umap-learn                0.5.6
urllib3                   2.0.7
vaderSentiment            3.3.2
virustotal-python         1.0.2
vt-graph-api              2.2.0
vt-py                     0.18.0
w3lib                     2.1.2
wasabi                    1.1.2
wcwidth                   0.2.13
weasel                    0.3.4
wheel                     0.43.0
widgetsnbextension        4.0.10
wordcloud                 1.9.3
wrapt                     1.16.0
yarl                      1.9.4
zope.interface            6.2

MaartenGr commented 1 month ago

I'm not sure, but I believe you would have to pin numpy to an even lower version, which is by no means an ideal solution. It might be worthwhile to use zero-shot BERTopic instead.
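
A rough sketch of what the zero-shot route could look like, reusing the seed lists from the original post. Two things here are assumptions rather than anything confirmed in this thread: joining each seed list into one label string (hand-written labels may work better), and the similarity threshold of 0.7, which is only a starting point. `build_zeroshot_model` is a hypothetical helper name.

```python
seed_topic_list = [['hacking', 'hackers', 'hacked', 'lost', 'account'],
                   ['data', 'leak', 'permissions', 'unauthorised', 'privacy'],
                   ['bugs', 'crash', 'ddos', 'server', 'virus'],
                   ['username', 'password', 'name', 'credit', 'email'],
                   ['oculus', 'htc', 'windows', 'mac', 'meta']]

# Assumption: turn each seed list into a single zero-shot label string.
zeroshot_topic_list = [" ".join(words) for words in seed_topic_list]


def build_zeroshot_model(labels, min_similarity=0.7):
    """Hypothetical helper: create a BERTopic model in zero-shot mode."""
    # Imported here so the label preparation above runs without bertopic.
    from bertopic import BERTopic

    return BERTopic(
        zeroshot_topic_list=labels,
        zeroshot_min_similarity=min_similarity,  # assumed threshold, tune as needed
    )


# Usage, with rec_room_reviews being the list of review texts:
# topic_model = build_zeroshot_model(zeroshot_topic_list)
# topics, probs = topic_model.fit_transform(rec_room_reviews)
```

Documents that are not similar enough to any label are clustered as usual, so lowering the threshold assigns more reviews to the predefined topics.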

HeinzJS commented 1 month ago

Thank you for the reply.

I previously tried setting numpy to 1.23.5 together with numba 0.56.0, but it gave me an error:

numba 0.56.0 requires numpy<1.23,>=1.18, but you have numpy 1.23.5 which is incompatible.

With numba 0.56.4, I still got the same error as in the original post.
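
To illustrate why that combination is rejected, here is a small sketch that checks a numpy version against the range quoted in the message above (`numpy<1.23,>=1.18`). The helpers `parse` and `in_range` are hypothetical and only handle plain dotted numeric versions; a real check would rely on the `packaging` library instead.

```python
def parse(version):
    """Turn a plain dotted version such as '1.23.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower, upper):
    """Return True if lower <= version < upper (simplified comparison)."""
    return parse(lower) <= parse(version) < parse(upper)


# Range taken from the error message: numpy<1.23,>=1.18
for candidate in ("1.22.4", "1.23.5"):
    print(candidate, in_range(candidate, "1.18", "1.23"))
```

Under this simplified comparison, 1.22.4 falls inside the range and 1.23.5 falls outside it, which matches the reported conflict.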

I'll have a look at zero-shot approach. Thank you once again!