Strange, I have not seen that error before. Could you perhaps provide the following additional information?
The type of dataset, so the output of type(dataset)
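If it helps, something along these lines prints everything I am asking for (a quick sketch, assuming Python 3.8+ and that dataset is the variable you pass to fit/fit_transform):

from importlib.metadata import version

# type of the input data
print(type(dataset))

# installed versions of the relevant packages
for pkg in ["bertopic", "umap-learn", "hdbscan", "sentence-transformers", "transformers", "numpy"]:
    print(pkg, version(pkg))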
I have the exact same problem.
Code:
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

embedding_model = TransformerDocumentEmbeddings('KB/bert-base-swedish-cased')
vectorizer_model = CountVectorizer(stop_words=stopwords)
topic_model = BERTopic(embedding_model=embedding_model, vectorizer_model=vectorizer_model)
topic_model.fit(docs)
Dependencies versions: transformers: 4.19.1 umap-learn: 0.5.3 hdbscan: 0.8.28 sentence-transformers: 2.20 numpy: 1.21.6
dataset type: pandas.core.series.Series It contains 150k documents, if that makes a difference.
Thank you, here is the info:
code:
roberta = TransformerDocumentEmbeddings('hfl/chinese-roberta-wwm-ext')
if roberta:
    model = BERTopic(embedding_model=roberta, verbose=True, low_memory=True, n_gram_range=self.n_gram_range,
                     min_topic_size=self.min_topic_size, diversity=self.diversity)
else:
    model = BERTopic(embedding_model="all-MiniLM-L6-v2", language="english", calculate_probabilities=True,
                     n_gram_range=self.n_gram_range, nr_topics='auto', min_topic_size=self.min_topic_size,
                     diversity=self.diversity, verbose=True)  # embedding can be any language
if len(self.dataset) < 100:
    raise Exception(f"Too few feeds were fetched ({len(self.dataset)} < 100); please set a longer day period.")
Dataset:
Dependencies versions: transformers: 4.17.0 umap-learn: 0.5.2 hdbscan: 0.8.28 sentence-transformers: 2.2.0 numpy: 1.20.1
@ClemHFandango
dataset type: pandas.core.series.Series
The input should be a list of strings, not a pandas series. Converting it to a list of strings should solve your issue!
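For example (a rough sketch, assuming the series is called dataset and the model is the topic_model from your snippet):

docs = dataset.astype(str).tolist()   # plain Python list of strings
topics, probs = topic_model.fit_transform(docs)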
@AlanTur1ng
Ah, it seems that the default tokenizer will not work for you due to the text that you are using. A different tokenizer is needed to convert the Chinese characters into tokens, which is typically done with jieba. You can find the corresponding tutorial here.
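Something along these lines should do it (a rough sketch, not the exact tutorial code; it assumes jieba is installed and reuses the roberta embeddings from your snippet):

from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
import jieba

def tokenize_zh(text):
    # segment Chinese text into words instead of relying on the default whitespace-based pattern
    return jieba.lcut(text)

roberta = TransformerDocumentEmbeddings('hfl/chinese-roberta-wwm-ext')
vectorizer = CountVectorizer(tokenizer=tokenize_zh)
topic_model = BERTopic(embedding_model=roberta, vectorizer_model=vectorizer, verbose=True)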
@MaartenGr the problem still persists, not only when I pass the input in as a list, but also when I follow the basic example in the tutorial and try to use a different embedding model, as shown:
from sklearn.datasets import fetch_20newsgroups
from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
topics, probs = topic_model.fit_transform(docs)
@ClemHFandango It might be related to your environment as I am running your code without any issues in a Kaggle notebook session. Could you start from a completely fresh environment and try again?
@MaartenGr In a completely fresh virtual environment I still get the same error. This is with Python 3.9.12; the complete list of installed packages:
Package Version
--------------------- -----------
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.0.5
attrs 21.4.0
backcall 0.2.0
beautifulsoup4 4.11.1
bertopic 0.10.0
bleach 5.0.0
bpemb 0.3.3
certifi 2021.10.8
cffi 1.15.0
charset-normalizer 2.0.12
click 8.1.3
cloudpickle 2.0.0
conllu 4.4.2
cycler 0.11.0
Cython 0.29.30
debugpy 1.6.0
decorator 5.1.1
defusedxml 0.7.1
Deprecated 1.2.13
entrypoints 0.4
executing 0.8.3
fastjsonschema 2.15.3
filelock 3.7.0
flair 0.11.2
fonttools 4.33.3
ftfy 6.1.1
future 0.18.2
gdown 3.12.2
gensim 4.2.0
hdbscan 0.8.28
huggingface-hub 0.6.0
hyperopt 0.2.7
idna 3.3
importlib-metadata 3.10.1
ipykernel 6.13.0
ipython 8.3.0
ipython-genutils 0.2.0
ipywidgets 7.7.0
Janome 0.4.2
jedi 0.18.1
Jinja2 3.1.2
joblib 1.1.0
jsonschema 4.5.1
jupyter 1.0.0
jupyter-client 7.3.1
jupyter-console 6.4.3
jupyter-core 4.10.0
jupyterlab-pygments 0.2.2
jupyterlab-widgets 1.1.0
kiwisolver 1.4.2
konoha 4.6.5
langdetect 1.0.9
llvmlite 0.38.0
lxml 4.8.0
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.3
mistune 0.8.4
more-itertools 8.13.0
mpld3 0.3
nbclient 0.6.3
nbconvert 6.5.0
nbformat 5.4.0
nest-asyncio 1.5.5
networkx 2.8
nltk 3.7
notebook 6.4.11
numba 0.55.1
numpy 1.21.6
overrides 3.1.0
packaging 21.3
pandas 1.4.2
pandocfilters 1.5.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.1.1
pip 22.0.4
plotly 5.8.0
pptree 3.1
prometheus-client 0.14.1
prompt-toolkit 3.0.29
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py4j 0.10.9.5
pycparser 2.21
Pygments 2.12.0
pynndescent 0.5.7
pyparsing 3.0.9
pyrsistent 0.18.1
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2022.1
PyYAML 5.4.1
pyzmq 22.3.0
qtconsole 5.3.0
QtPy 2.1.0
regex 2022.4.24
requests 2.27.1
scikit-learn 1.1.0
scipy 1.8.0
segtok 1.5.11
Send2Trash 1.8.0
sentence-transformers 2.2.0
sentencepiece 0.1.95
setuptools 58.1.0
six 1.16.0
sklearn 0.0
smart-open 6.0.0
soupsieve 2.3.2.post1
sqlitedict 2.0.0
stack-data 0.2.0
tabulate 0.8.9
tenacity 8.0.1
terminado 0.15.0
threadpoolctl 3.1.0
tinycss2 1.1.1
tokenizers 0.12.1
torch 1.11.0
torchvision 0.12.0
tornado 6.1
tqdm 4.64.0
traitlets 5.2.1.post0
transformers 4.19.2
typing_extensions 4.2.0
umap-learn 0.5.3
urllib3 1.26.9
wcwidth 0.2.5
webencodings 0.5.1
widgetsnbextension 3.6.0
Wikipedia-API 0.5.4
wrapt 1.14.1
zipp 3.8.0
@ClemHFandango It seems that the new environment does contain quite a number of packages that should not be relevant to the installation of BERTopic. Perhaps there is some interaction between packages that results in this issue. When you create a new environment, could you only install BERTopic there and then try out the example? Hopefully, this helps us identify what exactly is going wrong here.
It seems the problem came from version 0.11 of flair; downgrading to 0.10 fixed the issue.
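For anyone else hitting this, a quick sanity check (a minimal sketch; it assumes flair exposes __version__ at the package level, which recent releases do):

import flair
print(flair.__version__)   # 0.11 triggered the error here, 0.10 worked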
Due to inactivity, this issue will be closed. Feel free to ping me if you want to re-open the issue!
Hi there, I tried to pass a data list containing 250k text sequences as the input to model.fit_transform(dataset), and it gives the following error:
whereas the model works fine when the dataset is smaller (usually no more than 10k documents). Hopefully this can be solved, looking forward to your help~