MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

A problem occurs when inputting a big dataset #536

Closed · Y1ran closed this 2 years ago

Y1ran commented 2 years ago

Hi there, I tried to input a data list containing 250k text sequences into model.fit_transform(dataset), and it gives the following error: [screenshot of the error]

The model works fine when the dataset is smaller (usually no more than 10k documents). I hope this can be solved; looking forward to your help!

MaartenGr commented 2 years ago

Strange, I have not seen that error before. Could you perhaps provide the following additional information?

ClemHFandango commented 2 years ago

I have the exact same problem.

Code:

from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

embedding_model = TransformerDocumentEmbeddings('KB/bert-base-swedish-cased')
vectorizer_model = CountVectorizer(stop_words=stopwords)  # `stopwords`: a list of Swedish stop words defined elsewhere
topic_model = BERTopic(embedding_model=embedding_model, vectorizer_model=vectorizer_model)
topic_model.fit(docs)

Dependency versions:

- transformers: 4.19.1
- umap-learn: 0.5.3
- hdbscan: 0.8.28
- sentence-transformers: 2.2.0
- numpy: 1.21.6

Dataset type: pandas.core.series.Series. It contains 150k documents, if that makes a difference.

Y1ran commented 2 years ago

Thank you, here is the info:

code:

from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

roberta = TransformerDocumentEmbeddings('hfl/chinese-roberta-wwm-ext')
if roberta:
    model = BERTopic(embedding_model=roberta, verbose=True, low_memory=True, n_gram_range=self.n_gram_range,
                     min_topic_size=self.min_topic_size, diversity=self.diversity)
else:
    model = BERTopic(embedding_model="all-MiniLM-L6-v2", language="english", calculate_probabilities=True,
                     n_gram_range=self.n_gram_range, nr_topics='auto', min_topic_size=self.min_topic_size,
                     diversity=self.diversity, verbose=True)  # the embedding model can be any language
if len(self.dataset) < 100:
    raise Exception(f"Too few feeds were fetched ({len(self.dataset)} < 100); please set a longer day period.")

Dataset: [screenshot of the dataset]

Dependency versions:

- transformers: 4.17.0
- umap-learn: 0.5.2
- hdbscan: 0.8.28
- sentence-transformers: 2.2.0
- numpy: 1.20.1

MaartenGr commented 2 years ago

@ClemHFandango

Dataset type: pandas.core.series.Series

The input should be a list of strings, not a pandas series. Converting it to a list of strings should solve your issue!
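
A minimal sketch of that conversion (assuming your Series is named dataset):

docs = dataset.astype(str).tolist()  # pandas Series -> list of strings
topic_model.fit(docs)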

@AlanTur1ng

Ah, it seems that the default tokenizer will not work for you due to the text that you are using. A different tokenizer is needed to convert the Chinese characters into tokens, which is typically done with jieba. You can find the corresponding tutorial here.
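
A minimal sketch of that approach (this assumes jieba is installed and reuses the roberta embedding model from your snippet; it mirrors the pattern from the documentation rather than being the exact tutorial code):

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
import jieba

def tokenize_zh(text):
    # Segment Chinese text into word tokens with jieba
    return jieba.lcut(text)

vectorizer_model = CountVectorizer(tokenizer=tokenize_zh)
topic_model = BERTopic(embedding_model=roberta, vectorizer_model=vectorizer_model)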

ClemHFandango commented 2 years ago

@MaartenGr The problem still seems to persist, not only when I pass the input in as a list, but also when I follow the basic example in the tutorial and try to use a different embedding model, as shown:

from sklearn.datasets import fetch_20newsgroups
from flair.embeddings import TransformerDocumentEmbeddings
from bertopic import BERTopic

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
topics, probs = topic_model.fit_transform(docs)

MaartenGr commented 2 years ago

@ClemHFandango It might be related to your environment as I am running your code without any issues in a Kaggle notebook session. Could you start from a completely fresh environment and try again?

ClemHFandango commented 2 years ago

@MaartenGr In a completely fresh virtual environment I still get the same error. This is with Python 3.9.12; the complete list of installed packages:

Package               Version
--------------------- -----------
argon2-cffi           21.3.0
argon2-cffi-bindings  21.2.0
asttokens             2.0.5
attrs                 21.4.0
backcall              0.2.0
beautifulsoup4        4.11.1
bertopic              0.10.0
bleach                5.0.0
bpemb                 0.3.3
certifi               2021.10.8
cffi                  1.15.0
charset-normalizer    2.0.12
click                 8.1.3
cloudpickle           2.0.0
conllu                4.4.2
cycler                0.11.0
Cython                0.29.30
debugpy               1.6.0
decorator             5.1.1
defusedxml            0.7.1
Deprecated            1.2.13
entrypoints           0.4
executing             0.8.3
fastjsonschema        2.15.3
filelock              3.7.0
flair                 0.11.2
fonttools             4.33.3
ftfy                  6.1.1
future                0.18.2
gdown                 3.12.2
gensim                4.2.0
hdbscan               0.8.28
huggingface-hub       0.6.0
hyperopt              0.2.7
idna                  3.3
importlib-metadata    3.10.1
ipykernel             6.13.0
ipython               8.3.0
ipython-genutils      0.2.0
ipywidgets            7.7.0
Janome                0.4.2
jedi                  0.18.1
Jinja2                3.1.2
joblib                1.1.0
jsonschema            4.5.1
jupyter               1.0.0
jupyter-client        7.3.1
jupyter-console       6.4.3
jupyter-core          4.10.0
jupyterlab-pygments   0.2.2
jupyterlab-widgets    1.1.0
kiwisolver            1.4.2
konoha                4.6.5
langdetect            1.0.9
llvmlite              0.38.0
lxml                  4.8.0
MarkupSafe            2.1.1
matplotlib            3.5.2
matplotlib-inline     0.1.3
mistune               0.8.4
more-itertools        8.13.0
mpld3                 0.3
nbclient              0.6.3
nbconvert             6.5.0
nbformat              5.4.0
nest-asyncio          1.5.5
networkx              2.8
nltk                  3.7
notebook              6.4.11
numba                 0.55.1
numpy                 1.21.6
overrides             3.1.0
packaging             21.3
pandas                1.4.2
pandocfilters         1.5.0
parso                 0.8.3
pexpect               4.8.0
pickleshare           0.7.5
Pillow                9.1.1
pip                   22.0.4
plotly                5.8.0
pptree                3.1
prometheus-client     0.14.1
prompt-toolkit        3.0.29
psutil                5.9.0
ptyprocess            0.7.0
pure-eval             0.2.2
py4j                  0.10.9.5
pycparser             2.21
Pygments              2.12.0
pynndescent           0.5.7
pyparsing             3.0.9
pyrsistent            0.18.1
PySocks               1.7.1
python-dateutil       2.8.2
pytz                  2022.1
PyYAML                5.4.1
pyzmq                 22.3.0
qtconsole             5.3.0
QtPy                  2.1.0
regex                 2022.4.24
requests              2.27.1
scikit-learn          1.1.0
scipy                 1.8.0
segtok                1.5.11
Send2Trash            1.8.0
sentence-transformers 2.2.0
sentencepiece         0.1.95
setuptools            58.1.0
six                   1.16.0
sklearn               0.0
smart-open            6.0.0
soupsieve             2.3.2.post1
sqlitedict            2.0.0
stack-data            0.2.0
tabulate              0.8.9
tenacity              8.0.1
terminado             0.15.0
threadpoolctl         3.1.0
tinycss2              1.1.1
tokenizers            0.12.1
torch                 1.11.0
torchvision           0.12.0
tornado               6.1
tqdm                  4.64.0
traitlets             5.2.1.post0
transformers          4.19.2
typing_extensions     4.2.0
umap-learn            0.5.3
urllib3               1.26.9
wcwidth               0.2.5
webencodings          0.5.1
widgetsnbextension    3.6.0
Wikipedia-API         0.5.4
wrapt                 1.14.1
zipp                  3.8.0
MaartenGr commented 2 years ago

@ClemHFandango It seems that the new environment does contain quite a number of packages that should not be relevant to the installation of BERTopic. Perhaps there is some interaction between packages that results in this issue. When you create a new environment, could you only install BERTopic there and then try out the example? Hopefully, this helps us identify what exactly is going wrong here.

ClemHFandango commented 2 years ago

It seems the problem came from version 0.11 of flair; downgrading to 0.10 fixed the issue.
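
For anyone hitting the same error, the downgrade can be done with pip:

pip install flair==0.10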

MaartenGr commented 2 years ago

Due to inactivity, this issue will be closed. Feel free to ping me if you want to re-open the issue!