A problem occurs when inputting a big dataset #536

Closed Y1ran closed 2 years ago

Y1ran commented 2 years ago

Hi there, I tried to pass a list of 250k text sequences to model.fit_transform(dataset), and it gives the following error: image

The model works fine when the dataset is smaller (usually no more than 10k). Hopefully this can be solved; looking forward to your help~

MaartenGr commented 2 years ago

Strange, I have not seen that error before. Could you perhaps provide the following additional information?

ClemHFandango commented 2 years ago

I have the exact same problem.


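from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.feature_extraction.text import CountVectorizer

embedding_model = TransformerDocumentEmbeddings('KB/bert-base-swedish-cased')
vectorizer_model = CountVectorizer(stop_words=stopwords)  # stopwords is defined elsewhere (not shown)
topic_model = BERTopic(embedding_model=embedding_model, vectorizer_model=vectorizer_model)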

Dependency versions: transformers 4.19.1, umap-learn 0.5.3, hdbscan 0.8.28, sentence-transformers 2.20, numpy 1.21.6

Dataset type: pandas.core.series.Series. It contains 150k documents, if that makes a difference.

Y1ran commented 2 years ago

Thank you, here is the info:


roberta = TransformerDocumentEmbeddings('hfl/chinese-roberta-wwm-ext')
if roberta:
    model = BERTopic(embedding_model=roberta, verbose=True, low_memory=True, n_gram_range=self.n_gram_range,
                     min_topic_size=self.min_topic_size, diversity=self.diversity)
else:
    model = BERTopic(embedding_model="all-MiniLM-L6-v2", language="english", calculate_probabilities=True,
                     n_gram_range=self.n_gram_range, nr_topics='auto', min_topic_size=self.min_topic_size,
                     diversity=self.diversity, verbose=True)  # embedding can be any language
if len(self.dataset) < 100:
    raise Exception(f"Too few feeds were fetched ({len(self.dataset)}<100), please set a longer day period.")

Dataset: image

Dependency versions: transformers 4.17.0, umap-learn 0.5.2, hdbscan 0.8.28, sentence-transformers 2.2.0, numpy 1.20.1

MaartenGr commented 2 years ago


> dataset type: pandas.core.series.Series

The input should be a list of strings, not a pandas series. Converting it to a list of strings should solve your issue!
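The conversion suggested above could look roughly like the sketch below. The helper name `to_doc_list` is hypothetical and not part of BERTopic or pandas; it relies only on the `tolist()` method that pandas Series and NumPy arrays provide.

```python
def to_doc_list(docs):
    """Return the documents as a plain Python list of strings.

    `to_doc_list` is a hypothetical helper for this sketch, not a BERTopic API.
    """
    # pandas Series and NumPy arrays both expose tolist(); plain iterables do not.
    if hasattr(docs, "tolist"):
        docs = docs.tolist()
    # str() makes every entry a string; note that missing values (NaN/None)
    # would become the strings "nan"/"None", so drop them beforehand if needed.
    return [str(doc) for doc in docs]


# Usage with pandas (assumes pandas is installed):
# import pandas as pd
# docs = to_doc_list(pd.Series(["first document", "second document"]))
# topics, probs = topic_model.fit_transform(docs)
```

Passing a plain list avoids depending on how the model indexes or iterates over a Series.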


Ah, it seems that the default tokenizer will not work for you due to the text that you are using. A different tokenizer is needed to convert the Chinese characters into tokens, which is typically done with jieba. You can find the corresponding tutorial here.
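A minimal sketch of the jieba approach mentioned above, assuming `jieba`, `scikit-learn`, and `bertopic` are installed. The function names `tokenize_zh` and `build_topic_model` are illustrative, and the per-character fallback is an assumption added only so the tokenizer can be tried without jieba; it is not a substitute for real word segmentation.

```python
def tokenize_zh(text):
    """Split Chinese text into word tokens, dropping whitespace-only tokens."""
    try:
        import jieba  # third-party segmenter mentioned in the thread
    except ImportError:
        # Assumption for this sketch only: fall back to one token per
        # character so the example still runs without jieba installed.
        return [ch for ch in text if not ch.isspace()]
    return [tok for tok in jieba.lcut(text) if tok.strip()]


def build_topic_model(embedding_model=None):
    """Build a BERTopic model whose CountVectorizer uses tokenize_zh."""
    # Imported lazily so the tokenizer can be tried without these packages.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer_model = CountVectorizer(tokenizer=tokenize_zh)
    return BERTopic(embedding_model=embedding_model, vectorizer_model=vectorizer_model)


# Usage (docs is a list of Chinese strings; roberta as defined earlier in the thread):
# model = build_topic_model(embedding_model=roberta)
# topics, probs = model.fit_transform(docs)
```

The custom tokenizer replaces CountVectorizer's default whitespace-and-punctuation token pattern, which does not segment Chinese text into words.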

ClemHFandango commented 2 years ago

@MaartenGr the problem still seems to persist not only when I pass the input in as a list, but also when I follow the basic example in the tutorial and try and use a different embedding model as shown:

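from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
topics, probs = topic_model.fit_transform(docs)
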
MaartenGr commented 2 years ago

@ClemHFandango It might be related to your environment as I am running your code without any issues in a Kaggle notebook session. Could you start from a completely fresh environment and try again?

ClemHFandango commented 2 years ago

@MaartenGr In a completely fresh virtual environment I still get the same error. This is with Python 3.9.12, the complete list of installed packages:

MaartenGr commented 2 years ago

@ClemHFandango It seems that the new environment does contain quite a number of packages that should not be relevant to the installation of BERTopic. Perhaps there is some interaction between packages that results in this issue. When you create a new environment, could you only install BERTopic there and then try out the example? Hopefully, this helps us identify what exactly is going wrong here.

ClemHFandango commented 2 years ago

It seems the problem came from version 0.11 of flair; downgrading to 0.10 fixed the issue.
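To check whether an environment has the flair release line reported above, something like the following sketch may help. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper names are hypothetical, and the version check reflects only what this thread reports, not a confirmed compatibility matrix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_flair_0_11(version_string):
    """True if the version string belongs to the 0.11 release line."""
    return version_string is not None and version_string.split(".")[:2] == ["0", "11"]


flair_version = installed_version("flair")
if is_flair_0_11(flair_version):
    print(f"flair {flair_version} found; this thread reports that 0.10 worked.")
else:
    print(f"flair version: {flair_version}")
```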

MaartenGr commented 2 years ago

Due to inactivity, this issue will be closed. Feel free to ping me if you want to re-open the issue!