MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Setting nr_topics to 'auto' raises IndexError exception #1321

Closed: piotrcelinski closed this issue 1 year ago

piotrcelinski commented 1 year ago

Hi, I set the nr_topics to 'auto' in:

            self.topic_model = BERTopic(
                embedding_model=self.embedding_model,  # Step 1 - Extract embeddings
                umap_model=self.reduction_model,  # Step 2 - Reduce dimensionality
                hdbscan_model=self.hdbscan_model,  # Step 3 - Cluster reduced embeddings
                vectorizer_model=self.vectorizer_model,  # Step 4 - Tokenize topics
                ctfidf_model=self.ctfidf_model,  # Step 5 - Extract topic words
                # diversity=self.topic_model_diversity,       # Step 6 - Diversify topic words
                representation_model=self.representation_model,
                language=self.language,
                nr_topics=self.nr_topics,
                top_n_words=self.top_n_words,
                n_gram_range=self.n_gram_range,
                verbose=True
            )

and got IndexError: list index out of range. The traceback is below:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\piotr\onedrive\PycharmProjects\tematy-bert-310\slupsk-ria-tass.py:177 in <module>       │
│                                                                                                  │
│   174 │   logger.info(f'{Fore.GREEN}Quitting{Style.RESET_ALL}')                                  │
│   175                                                                                            │
│   176 if __name__ =='__main__':                                                                  │
│ ❱ 177 │   main()                                                                                 │
│   178                                                                                            │
│                                                                                                  │
│ C:\Users\piotr\onedrive\PycharmProjects\tematy-bert-310\slupsk-ria-tass.py:169 in main           │
│                                                                                                  │
│   166 │   │   │   │   │   │   │   │   f'{Fore.BLUE}LEMMA/ORIG:{Style.RESET_ALL} {lemorig}')      │
│   167 │   │   │   │   │   filename = re.sub(r'\.', '_', src) + f'_{lemorig}'                     │
│   168 │   │   │   │   │   stopwords = Filters.stopwords_ru + ['новости', 'новость', 'риа', 'та   │
│ ❱ 169 │   │   │   │   │   make_topics(corp_by_src[src][lemorig], corp_by_src[src]['date'], sto   │
│   170 │   │   │   │   │   │   │   │   filename, src, lemorig, 30)                                │
│   171 │   # my_topics = Topics()                                                                 │
│   172 │   # my_topics.load_model(f'{basepath}Ria_ru_lemma_Base_outliers_reduced.model')          │
│                                                                                                  │
│ C:\Users\piotr\onedrive\PycharmProjects\tematy-bert-310\slupsk-ria-tass.py:109 in make_topics    │
│                                                                                                  │
│   106 │   │   │   │   │      outliers_reduction_strategy='embeddings'                            │
│   107 │   │   │   │   │      )                                                                   │
│   108 │   logger.info(f'{Fore.GREEN}Topics search{Style.RESET_ALL}')                             │
│ ❱ 109 │   my_topics.topics_from_corpus(f'{filename}')                                            │
│   110 │   logger.info(f'{Fore.GREEN}Reducing outliers{Style.RESET_ALL}')                         │
│   111 │   my_topics.reduce_outliers(f'{filename}')                                               │
│   112 │   logger.info(f'{Fore.GREEN}Reducing topics to {nr_topics_reduced}{Style.RESET_ALL}')    │
│                                                                                                  │
│ C:\Users\piotr\OneDrive\python\pgc\nlp\topics.py:239 in topics_from_corpus                       │
│                                                                                                  │
│   236 │                                                                                          │
│   237 │   def topics_from_corpus(self, filename):                                                │
│   238 │   │   self.logger.info(f'{Fore.GREEN}Transforming corpus{Style.RESET_ALL}')              │
│ ❱ 239 │   │   self.topics, self.probs = self.topic_model.fit_transform(self.corpus)              │
│   240 │   │   # self.generate_and_set_topic_labels()                                             │
│   241 │   │   self.save_model_and_topics(filename, 'base')                                       │
│   242                                                                                            │
│                                                                                                  │
│ C:\Users\piotr\OneDrive\PycharmProjects\tematy-bert-310\lib\site-packages\bertopic\_bertopic.py: │
│ 415 in fit_transform                                                                             │
│                                                                                                  │
│    412 │   │   │                                                                                 │
│    413 │   │   │   # Reduce topics                                                               │
│    414 │   │   │   if self.nr_topics:                                                            │
│ ❱  415 │   │   │   │   documents = self._reduce_topics(documents)                                │
│    416 │   │   │                                                                                 │
│    417 │   │   │   # Save the top 3 most representative documents per topic                      │
│    418 │   │   │   self._save_representative_docs(documents)                                     │
│                                                                                                  │
│ C:\Users\piotr\OneDrive\PycharmProjects\tematy-bert-310\lib\site-packages\bertopic\_bertopic.py: │
│ 3605 in _reduce_topics                                                                           │
│                                                                                                  │
│   3602 │   │   │   if self.nr_topics < initial_nr_topics:                                        │
│   3603 │   │   │   │   documents = self._reduce_to_n_topics(documents)                           │
│   3604 │   │   elif isinstance(self.nr_topics, str):                                             │
│ ❱ 3605 │   │   │   documents = self._auto_reduce_topics(documents)                               │
│   3606 │   │   else:                                                                             │
│   3607 │   │   │   raise ValueError("nr_topics needs to be an int or 'auto'! ")                  │
│   3608                                                                                           │
│                                                                                                  │
│ C:\Users\piotr\OneDrive\PycharmProjects\tematy-bert-310\lib\site-packages\bertopic\_bertopic.py: │
│ 3671 in _auto_reduce_topics                                                                      │
│                                                                                                  │
│   3668 │   │   """                                                                               │
│   3669 │   │   topics = documents.Topic.tolist().copy()                                          │
│   3670 │   │   unique_topics = sorted(list(documents.Topic.unique()))[self._outliers:]           │
│ ❱ 3671 │   │   max_topic = unique_topics[-1]                                                     │
│   3672 │   │                                                                                     │
│   3673 │   │   # Find similar topics                                                             │
│   3674 │   │   if self.topic_embeddings_ is not None:                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range

What am I doing wrong? Piotr

MaartenGr commented 1 year ago

It might be that the initial number of topics that were created was already small and that there is something going on with the cluster model.

Could you provide your full code? It is difficult to see exactly what you are passing to the model. Also, which version of BERTopic are you using?

piotrcelinski commented 1 year ago

Hello, the sample contained 509 texts. BERTopic detected only topic -1, and the number of texts was 35 (which looks strange to me). BERTopic version: 0.15.0. Parameters are below:

self.topic_model = BERTopic(
    embedding_model=SentenceTransformer('paraphrase-multilingual-mpnet-base-v2'),
    umap_model=UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric='cosine',
        random_state=42
    ),
    hdbscan_model=HDBSCAN(
        min_cluster_size=14,
        metric='euclidean',
        cluster_selection_method='eom',
        prediction_data=True
    ),
    vectorizer_model=CountVectorizer(stop_words=[***LIST OF STOPWORDS HERE***]),
    ctfidf_model=ClassTfidfTransformer(reduce_frequent_words=True),
    representation_model=BaseRepresentation,
    language='multilingual',
    nr_topics='auto',
    top_n_words=20,
    n_gram_range=(1, 3),
    verbose=True
)

I am not sending the full code because the codebase is large and would be very time-consuming to analyze. Piotr
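For readers following the traceback: the two lines at 3670-3671 in `_auto_reduce_topics` slice the sorted topic ids and then index the last element. The sketch below reproduces that pattern with plain Python lists, with no BERTopic import. It assumes that `self._outliers` is 1 when topic -1 is present; the function names and the guard are hypothetical and not part of BERTopic.

```python
# Plain-Python imitation of the slice-then-index pattern from the traceback.
# Assumption: self._outliers behaves like 1 when topic -1 exists, else 0.

def last_real_topic(topic_ids, has_outliers):
    """Return the largest non-outlier topic id, as the traceback code does.

    topic_ids:    one topic label per document (-1 marks outliers)
    has_outliers: stand-in for self._outliers
    """
    unique_topics = sorted(set(topic_ids))[int(has_outliers):]
    return unique_topics[-1]  # raises IndexError when the slice is empty


def safe_last_real_topic(topic_ids, has_outliers):
    """Hypothetical guard: return None when there are no real topics."""
    unique_topics = sorted(set(topic_ids))[int(has_outliers):]
    return unique_topics[-1] if unique_topics else None


# Every document is an outlier, so the only topic id is -1.
only_outliers = [-1, -1, -1, -1, -1]

try:
    last_real_topic(only_outliers, has_outliers=True)
except IndexError as err:
    print(f"IndexError: {err}")  # IndexError: list index out of range

print(safe_last_real_topic(only_outliers, has_outliers=True))    # None
print(safe_last_real_topic([-1, 0, 0, 1, 2], has_outliers=True))  # 2
```

When -1 is the only topic id, dropping the first element leaves an empty list, and indexing `[-1]` on it raises the same `IndexError` as in the report.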

MaartenGr commented 1 year ago

Then the error you are getting occurs because no actual topics were created: only the outlier topic -1 was found. HDBSCAN typically does not work well with small datasets, so you will likely need to set min_cluster_size to a much smaller value, such as 3. Alternatively, you can use k-Means or another algorithm that lets you specify k to perform the clustering. You can find more about that here.
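A minimal sketch of the two suggestions above, assuming the `bertopic`, `hdbscan`, and `scikit-learn` packages are installed. The helper names, `n_clusters=10`, and the choice to leave `nr_topics` at its default are illustrative assumptions, not values from this thread. The imports sit inside the builder functions so that the pure-Python check can be used on its own.

```python
def has_real_topics(topics):
    """True if at least one document landed in a non-outlier topic."""
    return any(topic != -1 for topic in topics)


def build_hdbscan_model(min_cluster_size=3):
    """HDBSCAN with a small min_cluster_size, for small corpora."""
    from hdbscan import HDBSCAN
    return HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )


def build_kmeans_model(n_clusters=10):
    """k-Means assigns every document to a cluster, so no -1 topic appears."""
    from sklearn.cluster import KMeans
    return KMeans(n_clusters=n_clusters, random_state=42)


def build_topic_model(cluster_model):
    """BERTopic takes the clustering step through its hdbscan_model argument."""
    from bertopic import BERTopic
    return BERTopic(
        hdbscan_model=cluster_model,
        language="multilingual",
        verbose=True,
    )


# Hypothetical usage, where `docs` is a list of strings:
#   topic_model = build_topic_model(build_kmeans_model(n_clusters=10))
#   topics, probs = topic_model.fit_transform(docs)
#   print(has_real_topics(topics))
```

Checking `has_real_topics(topics)` after fitting shows whether clustering produced anything other than outliers before any topic reduction is attempted. How many clusters suit a corpus of about 500 texts is a judgment call.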

piotrcelinski commented 1 year ago

Thank you very much!