MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Fixed issue #2144 #2191

Closed · PipaFlores closed this 2 weeks ago

PipaFlores commented 1 month ago

What does this PR do?

I'm sorry for making a new PR, but hopefully this is much more clear/clean

Fixes https://github.com/MaartenGr/BERTopic/issues/2144 by optimizing the topic extraction process when using fit_transform() with nr_topics="auto" or an integer for reducing topics. The main improvement is modifying the fit_transform method to call _extract_topics only after topic reduction (if needed), avoiding an unnecessary extraction step before reduction.

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics by adding self._sort_mappings_by_frequency(documents) in _reduce_topics() at line 4369. This was not accounted for by the previous tests (they raised a false negative), so test_representations.py was modified to cover it.


MaartenGr commented 1 month ago

I'm sorry for making a new PR, but hopefully this is much more clear/clean

No problem, if a simpler fix is possible I would definitely like to go for that.

Modifying the fit_transform method to call _extract_topics only after topic reduction (if needed). In my previous PR I added a calculate representations argument; this is not necessary here, as the initial number of topics is now retrieved as initial_nr_topics = len(documents["Topic"].unique()), which is an outcome of the clustering procedure. So this doesn't require a prior call to _extract_topics() in any form.

I thought this wasn't possible since we need either the c-TF-IDF representations of the topics or the topic embeddings to reduce the topics. Neither the c-TF-IDF nor the topic embeddings are outcomes of the clustering procedure, so how can _reduce_topics run without that information?

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics by adding self._sort_mappings_by_frequency(documents) in _reduce_topics() at line 4369. This was not accounted for by the previous tests (they raised a false negative), so test_representations.py was modified to cover it.

I still have to wrap my head around this one. By adding this, aren't we essentially ignoring an issue here that is introduced by the PR?

As you mentioned in the previous PR:

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics. In this scenario, the current implementation extracts topics but doesn't sort the mappings by frequency. This results in a list of topics with unsorted indexes. While topic information is still displayed sorted by frequencies, the topic indexes may not match their frequency rank.

Although this is an edge case, I actually think this might happen more often than you think. Users do not know the number of clustered topics beforehand and might overshoot. Or am I missing something here?

PipaFlores commented 1 month ago

Long story short: the edge case is fully handled, and the number of topics can be obtained directly from the clustering output, so _extract_topics() is no longer needed just to count topics before reduction.

Long story long: regarding the edge case, this is completely solved. The confusion arises because in my first attempt I couldn't tackle it, and I mentioned it in the issue and the first PR, as you pointed out:

Additionally, this PR addresses an edge case where self.nr_topics >= initial_nr_topics. In this scenario, the current implementation extracts topics but doesn't sort the mappings by frequency. This results in a list of topics with unsorted indexes. While topic information is still displayed sorted by frequencies, the topic indexes may not match their frequency rank.

However, this was addressed here by adding documents = self._sort_mappings_by_frequency(documents) at line #4356 and by modifying test_representations.py to check for sorted topics in the edge-case scenarios. The modified test now correctly checks for sorted topics (which was not done before, which is why it raised a false error). Note that this is an existing error, not one introduced by this PR, but I am fairly certain the edge case is handled well now and users won't encounter unsorted topics in any case.
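For reference, a condensed sketch of the relevant branch of _reduce_topics() with the added sorting step (simplified for illustration, not a verbatim copy of the code; see also the traceback excerpt later in this thread):

# Condensed sketch of _reduce_topics() with the added sort (illustrative only)
if isinstance(self.nr_topics, int):
    if self.nr_topics < initial_nr_topics:
        documents = self._reduce_to_n_topics(documents, use_ctfidf)
    else:
        # Edge case: nothing to merge, but the topic ids must still be
        # re-ordered so that they follow their frequency rank.
        documents = self._sort_mappings_by_frequency(documents)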

Regarding the other point: this also touches on my recently opened issue regarding logging and, more broadly, the conceptualization of the pipeline. In particular, our confusion here arises from the question of when topics are "extracted".

The way I see it, topics are formed, as groups of documents/embeddings, at the end of the clustering procedure. This is where the documents["Topic"] column is first defined. Everything that comes after just enhances the interpretability (or representations) of each group of embeddings (thereby "defining" the topic). As such, the number of topics is already determined during clustering, at line 463: documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

The documents dataframe is assigned a "Topic" column with the labels produced by HDBSCAN (or any other clustering procedure). These are the labels that are afterwards sorted.

It was unnecessary to rely on _extract_topics() just to get the number of topics. A quick check verified that len(documents["Topic"].unique()) (after clustering) matches len(self.get_topics()) (after _extract_topics()). Therefore, I bypassed _extract_topics() entirely, since its only role there was to obtain the topic count.
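The quick check amounts to something like the following fragment from inside fit_transform() (a sketch for illustration; the two method calls are the ones already present in fit_transform):

# After clustering, the topic count is already known:
documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
n_clustered = len(documents["Topic"].unique())

# After the usual extraction step, get_topics() reports the same count:
self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
assert n_clustered == len(self.get_topics())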

While _reduce_topics() could use the c-TF-IDF, this is only the case when the argument use_ctfidf=True. However, this value is False by default and can't be changed at BERTopic initialization or in fit_transform(). It can only be set manually when reducing topics after the model has been fully fitted, including the c-TF-IDF.

MaartenGr commented 1 month ago

While _reduce_topics() could use the c-TF-IDF, this is only the case when the argument use_ctfidf=True. However, this value is False by default and can't be changed at BERTopic initialization or in fit_transform(). It can only be set manually when reducing topics after the model has been fully fitted, including the c-TF-IDF.

That's exactly the thing. In order to use _reduce_topics you need either topic embeddings or c-TF-IDF representations. Otherwise, there is nothing that can be used to actually reduce the topics.

I forked your branch and ran the following which gave me an error referencing exactly this problem:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
model = BERTopic(nr_topics=2, verbose=True).fit(docs)

Which gave me the following error log:

2024-10-22 13:49:12,464 - BERTopic - Embedding - Completed ✓
2024-10-22 13:49:12,464 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 13:49:16,580 - BERTopic - Dimensionality - Completed ✓
2024-10-22 13:49:16,581 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 13:49:17,012 - BERTopic - Cluster - Completed ✓
2024-10-22 13:49:17,013 - BERTopic - Topic reduction - Reducing number of topics
2024-10-22 13:49:17,014 - BERTopic - WARNING: No topic embeddings were found despite they are supposed to be used (`use_ctfidf` is False). Defaulting to c-TF-IDF representation.

After which I get the following error:

TypeError                                 Traceback (most recent call last)
Cell In[20], line 5
      2 from sklearn.datasets import fetch_20newsgroups
      4 docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
----> 5 model = BERTopic(nr_topics=2, verbose=True).fit(docs)

File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:364, in BERTopic.fit(self, documents, embeddings, images, y)
    322 def fit(
    323     self,
    324     documents: List[str],
   (...)
    327     y: Union[List[int], np.ndarray] = None,
    328 ):
    329     """Fit the models (Bert, UMAP, and, HDBSCAN) on a collection of documents and generate topics.
    330 
    331     Arguments:
   (...)
    362     ```
    363     """
--> 364     self.fit_transform(documents=documents, embeddings=embeddings, y=y, images=images)
    365     return self

File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:494, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    491 else:
    492     # Reduce topics if needed, extract topics by calculating c-TF-IDF, and get representations.
    493     if self.nr_topics:
--> 494         documents = self._reduce_topics(documents)
    495     else:
    496         self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)

File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:4351, in BERTopic._reduce_topics(self, documents, use_ctfidf)
   4349 if isinstance(self.nr_topics, int):
   4350     if self.nr_topics < initial_nr_topics:
-> 4351         documents = self._reduce_to_n_topics(documents, use_ctfidf)
   4352     else:
   4353         logger.info(
   4354             f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(documents['Topic'].unique())})."
   4355         )

File ~\Documents\Projects\BERTopicNew\bertopic\_bertopic.py:4383, in BERTopic._reduce_to_n_topics(self, documents, use_ctfidf)
   4380 topics = documents.Topic.tolist().copy()
   4382 # Create topic distance matrix
-> 4383 topic_embeddings = select_topic_representation(
   4384     self.c_tf_idf_, self.topic_embeddings_, use_ctfidf, output_ndarray=True
   4385 )[0][self._outliers :]
   4386 distance_matrix = 1 - cosine_similarity(topic_embeddings)
   4387 np.fill_diagonal(distance_matrix, 0)

TypeError: 'NoneType' object is not subscriptable

As shown, because no topic embeddings or c-TF-IDF representations were generated, the _reduce_topics function throws an error. Without those, how would we actually do similarity checks between topics to see which ones can be merged?

But the strangest thing to me is that the tests seem to pass without any issues, even though they really shouldn't.

PipaFlores commented 1 month ago

True! I had my scripts checking only the edge case and not normal topic reduction. It is indeed strange that it doesn't show up in the tests. I might look into that in the near to mid-term future.

The solution might then revert to the previous version (the one with the calculate_representation argument). Or there might be a way to compute only the topic embeddings beforehand (e.g., if topic_embeddings_ is None, calculate them with _create_topic_vectors() within _reduce_topics()). If there is an easy way to avoid computing the c-TF-IDF, I would suggest going for that; otherwise, we can go back to calling _extract_topics() with only the base c-TF-IDF representation.
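A rough sketch of that alternative (hypothetical; _create_topic_vectors is BERTopic's private helper for building topic embeddings, but its exact signature may differ, and this is not the change that ended up being merged):

# Hypothetical guard at the start of _reduce_topics(): build topic vectors only
# when they are missing and c-TF-IDF is not requested (sketch only).
if not use_ctfidf and self.topic_embeddings_ is None:
    self._create_topic_vectors(documents=documents)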

I can amend the commit, taking into account the comments on the previous PR, to improve clarity.

PipaFlores commented 1 month ago

Ok, I made a new commit. I didn't amend the previous commit, to keep some reference (note: I ended up force-pushing a logging detail, my bad), and I thought that maybe you could do that when merging (still getting my head around shared repos). If I am wrong, let me know and I can squash everything into one commit.

This returns to the previous solution, but now calculate_aspects is renamed to fine_tune_representation. This is much clearer, as the argument only controls the use of optional representation models for fine-tuning and doesn't affect the base c-TF-IDF representation.

Importantly, the edge case is still fully handled in the code and the test.

When it comes to logging, a normal fit looks the same as before:

2024-10-22 17:08:37,841 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 17:08:42,944 - BERTopic - Dimensionality - Completed ✓
2024-10-22 17:08:42,945 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 17:08:42,951 - BERTopic - Cluster - Completed ✓
2024-10-22 17:08:42,953 - BERTopic - Representation - Fine-tuning topics using representation models.
2024-10-22 17:08:44,528 - BERTopic - Representation - Completed ✓

But when a desired number of topics is provided, the modified log appears as:

2024-10-22 17:08:44,553 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-10-22 17:08:45,410 - BERTopic - Dimensionality - Completed ✓
2024-10-22 17:08:45,411 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-10-22 17:08:45,416 - BERTopic - Cluster - Completed ✓
2024-10-22 17:08:45,418 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2024-10-22 17:08:45,431 - BERTopic - Representation - Completed ✓
2024-10-22 17:08:45,432 - BERTopic - Topic reduction - Reducing number of topics
2024-10-22 17:08:45,436 - BERTopic - Representation - Fine-tuning topics using representation models.
2024-10-22 17:08:46,206 - BERTopic - Representation - Completed ✓
2024-10-22 17:08:46,207 - BERTopic - Topic reduction - Reduced number of topics from 22 to 10

The 'for topic reduction' phrasing is added because this log line only appears when fit_transform() tries to reduce topics and the c-TF-IDF is calculated solely for that purpose. The log still doesn't show every time the c-TF-IDF is calculated, but that is another issue.

MaartenGr commented 3 weeks ago

Thanks for updating this and taking the time to merge these two commits. This logging is definitely much nicer now and should be more intuitive to users!

I think this solution is arguably one of the cleanest here. One last question though: could you add a test that showcases that it now works?

Also, I'm seeing the tests fail for both the linting as well as the code.

PipaFlores commented 3 weeks ago

What kind of test are you referring to?

Indeed, the tests are failing for linting; I forgot to run make format and make lint on my end. I'm not sure about the code test, though. It will probably be fixed after the linting/formatting.

When I get clarity on the kind of test you want, I can look into it and force-push everything into a single commit.

MaartenGr commented 3 weeks ago

The one that checks whether topic vectors/c-TF-IDF are calculated so that we can actually reduce the number of topics correctly. It might be that the testing pipeline doesn't actually reduce the number of topics, so we might need to increase the number of documents in order to create enough clusters to reduce.

PipaFlores commented 3 weeks ago

Sorry for the long text.

Reviewing the test scripts, I can't find where the tests check whether the model is correctly fitted or not. I will describe my understanding, but I am not familiar with the pytest framework, so please point it out if I missed something.

In the conftest.py file, the documents, embeddings, and models that are used in all tests are defined. For example, the following code fits and defines a base model to be used in later stages:

import pytest
from bertopic import BERTopic


@pytest.fixture(scope="session")
def base_topic_model(documents, document_embeddings, embedding_model):
    model = BERTopic(embedding_model=embedding_model, calculate_probabilities=True)
    model.umap_model.random_state = 42
    model.hdbscan_model.min_cluster_size = 3
    model.fit(documents, document_embeddings)
    return model

These models are then used to perform tests covering most of the model's class methods. As an example relevant to the current issue, the following is what tests the reduce_topics() method:

@pytest.mark.parametrize("reduced_topics", [2, 4, 10])
def test_topic_reduction(model, reduced_topics, documents, request):
    topic_model = copy.deepcopy(request.getfixturevalue(model))
    old_topics = copy.deepcopy(topic_model.topics_)
    old_freq = topic_model.get_topic_freq()

    topic_model.reduce_topics(documents, nr_topics=reduced_topics)

    new_freq = topic_model.get_topic_freq()

    if model != "online_topic_model":
        assert old_freq.Count.sum() == new_freq.Count.sum()
    assert len(old_freq.Topic.unique()) == len(old_freq)
    assert len(new_freq.Topic.unique()) == len(new_freq)
    assert len(topic_model.topics_) == len(old_topics)
    assert topic_model.topics_ != old_topics

However, it seems that the fit and/or fit_transform methods themselves are not being tested anywhere.

This helps us understand what you noticed before:

As shown, because no topic embeddings or c-TF-IDF representations were generated, the _reduce_topics function throws an error. Without those, how would we actually do similarity checks between topics to see which ones can be merged? But the strangest thing to me is that the tests seem to pass without any issues, even though they really shouldn't.

As the test is configured, topic reduction is tested only when it is called after the model is fitted (thus, there is an existing c-TF-IDF), and the error that you point out occurs only while fitting (with a defined number of topics, or "auto" reduction).

I am sure that this particular error doesn't happen anymore, as I've been testing mostly by fitting my local dataset across the different cases and edge cases (also verified through my other issue about improving the logger). However, there are no scripted tests to systematically verify that.

A direct solution would involve creating a new test for fit_transform and asserting that the c-TF-IDF is calculated when the model is set to reduce topics. A comprehensive solution involves creating tests that assert every main step of the fitting procedure (embeddings, dimensionality reduction, clustering, representation).
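A minimal sketch of such a direct test, reusing the fixture names from the conftest.py excerpt above (this is an illustration of the idea, not the test that was actually added in this PR; the frequency-rank assertion is an extra assumption):

import pytest
from bertopic import BERTopic


@pytest.mark.parametrize("nr_topics", [2, "auto"])
def test_fit_with_topic_reduction(documents, document_embeddings, embedding_model, nr_topics):
    """Fitting with nr_topics set should still compute the representations needed for reduction."""
    model = BERTopic(embedding_model=embedding_model, nr_topics=nr_topics)
    model.umap_model.random_state = 42
    model.hdbscan_model.min_cluster_size = 3
    model.fit(documents, document_embeddings)

    # The representations used by _reduce_topics must exist after fitting.
    assert model.c_tf_idf_ is not None
    assert model.topic_embeddings_ is not None

    # Edge-case check: topic ids should follow their frequency rank, so the
    # frequency table (sorted by count) lists topics 0, 1, 2, ... in order.
    freq = model.get_topic_freq()
    non_outliers = freq[freq.Topic != -1]
    assert non_outliers.Topic.tolist() == sorted(non_outliers.Topic.tolist())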

I hope this was clear enough. I can try to give it some thought another day. I can include a specific test for this issue in this PR, but a comprehensive test of fit_transform probably belongs in a different PR (but, again, I'm not familiar with shared repos, so you can suggest either way).

MaartenGr commented 2 weeks ago

As the test is configured, topic reduction is tested only when it is called after the model is fitted (thus, there is an existing c-TF-IDF), and the error that you point out occurs only while fitting (with a defined number of topics, or "auto" reduction).

Ah, this makes perfect sense. Thanks for taking the time to research this. It indeed seems I reduced the number of topics after fitting the model as a way to reduce wall time. A nasty side-effect is that nr_topics isn't properly tested here.

I hope this was clear enough. I can try to give it some thought another day. I can include a specific test for this issue in this PR, but a comprehensive test of fit_transform probably belongs in a different PR (but, again, I'm not familiar with shared repos, so you can suggest either way).

It definitely was, thanks! I agree, this is a bit out of scope for now, and since you tested this thoroughly, I think this should be all for now. I will check it with my code, but if everything else passes, I think it's time to merge this.

MaartenGr commented 2 weeks ago

It works for me! Thank you for working on this, I'll go ahead and merge this 😄