MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Representations from representation_models are generated twice when nr_topics="auto". #2144

Open MaartenGr opened 1 month ago

MaartenGr commented 1 month ago

Discussed in https://github.com/MaartenGr/BERTopic/discussions/2142

Originally posted by **siddharthtumre** September 10, 2024: Hi @MaartenGr, while using nr_topics="auto", I can see that the representations from representation_models are calculated both before and after reducing the number of topics. This increases the run time, as an LLM is prompted twice, i.e., before and after reducing topics. I believe generating representations only after reducing the topics should be sufficient.

That's indeed what is happening currently. I believe it should be straightforward to implement, so this might be a good first issue for those wanting to help out.

For now, what you could also do is run the model first without any representation models and then use .update_topics with your selected representation model (e.g., an LLM) to create the representations. Then the LLM would run only once and you would get the same results.
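
For reference, a minimal sketch of that workaround (KeyBERTInspired is only a stand-in here for whatever expensive representation model, e.g. an LLM-based one, you want to run a single time):

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Fit with topic reduction, but without the expensive representation model
topic_model = BERTopic(nr_topics="auto")
topics, probs = topic_model.fit_transform(docs)

# Fine-tune the already-reduced topics exactly once
topic_model.update_topics(docs, representation_model=KeyBERTInspired())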

pipa666 commented 1 month ago

Edit 2: I successfully ran the test now. I will check how to do the PR, maybe Monday next week.

Edit: I managed to run the tests, and they fail for edge cases, mostly when the number of topics found is already less than the nr_topics to reduce to. Looking into it.

Hello! This issue is quite important for me due to usage/cost tracking when using LLM APIs. I am not a CS professional though, just an educational researcher using quantitative methods, so I apologize for the lack of experience.

I will probably be coming back to this project, so I gave this a shot since it was flagged as a good first issue, and I think I solved it (at least for my purpose). I don't know how to run a comprehensive test though. I can continue with the PR if I get some guidance on this question, as I only tested with my own data and parameters.

Is 'make test' enough? I haven't tried it yet since it's mentioned that it takes time; I will probably do that tomorrow or next week. In the meantime, here is what I did:

I modified the topic extraction in fit_transform by adding an if not self.nr_topics: check and a new argument to _extract_topics:


            # Extract topics by calculating c-TF-IDF
            if not self.nr_topics:
                self._extract_topics(custom_documents, embeddings=embeddings)
                self._create_topic_vectors(documents=documents, embeddings=embeddings)
            else:
                self._extract_topics(custom_documents, embeddings=embeddings, calculate_representation=False)
                self._create_topic_vectors(documents=documents, embeddings=embeddings)
                # Reduce topics
                custom_documents = self._reduce_topics(custom_documents)

            # Save the top 3 most representative documents per topic
            self._save_representative_docs(custom_documents)
        else:
            # Extract topics by calculating c-TF-IDF
            if not self.nr_topics:
                self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
            else:
                self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_representation=False)
                # Reduce topics
                documents = self._reduce_topics(documents)

Then _extract_topics will call: self.topic_representations_ = self._extract_words_per_topic(words, documents, calculate_representation=calculate_representation, calculate_aspects=calculate_representation)

and then _extract_words_per_topic will only calculate the base representation:

        # Fine-tune the topic representations
        topics = base_topics.copy()
        if not self.representation_model or not calculate_representation:
            # Default representation: c_tf_idf + top_n_words
            topics = {label: values[: self.top_n_words] for label, values in topics.items()}
        elif calculate_representation and isinstance(self.representation_model, list):
            for tuner in self.representation_model:
                topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
        elif calculate_representation and isinstance(self.representation_model, BaseRepresentation):
            topics = self.representation_model.extract_topics(self, documents, c_tf_idf, topics)
        elif calculate_representation and isinstance(self.representation_model, dict):
            if self.representation_model.get("Main"):
                main_model = self.representation_model["Main"]
                if isinstance(main_model, BaseRepresentation):
                    topics = main_model.extract_topics(self, documents, c_tf_idf, topics)
                elif isinstance(main_model, list):
                    for tuner in main_model:
                        topics = tuner.extract_topics(self, documents, c_tf_idf, topics)
                else:
                    raise TypeError(f"unsupported type {type(main_model).__name__} for representation_model['Main']")
            else:
                # Default representation: c_tf_idf + top_n_words
                topics = {label: values[: self.top_n_words] for label, values in topics.items()}
        else:
            raise TypeError(f"unsupported type {type(self.representation_model).__name__} for representation_model")

Another approach would have involved changing the way _reduce_topics gets initial_nr_topics = len(self.get_topics()) (maybe by taking it from the output of the clustering step?).

Doing the latter may be simpler, but I didn't have the confidence or knowledge of the library to figure that out.
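
For what it's worth, a rough sketch of that alternative, assuming the documents DataFrame already carries the cluster assignments in its "Topic" column (as in the test snippet further below); initial_topic_count is a hypothetical helper, not an existing BERTopic function:

import pandas as pd

def initial_topic_count(documents: pd.DataFrame) -> int:
    # Topic size implied directly by the cluster assignments; whether the -1
    # outlier topic should be counted has to match len(self.get_topics()).
    return documents["Topic"].nunique()

docs = pd.DataFrame({"Document": ["a", "b", "c"], "Topic": [0, 1, -1]})
print(initial_topic_count(docs))  # 3, counting the outlier topic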

MaartenGr commented 1 month ago

@pipa666 Thank you for considering tackling this! If I'm not mistaken, I believe the main changes that need to be made are in _extract_words_per_topic and not much elsewhere. You would only need something to track whether the reduction of topics has already taken place or not. If it has, calculate everything; if not, only calculate the c-TF-IDF representation.

That way, I think you would only need to adjust _extract_words_per_topic and nothing else, right?
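
As a toy, self-contained illustration of that suggestion (the function and flag names are hypothetical, not BERTopic's internals): the expensive fine-tuning only runs once the reduction has already happened; otherwise only the cheap c-TF-IDF top-n words are kept.

def extract_words_per_topic(base_topics, representation_model=None, topics_reduced=True, top_n_words=10):
    # Before reduction (or without a representation model), keep the default
    # c-TF-IDF representation: just the top-n words per topic.
    if representation_model is None or not topics_reduced:
        return {label: words[:top_n_words] for label, words in base_topics.items()}
    # After reduction, run the (potentially expensive) fine-tuning exactly once.
    return representation_model(base_topics)

base = {0: ["cat", "dog", "pet"], 1: ["car", "engine", "wheel"]}
print(extract_words_per_topic(base, representation_model=None, topics_reduced=False))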

pipa666 commented 1 month ago

Wouldn't that affect other places that call _extract_topics or _extract_words_per_topic? That's why I ended up adding another argument, similar to calculate_aspects=True.

I agree that it looks messier now. I tried different approaches but never succeeded in keeping fit_transform() as concise as it was.

It was challenging because of the way the self.nr_topics > len(self.get_topics()) case is handled, as representations have to be calculated after a first iteration of _extract_topics() defines the topic size. However, if there is another way to get the topic size before calling _extract_topics(), then it becomes easier; I just didn't figure that one out. My LLM friend only suggested defining new auxiliary functions.

MaartenGr commented 1 month ago

Wouldn't that affect other places that call _extract_topics or _extract_words_per_topic? That's why I ended up adding another argument, similar to calculate_aspects=True.

It might indeed. I checked your code, and it surprises me that just using the calculate_aspects argument isn't sufficient. It seems to me that you would only need to go from this:

https://github.com/MaartenGr/BERTopic/blob/0b4265a849e8084c0069d3aba0e492755d9a4024/bertopic/_bertopic.py#L492-L493

to something like this:

# Extract topics by calculating c-TF-IDF
if self.nr_topics:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_aspects=False)
else:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)

pipa666 commented 1 month ago

But in that case, it would still calculate the "main" representation (which could be an LLM prompt).

As far as I understand, the calculate_aspects=True argument controls whether additional aspects are included or whether it's just the "main" representation (in the _extract_words_per_topic() function). If the "main" representation is defined as an LLM, then we would run into the same issue at https://github.com/MaartenGr/BERTopic/blob/0b4265a849e8084c0069d3aba0e492755d9a4024/bertopic/_bertopic.py#L4294-L4306

Another option, instead of adding an extra argument, is to extend calculate_aspects to also control the main representation.

Edit: I made it a bit cleaner in fit_transform:

            # Extract topics by calculating c-TF-IDF, reduce topics if needed, and get representations.
            self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_representation=not self.nr_topics)

            if self.nr_topics:
                if (isinstance(self.nr_topics, int) and self.nr_topics < len(self.get_topics())) or isinstance(self.nr_topics, str):
                    # Reduce topics
                    documents = self._reduce_topics(documents)
                else:
                    logger.info(f"Number of topics ({self.nr_topics}) is equal or higher than the clustered topics ({len(self.get_topics())}).")
                    documents = self._sort_mappings_by_frequency(documents)
                    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)

            # Save the top 3 most representative documents per topic
            self._save_representative_docs(documents)

MaartenGr commented 1 month ago

Ah, my apologies, I meant using calculate_representation instead:

# Extract topics by calculating c-TF-IDF
if self.nr_topics:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_representation=False)
else:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)

If we approach it like the above (which is the only change needed in fit_transform), then we would only need to adapt _extract_words_per_topic and _extract_topics.

I don't really understand why more changes than the above would be necessary.

pipa666 commented 1 month ago

I agree, and I will try. But there is an issue when dealing with the case of nr_topics >= initial_nr_topics: in that case the representations need to be re-calculated without any reduction. I am dealing with that exception in fit_transform() because, when I handle it inside _reduce_topics(), I get errors when running pytest (only for cases of nr_topics >= initial_nr_topics).

If I handle the exception in _reduce_topics() as:

        if isinstance(self.nr_topics, int):
            if self.nr_topics < initial_nr_topics:
                documents = self._reduce_to_n_topics(documents, use_ctfidf)
            else:
                logger.info(f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics({len(self.get_topics())}).")
                documents = self._sort_mappings_by_frequency(documents)
                self._extract_topics(documents, verbose=self.verbose)
                return documents
        elif isinstance(self.nr_topics, str):
            documents = self._auto_reduce_topics(documents, use_ctfidf)
        else:
            raise ValueError("nr_topics needs to be an int or 'auto'! ")

        logger.info(
            f"Topic reduction - Reduced number of topics from {initial_nr_topics} to {len(self.get_topic_freq())}"
        )
        return documents

then, when running pytest, the tests throw errors for the edge cases where nr_topics > initial_nr_topics. I have zero idea why this is the case, since the code should execute the same way as in the current version of my commit, and when testing with my own data it works fine.

Here is the error:

model = 'online_topic_model', documents = ['\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pe...e practical, and the experimenters were\nprobably sure that it was 5 milliseconds, not 4 or 6 either.\n\n\nSteve', ...] request = <FixtureRequest for <Function test_topic_reduction_edge_cases[online_topic_model]>>

@pytest.mark.parametrize(
    "model",
    [
        ("kmeans_pca_topic_model"),
        ("base_topic_model"),
        ("custom_topic_model"),
        ("merged_topic_model"),
        ("reduced_topic_model"),
        ("online_topic_model"),
    ],
)
def test_topic_reduction_edge_cases(model, documents, request):
    topic_model = copy.deepcopy(request.getfixturevalue(model))
    topic_model.nr_topics = 100
    nr_topics = 5
    topics = np.random.randint(-1, nr_topics - 1, len(documents))
    old_documents = pd.DataFrame({"Document": documents, "ID": range(len(documents)), "Topic": topics})
    topic_model._update_topic_size(old_documents)
    topic_model._extract_topics(old_documents)
    old_freq = topic_model.get_topic_freq()

    new_documents = topic_model._reduce_topics(old_documents)
    new_freq = topic_model.get_topic_freq()

    assert not set(old_documents.Topic).difference(set(new_documents.Topic))
    pd.testing.assert_frame_equal(old_documents, new_documents)
  pd.testing.assert_frame_equal(old_freq, new_freq)
tests\test_representation\test_representations.py:164:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = array([-1,  3,  1,  2,  0], dtype=int64), right = array([-1,  0,  1,  2,  3], dtype=int64), err_msg = None

    def _raise(left, right, err_msg) -> NoReturn:
        if err_msg is None:
            if left.shape != right.shape:
                raise_assert_detail(
                    obj, f"{obj} shapes are different", left.shape, right.shape
                )

            diff = 0
            for left_arr, right_arr in zip(left, right):
                # count up differences
                if not array_equivalent(left_arr, right_arr, strict_nan=strict_nan):
                    diff += 1

            diff = diff * 100.0 / left.size
            msg = f"{obj} values are different ({np.round(diff, 5)} %)"
>           raise_assert_detail(obj, msg, left, right, index_values=index_values)
E           AssertionError: DataFrame.iloc[:, 0] (column name="Topic") are different
E
E           DataFrame.iloc[:, 0] (column name="Topic") values are different (40.0 %)
E           [index]: [0, 1, 4, 2, 3]
E           [left]:  [-1, 3, 1, 2, 0]
E           [right]: [-1, 0, 1, 2, 3]

C:\Users\Localadmin_pabflore\miniconda3\envs\Bertopicdevelop\lib\site-packages\pandas\_testing\asserters.py:684: AssertionError

I can try to keep looking into this further during the week.

MaartenGr commented 1 month ago

I'm not sure if I understand correctly. Why are you adding that additional else statement here:

if isinstance(self.nr_topics, int):
    if self.nr_topics < initial_nr_topics:
        documents = self._reduce_to_n_topics(documents, use_ctfidf)
    else:
        logger.info(f"Topic reduction - Number of topics ({self.nr_topics}) is equal or higher than the clustered topics ({len(self.get_topics())}).")
        documents = self._sort_mappings_by_frequency(documents)
        self._extract_topics(documents, verbose=self.verbose)
        return documents
elif isinstance(self.nr_topics, str):
    documents = self._auto_reduce_topics(documents, use_ctfidf)
else:
    raise ValueError("nr_topics needs to be an int or 'auto'! ")

It shouldn't do anything if self.nr_topics >= initial_nr_topics, which is why there was initially no else statement.

https://github.com/MaartenGr/BERTopic/blob/0b4265a849e8084c0069d3aba0e492755d9a4024/bertopic/_bertopic.py#L4350-L4356

I believe there is no need to actually adapt _reduce_topics if you apply what I mentioned before in fit_transform, right?

pipa666 commented 1 month ago

If we don't have that adaptation, the following happens:

1) The user inputs a desired nr_topics.
2) _extract_topics(calculate_representation=False) calculates initial_nr_topics (the topic size).
3) If the model can be reduced, the pipeline fit_transform() -> _reduce_topics() -> _reduce_to_n_topics() / _auto_reduce_topics() calls _extract_topics(calculate_representation=True) at the end.

However, when self.nr_topics >= initial_nr_topics, the code still needs a call to _extract_topics(calculate_representation=True), or it won't calculate any representations.

Since the original code already calculated the representation at step 2), which is what gives rise to the current issue, this exception handling wasn't needed. But now that representations are not calculated before attempting to reduce, we need to add the _extract_topics(calculate_representation=True) call for this case, which wasn't considered before.

I hope this might help explain my logic.

MaartenGr commented 1 month ago

It does! I might have misunderstood the purpose of the calculate_representation parameter. I assumed that when it is False, it would still generate the default c-TF-IDF representation, as that one can then be used for reducing the number of topics, since you need either c-TF-IDF vectors or topic embeddings. As such, I would expect that when you only calculate the default representation (which is quite fast), there is no need to change code in other places. That way, the problem we are attempting to solve would be contained mostly in _extract_topics/_extract_words_per_topic.
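
To illustrate that reasoning with a toy example (this is not BERTopic's internal reduction code): merge candidates can be found from the c-TF-IDF vectors alone, so the default representation is all the reduction step needs.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)
c_tf_idf = rng.random((5, 100))    # one (toy) c-TF-IDF vector per topic
sim = cosine_similarity(c_tf_idf)  # topic-to-topic similarity
np.fill_diagonal(sim, 0)           # ignore self-similarity
i, j = np.unravel_index(sim.argmax(), sim.shape)
print(f"Most similar topic pair (merge candidate): {i} and {j}")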

pipa666 commented 1 month ago

I might have misunderstood the purpose of the calculate_representation parameter. I assumed that when it is False, it would still generate the default c-TF-IDF representation, as that one can then be used for reducing the number of topics, since you need either c-TF-IDF vectors or topic embeddings.

It does, but that wasn't true for calculate_aspects (maybe that's what you meant).

So do you find the solution agreeable? If so, I can submit a PR in the recommended format.

I am still trying to solve the problem described here, though, which I find perplexing.

MaartenGr commented 1 month ago

So do you find the solution agreeable? If so, I can submit a PR in the recommended format.

Yeah, assuming we simplify a bunch of the code that I'm seeing here: https://github.com/pipa666/BERTopic/commit/4ce0b1af74bfa991581879750ae054a83da7147d

I'm hesitant to add so much overhead for what is, in essence, a rather straightforward feature, especially if we could just do it by applying .update_topics after having calculated the initial base representations. That would only result in about 4 lines of code at most.

pipa666 commented 2 weeks ago

Ok, I came back to this and solved it as follows:

        else:
            # Extract topics by calculating c-TF-IDF, reduce topics if needed, and get representations.
            self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_representation=not self.nr_topics)
            if self.nr_topics:
                documents = self._reduce_topics(documents, embeddings=embeddings)

            # Save the top 3 most representative documents per topic
            self._save_representative_docs(documents)

This keeps the fit_transform() clean.

I added some logger messages to keep better track of the process and whether topics are being reduced or not.

I noted an issue that will appear in edge cases, which was related to my problem running the tests successfully (in particular pytest tests\test_representation\test_representations.py).

As it is in my commit, when self.nr_topics >= initial_nr_topics the algorithm extracts topics but won't call self._sort_mappings_by_frequency(documents). This will produce a list of topics with unsorted indexes. When I added the sorting, I ran into the error that I described above. It seems that this is already happening on the baseline, and I don't want to fiddle with the test units, since I am not familiar with how they work.

This is probably not a critical concern since it's an edge case, and topic information is still displayed sorted by frequency; it's just that the topic indexes won't match their frequency rank.

MaartenGr commented 1 week ago

@pipa666 Awesome, this already looks much cleaner! I think this should be ready for a pull request. There are a couple of questions/suggestions that I have, but I think we can continue that discussion in the PR.

As it is in my commit, when self.nr_topics >= initial_nr_topics the algorithm extracts topics but won't call self._sort_mappings_by_frequency(documents). This will produce a list of topics with unsorted indexes. When I added the sorting, I ran into the error that I described in https://github.com/MaartenGr/BERTopic/issues/2144#issuecomment-2352541831. It seems that this is already happening on the baseline, and I don't want to fiddle with the test units, since I am not familiar with how they work.

Yeah, users wanting more topics than the clustering gives back actually happens quite often, and it should indeed be captured. What about calling _extract_topics(calculate_representation=False) by default first when you are using nr_topics?

This is probably not a critical concern since it's an edge case, and topic information is still displayed sorted by frequency; it's just that the topic indexes won't match their frequency rank.

I think it will be, as I have unfortunately seen many users run into this edge case.