MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Ability to change # of documents pulled per topic by .get_representative_docs()? #848

Closed MogwaiMomo closed 1 year ago

MogwaiMomo commented 1 year ago

Discussed in https://github.com/MaartenGr/BERTopic/discussions/847

Originally posted by **MogwaiMomo** November 18, 2022 Hello! Is there currently any way to adjust the number of documents pulled up by the .get_representative_docs method? As far as I can tell, the default is to return 3 documents per topic. For me, it would be incredibly valuable to be able to customize this number so that you could pull up 5 or 10 representative documents per topic. Thank you in advance, and just a note that I cannot believe BERTopic exists as it is SO awesome and so thoughtfully designed :)
semmyk-research commented 1 year ago

Issue #811 might interest you || BERTopic: get_representative_docs(...) | Option to get all docs mapped to a topic beyond the default randomly selected 3

You can work with topics_ or extend the class to tweak the 3 to 5 (for your own use). From what I gathered, this is not the preferred route. I'll still advocate for 'arg' option, just as you are asking for. Although in my use-case, I needed ALL documents per topic.

PS: I notice that _representative_docs is changing to a dict {} in v0.13. I also notice what appears to be a decoupling of _representative_docs from HDBSCAN (allowing cosine_similarity for other cases).

MaartenGr commented 1 year ago

@MogwaiMomo

Thank you for your kind words! Unfortunately, that is currently not possible. I have been quite hesitant to add parameters to BERTopic in order to keep the package relatively straightforward to use. Just out of curiosity, would something like the following suffice for you:

import pandas as pd

# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})

# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})

That way, you have all documents mapped to a topic and you can go through a sample of those yourself. Or would you want the top n most representative documents extracted instead?
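
For example, a minimal sketch of going through a sample of documents from one topic with such a dataframe (the topic number 5 and sample size 10 below are only illustrative):

import pandas as pd

df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})

# Inspect a random sample of up to 10 documents assigned to topic 5 (illustrative values)
topic_docs = df[df.Topic == 5]
print(topic_docs.sample(n=min(10, len(topic_docs)), random_state=42).Document.tolist())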

@semmyk-research

I'll still advocate for 'arg' option, just as you are asking for. Although in my use-case, I needed ALL documents per topic.

Extending the class to include all documents per topic is not possible, since the most representative documents in a topic are calculated and not taken as a direct random sample from all documents. In other words, those are the documents that are most similar to the entire cluster. You can find more about that calculation here.

PS: I notice that _representative_docs is changing to a dict {} in v0.13.

Yes and no, the self.representative_docs_ variable has always been a dictionary if the representative documents were calculated. In order to extend some support and to prevent some specific bugs, it now has to be initialized as such.

I also notice what appears to be a decoupling of _representative_docs from HDBSCAN (allowing cosine_similarity for other cases).

Yes. When you use HDBSCAN in BERTopic, we use some of HDBSCAN's internal structure to extract the documents that are most representative of a cluster. However, this does not work with other clustering algorithms, since they do not have the same internal workings. Instead, with v0.13, it is now possible to get the most representative documents per topic by calculating the c-TF-IDF representation of a random subset of all documents in a topic and comparing those with the topic's c-TF-IDF representation. The most similar documents per topic are then extracted.

semmyk-research commented 1 year ago

Once again @MaartenGr, thanks for BERTopic and the responses you take time out to provide. The more I dig in, the more 'honey I ooze out'. While going through bertopic when responding to @MogwaiMomo, I noticed the NOTE on def get_representative_docs(self, topic: int = None) -> List[str]:
That was a masterpiece as I reflected on what I gained during my discourse in issue #811, including how to get around what I needed then ... ALL documents per topic.

Yes, we all have and will have feature requests. By the way, what is a great piece of software if it doesn't have feature requests? The more we use and love it, the more we request!

However, for a library (software) to remain good and fluid at what it's great at, it should be as LEAN and robust as it can be, with little fluff. Hence, in principle, I SUPPORT @MaartenGr's hesitation to add parameters to BERTopic in order to keep the package relatively straightforward to use. I can say without hesitation that BERTopic has assisted me in some aspects where I'd spent countless days with Gensim.

MogwaiMomo commented 1 year ago

Or would you want the top n most representative documents extracted instead?

@MaartenGr Yes, this is what I was hoping for :)

My code already outputs a kind of "master" dataframe of all documents with the schema

| topics | docs | probs |

where documents are sorted first by ascending topic # and then by descending prob (again, thank you for making it so easy to generate & access these values!), but I'd noticed that the documents extracted by the 'get_representative_docs()' function seemed to be more coherent and similar to each other than the docs with the highest probs per topic that I'd see in my aforementioned master dataframe. (This might be just my own bias though)
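
(Roughly, that master dataframe is built like the following sketch; it assumes topics and probs come straight from .fit_transform.)

import pandas as pd

topics, probs = topic_model.fit_transform(docs)

# One row per document, sorted by ascending topic and descending prob
master_df = (pd.DataFrame({"topics": topics, "docs": docs, "probs": probs})
               .sort_values(["topics", "probs"], ascending=[True, False]))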

Could you expand a little on the difference between how documents are scored/ranked on their representativeness vs. the probs they're assigned as a result of .fit_transform()?

(Your 2020 Medium essay mentions using c-TF-IDF scores to get the most important/representative words per topic ... am I way off to assume that scoring the 'representativeness' of a given document could be done simply by summing up the c-TF-IDF scores of each word in said document?)

MaartenGr commented 1 year ago

but I'd noticed that the documents extracted by the 'get_representative_docs()' function seemed to be more coherent and similar to each other than the docs with the highest probs per topic that I'd see in my aforementioned master dataframe. (This might be just my own bias though)

v0.13

The documents extracted by .get_representative_docs are extracted through some underlying structure in HDBSCAN by detecting which points can be considered exemplars. This process currently does not work if you use any clustering algorithm other than HDBSCAN. Although there is no parameter to get the top n documents per topic, you can adjust the core algorithm as follows:

import hdbscan
import numpy as np
import pandas as pd

from bertopic import BERTopic

class BERTopicAdjusted(BERTopic):
    def _save_representative_docs(self, documents: pd.DataFrame):
        """ Save the most representative docs (3) per topic
        The most representative docs are extracted by taking
        the exemplars from the HDBSCAN-generated clusters.
        Full instructions can be found here:
            https://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html
        Arguments:
            documents: Dataframe with documents and their corresponding IDs
        """
        # Prepare the condensed tree and leaf clusters beneath a given cluster
        condensed_tree = self.hdbscan_model.condensed_tree_
        raw_tree = condensed_tree._raw_tree
        clusters = sorted(condensed_tree._select_clusters())
        cluster_tree = raw_tree[raw_tree['child_size'] > 1]

        #  Find the points with maximum lambda value in each leaf
        representative_docs = {}
        for topic in documents['Topic'].unique():
            if topic != -1:
                leaves = hdbscan.plots._recurse_leaf_dfs(cluster_tree, clusters[topic])

                result = np.array([])
                for leaf in leaves:
                    max_lambda = raw_tree['lambda_val'][raw_tree['parent'] == leaf].max()
                    points = raw_tree['child'][(raw_tree['parent'] == leaf) & (raw_tree['lambda_val'] == max_lambda)]
                    result = np.hstack((result, points))

                # Below, we get the top 3 documents per topic. If you want more documents per topic,
                # simply increase the value of `3` (min() guards against topics with fewer documents):
                representative_docs[topic] = list(np.random.choice(result, min(3, len(result)), replace=False).astype(int))

        # Convert indices to documents
        self.representative_docs_ = {topic: [documents.iloc[doc_id].Document for doc_id in doc_ids]
                                     for topic, doc_ids in
                                     representative_docs.items()}
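
If it helps, a minimal sketch of how the subclass could then be used, assuming docs is your list of documents:

# The subclass is used exactly like the regular BERTopic model
topic_model = BERTopicAdjusted()
topics, probs = topic_model.fit_transform(docs)

# The overridden method fills representative_docs_ during fitting with your chosen number of docs
topic_model.get_representative_docs(topic=0)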

Could you expand a little on the difference between how documents are scored/ranked on their representativeness vs. the probs they're assigned as a result of .fit_transform()?

Representativeness and membership to a cluster are different things and as such are likely to produce different results if they are used for the same purpose. Cluster membership is not necessarily a metric for representation. You can find more about how HDBSCAN handles that here.

v0.14+

In the v0.14 release of BERTopic, all representative documents are extracted in the same way regardless of whether you are using HDBSCAN or another clustering algorithm. A random subset of 500 documents is sampled for each cluster after which we use c-TF-IDF to score those documents. The resulting values are compared with their topic's c-TF-IDF values to rank the documents based on their closeness to a topic.

In v0.14, you can do the following to extract a specific number of representative documents:

import pandas as pd

# Prepare your documents to be used in a dataframe
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topic_model.topics_})

# Extract the top 50 representative documents
repr_docs, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topic_representations_,
    nr_repr_docs=50)

Do note, though, that accessing private functions will always be at risk of changes in future releases, and it is typically only advisable if you pin the versions of BERTopic and its dependencies.

(Your 2020 Medium essay mentions using c-TF-IDF scores to get the most important/representative words per topic ... am I way off to assume that scoring the 'representativeness' of a given document could be done simply by summing up the c-TF-IDF scores of each word in said document?)

There is currently a method for doing something similar in the PR here, but what is done there is calculating the c-TF-IDF representation for each document in a topic and comparing that, through cosine similarity, with the topic's c-TF-IDF representation. This is what was implemented in v0.13 and has been further pursued in v0.14.
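
To make the difference with simply summing word scores concrete, here is a rough sketch of that comparison. It is illustrative rather than the exact library internals; it assumes the fitted model exposes the vectorizer_model, ctfidf_model, c_tf_idf_, and topics_ attributes, and the topic number and row offset below are assumptions for the example.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

topic = 0  # illustrative topic number
topic_docs = [doc for doc, t in zip(docs, topic_model.topics_) if t == topic]

# c-TF-IDF representation of each document in the topic
bow = topic_model.vectorizer_model.transform(topic_docs)
doc_ctfidf = topic_model.ctfidf_model.transform(bow)

# The topic's own c-TF-IDF row; with an outlier topic (-1) present, topic t is typically row t + 1
topic_ctfidf = topic_model.c_tf_idf_[topic + 1]

# Rank documents by cosine similarity to the topic representation instead of summing word scores
sims = cosine_similarity(doc_ctfidf, topic_ctfidf).flatten()
most_representative = [topic_docs[i] for i in np.argsort(sims)[::-1][:10]]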

MogwaiMomo commented 1 year ago

Thank you so much for this comprehensive reply! Will dig into the code & links you've provided :)

MogwaiMomo commented 1 year ago

Hello Maarten!

Just a quick note that running the class BERTopicAdjusted(BERTopic) listed above with the latest version of bertopic (v0.14) appears to create mismatching topic numbers — i.e., the topic numbers from .get_representative_docs() end up being different from those in .get_topic_info().

Also, switching back to just using BERTopic() without the adjusted class eliminates this mismatch.

MaartenGr commented 1 year ago

@MogwaiMomo That is correct! In the v0.14 release, the way the representative documents are generated was updated. It now produces a fixed number of representative documents regardless of whether topics were merged/combined. Instead, you can now use the following:

import pandas as pd

# Prepare your documents to be used in a dataframe
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topic_model.topics_})

# Extract the top 50 representative documents
repr_docs, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topic_representations_,
    nr_repr_docs=50)

Do note, though, that accessing private functions will always be at risk of changes in future releases, and it is typically only advisable if you pin the versions of BERTopic and its dependencies.

I'll make sure to update the code snippets above to indicate which versions they work with.