Closed MogwaiMomo closed 1 year ago
Issue #811 might interest you: "BERTopic: get_representative_docs(...) | Option to get all docs mapped to a topic beyond the default randomly selected 3".
You can work with `topics_`, or extend the class to tweak the 3 to 5 (for your own use). From what I gathered, this is not the preferred route.
I'll still advocate for an 'arg' option, just as you are asking for, although in my use case I needed ALL documents per topic.
PS: I notice that `representative_docs_` is changing to a dict `{}` in v0.13. I also notice what appears to be a decoupling of `representative_docs_` from HDBSCAN (allowing cosine similarity for other clustering models).
@MogwaiMomo
Thank you for your kind words! Unfortunately, that is currently not possible. I have been quite hesitant about adding parameters to BERTopic in order to keep the package relatively straightforward to use. Just out of curiosity, would something like the following suffice for you:
```python
import pandas as pd

# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})

# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
```
That way, you have all documents mapped to a topic and you can go through a sample of those yourself. Or would you want the top n most representative documents extracted instead?
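As a minimal sketch of that mapping (with made-up docs and topics standing in for a fitted model's output), all documents per topic can then be collected with a groupby:

```python
import pandas as pd

# Placeholder stand-ins for your corpus and the topics returned by
# `.fit_transform` (or found in `topic_model.topics_` after `.fit`):
docs = ["doc a", "doc b", "doc c", "doc d"]
topics = [0, 1, 0, -1]  # -1 is BERTopic's outlier topic

df = pd.DataFrame({"Document": docs, "Topic": topics})

# All documents mapped to each topic, as a {topic: [documents]} dict:
docs_per_topic = df.groupby("Topic")["Document"].apply(list).to_dict()
```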
@semmyk-research
I'll still advocate for an 'arg' option, just as you are asking for, although in my use case I needed ALL documents per topic.
Extending the class to include all documents per topic is not possible, since the most representative documents in a topic are calculated rather than taken as a random sample of all documents. In other words, those are the documents that are most similar to the entire cluster. You can find more about that calculation here.
PS: I notice that `representative_docs_` is changing to a dict `{}` in v0.13.
Yes and no, the `self.representative_docs_` variable has always been a dictionary if the representative documents were calculated. In order to extend some support and to prevent some specific bugs, it now has to be initialized as such.
I also notice what appears to be a decoupling of `representative_docs_` from HDBSCAN (allowing cosine similarity for other clustering models).
Yes. When you use HDBSCAN in BERTopic, we use some of the internal structure of HDBSCAN to extract the documents that best represent a cluster. However, this does not work with other clustering algorithms, since they do not have the same internal workings. Instead, with v0.13, it is now possible to get the most representative documents per topic by calculating the c-TF-IDF representation on a random subset of all documents in a topic and comparing those with the topic's c-TF-IDF representation. The most similar documents per topic are then extracted.
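That v0.13-style procedure can be illustrated with a toy sketch: the placeholder documents below stand in for a sampled cluster, and plain term counts stand in for the real c-TF-IDF weighting (an assumption made here for brevity):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents already assigned to a single topic (placeholder data):
topic_docs = ["apple banana fruit", "banana smoothie fruit",
              "apple pie recipe", "fruit salad apple banana"]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(topic_docs)

# Stand-in for the topic's c-TF-IDF vector: here simply the summed term
# counts of the whole cluster (the real c-TF-IDF also applies a
# class-based IDF weighting).
topic_vector = np.asarray(doc_vectors.sum(axis=0))

# Rank documents by cosine similarity to the topic representation
sims = cosine_similarity(doc_vectors, topic_vector).ravel()
ranking = sims.argsort()[::-1]
most_representative = [topic_docs[i] for i in ranking[:2]]
```

The highest-scoring documents are those whose word distribution is closest to the cluster as a whole, which is why they tend to look more coherent than a raw probability ranking.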
Once again @MaartenGr, thanks for BERTopic and the responses you take time out to provide. The more I dig in, the more 'honey' I find.
While going through BERTopic as I was responding to @MogwaiMomo, I noticed the NOTE attached to `def get_representative_docs(self, topic: int = None) -> List[str]:`.
That was a masterpiece as I reflected on what I gained during the discourse in issue #811, including how to get around what I needed then: ALL documents per topic.
Yes, we all have and will have feature requests. By the way, what is a great piece of software if it doesn't have feature requests? The more we use and love it, the more we request!
However, for a library to remain good and fluid at what it's great at, it should be as LEAN and robust as it can be, with little fluff. Hence, in principle, I SUPPORT @MaartenGr's hesitation to add parameters to BERTopic in order to keep the package relatively straightforward to use. I can say without hesitation that BERTopic has assisted me in aspects where I'd spent countless days with Gensim.
Or would you want the top n most representative documents extracted instead?
@MaartenGr Yes, this is what I was hoping for :)
My code already outputs a kind of "master" dataframe of all documents with the schema
| topics | docs | probs |
where documents are sorted first by ascending topic number and then by descending probability (again, thank you for making it so easy to generate & access these values!). However, I'd noticed that the documents extracted by `get_representative_docs()` seemed more coherent and similar to each other than the highest-probability docs per topic in my master dataframe. (This might just be my own bias, though.)
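That sort order can be sketched with placeholder data (the column names below mirror the schema above):

```python
import pandas as pd

# Placeholder "master" dataframe with the schema described above
df = pd.DataFrame({"topics": [1, 0, 0, 1],
                   "docs": ["d1", "d2", "d3", "d4"],
                   "probs": [0.9, 0.4, 0.8, 0.7]})

# Ascending topic number, then descending probability within each topic
df_sorted = df.sort_values(["topics", "probs"], ascending=[True, False])
```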
Could you expand a little on the difference between how documents are scored/ranked on their representativeness vs. the probabilities they're assigned by `.fit_transform()`?
(Your 2020 Medium essay mentions using c-TF-IDF scores to get the most important/representative words per topic ... am I way off to assume that scoring the 'representativeness' of a given document could be done simply by summing the c-TF-IDF scores of each word in that document?)
but I'd noticed that the documents extracted by `get_representative_docs()` seemed more coherent and similar to each other than the highest-probability docs per topic in my master dataframe. (This might just be my own bias, though.)
The documents extracted by `.get_representative_docs` are found through some underlying structure in HDBSCAN, by detecting which points can be considered exemplars. This process currently does not work if you use any clustering algorithm other than HDBSCAN. Although there is no parameter to get the top n documents per topic, you can adjust the core algorithm as follows:
```python
import hdbscan
import numpy as np
import pandas as pd

from bertopic import BERTopic


class BERTopicAdjusted(BERTopic):
    def _save_representative_docs(self, documents: pd.DataFrame):
        """Save the most representative docs (3) per topic

        The most representative docs are extracted by taking
        the exemplars from the HDBSCAN-generated clusters.

        Full instructions can be found here:
        https://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html

        Arguments:
            documents: Dataframe with documents and their corresponding IDs
        """
        # Prepare the condensed tree and the leaf clusters beneath a given cluster
        condensed_tree = self.hdbscan_model.condensed_tree_
        raw_tree = condensed_tree._raw_tree
        clusters = sorted(condensed_tree._select_clusters())
        cluster_tree = raw_tree[raw_tree['child_size'] > 1]

        # Find the points with the maximum lambda value in each leaf
        representative_docs = {}
        for topic in documents['Topic'].unique():
            if topic != -1:
                leaves = hdbscan.plots._recurse_leaf_dfs(cluster_tree, clusters[topic])
                result = np.array([])
                for leaf in leaves:
                    max_lambda = raw_tree['lambda_val'][raw_tree['parent'] == leaf].max()
                    points = raw_tree['child'][(raw_tree['parent'] == leaf) &
                                               (raw_tree['lambda_val'] == max_lambda)]
                    result = np.hstack((result, points))

                # Below, we get the top 3 documents per topic. If you want more
                # documents per topic, simply increase the value `3`:
                representative_docs[topic] = list(np.random.choice(result, 3, replace=False).astype(int))

        # Convert indices to documents
        self.representative_docs_ = {topic: [documents.iloc[doc_id].Document for doc_id in doc_ids]
                                     for topic, doc_ids in representative_docs.items()}
```
Could you expand a little on the difference between how documents are scored/ranked on their representativeness vs. the probabilities they're assigned by `.fit_transform()`?
Representativeness and membership to a cluster are different things, and as such they are likely to produce different results if used for the same purpose. Cluster membership is not necessarily a metric for representation. You can find more about how HDBSCAN handles that here.
In the v0.14 release of BERTopic, all representative documents are extracted in the same way regardless of whether you are using HDBSCAN or another clustering algorithm. A random subset of 500 documents is sampled for each cluster after which we use c-TF-IDF to score those documents. The resulting values are compared with their topic's c-TF-IDF values to rank the documents based on their closeness to a topic.
In v0.14, you can do the following to extract a specific number of representative documents:
```python
import pandas as pd

# Prepare your documents to be used in a dataframe
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topic_model.topics_})

# Extract the top 50 representative documents
repr_docs, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topic_representations_,
    nr_repr_docs=50)
```
Do note, though, that accessing private functions will always be at risk of changes in future releases, and it is typically only advised if you version-control BERTopic and its dependencies.
(Your 2020 Medium essay mentions using c-TF-IDF scores to get the most important/representative words per topic ... am I way off to assume that scoring the 'representativeness' of a given document could be done simply by summing the c-TF-IDF scores of each word in that document?)
There is currently a method for doing something similar in the PR here, but what it does is calculate the c-TF-IDF representation for each document in a topic and compare that, through cosine similarity, with the topic's c-TF-IDF representation. This is what was implemented in v0.13 and has been further pursued in v0.14.
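For illustration, the summing idea from the question can be sketched with made-up weights standing in for a topic's c-TF-IDF row; note that a raw sum favors longer documents, unlike a cosine-similarity comparison, which normalizes for length:

```python
# Placeholder per-word topic weights (stand-in for one topic's c-TF-IDF row):
topic_weights = {"apple": 0.8, "banana": 0.6, "pie": 0.1}

docs = ["apple banana", "apple pie", "banana"]

# The proposed scoring: sum the topic's per-word weights for the
# words each document contains (words outside the topic score 0).
def doc_score(doc):
    return sum(topic_weights.get(word, 0.0) for word in doc.split())

scores = [doc_score(d) for d in docs]
```

Because longer documents accumulate more terms, this raw sum rewards length as well as relevance, which is one reason comparing full document vectors via cosine similarity tends to be more robust.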
Thank you so much for this comprehensive reply! Will dig into the code & links you've provided :)
Hello Maarten!
Just a quick note that running the `BERTopicAdjusted(BERTopic)` class listed above with the latest version of bertopic (v0.14) appears to create mismatching topic numbers, i.e., the topic numbers from `.get_representative_docs()` end up different from those in `.get_topic_info()`.
Also, switching back to plain `BERTopic()` without the adjusted class eliminates this mismatch.
@MogwaiMomo That is correct! In the v0.14 release, the way representative documents are generated was updated. It now produces a fixed number of representative documents regardless of whether topics were merged/combined. Instead, you can now use the following:
```python
import pandas as pd

# Prepare your documents to be used in a dataframe
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topic_model.topics_})

# Extract the top 50 representative documents
repr_docs, _, _ = topic_model._extract_representative_docs(
    c_tf_idf=topic_model.c_tf_idf_,
    documents=documents,
    topics=topic_model.topic_representations_,
    nr_repr_docs=50)
```
Do note, though, that accessing private functions will always be at risk of changes in future releases, and it is typically only advised if you version-control BERTopic and its dependencies.
I'll make sure to update the code above to indicate for which versions they work.
Discussed in https://github.com/MaartenGr/BERTopic/discussions/847