MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

"get_representative_docs" #765

Closed srashtchi closed 2 years ago

srashtchi commented 2 years ago

Hi Maarten

I have been trying to get the representative (sample) documents from a topic model. Below is the code up to the point where the model is fitted with .fit_transform:

from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer

# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, embedding_model=sentence_model)
topics, probs = topic_model.fit_transform(docs, embeddings)

When I then call the .get_representative_docs method to see the representative docs, it raises the following error:

topic_model.get_representative_docs(1)

Traceback (most recent call last):
  File "/home/shabnam/anaconda3/envs/rapids-22.08/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3378, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-119-dc9ece6a5a3f>", line 1, in <module>
    topic_model.get_representative_docs(1)
  File "/home/shabnam/anaconda3/envs/rapids-22.08/lib/python3.9/site-packages/bertopic/_bertopic.py", line 1187, in get_representative_docs
    return self.representative_docs_[topic]
TypeError: 'NoneType' object is not subscriptable

I then thought I might need to call ._save_representative_docs first, as below, but that also raises an error:

topic_model._save_representative_docs(pd.DataFrame({'Topic':docs}))

Traceback (most recent call last):
  File "/home/shabnam/anaconda3/envs/rapids-22.08/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3378, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-120-7bc2076d3c21>", line 1, in <module>
    topic_model._save_representative_docs(pd.DataFrame({'Topic':docs}))
  File "/home/shabnam/anaconda3/envs/rapids-22.08/lib/python3.9/site-packages/bertopic/_bertopic.py", line 2411, in _save_representative_docs
    clusters = sorted(condensed_tree._select_clusters())
  File "/home/shabnam/anaconda3/envs/rapids-22.08/lib/python3.9/site-packages/hdbscan/plots.py", line 264, in _select_clusters
    raise ValueError('Invalid Cluster Selection Method: %s\n'
ValueError: Invalid Cluster Selection Method: %s
Should be one of: "eom", "leaf"

Thanks in advance, Shabnam

srashtchi commented 2 years ago

It seems the .get_representative_docs method raises an error when using the GPU-accelerated models, whereas with base BERTopic it returns sample docs. Could you tell me how to get the same functionality in the example above when using the GPU models?

MaartenGr commented 2 years ago

The .get_representative_docs method only works with the CPU version of HDBSCAN. Whenever you pass a cluster model from a different package, the representative documents are not calculated. HDBSCAN extracts them in a specific way that does not generalize to all clustering algorithms. As for the GPU version of HDBSCAN, it is still under discussion what will and will not be supported in future versions of BERTopic, and which functionalities are actually possible to include at this stage of the development of both packages.
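
In the meantime, one generic workaround is to pick, for each topic, the documents whose embeddings lie closest to that topic's centroid. The sketch below is not the HDBSCAN-based procedure BERTopic uses on CPU, only an approximation that works with any clustering model. The helper name representative_docs_by_centroid and its top_n parameter are made up for illustration, and it assumes docs, embeddings and topics are aligned as in the example above, with -1 marking outliers:

```python
import numpy as np


def representative_docs_by_centroid(docs, embeddings, topics, top_n=3):
    """Map each topic to the top_n documents nearest its embedding centroid.

    Hypothetical helper, not part of BERTopic. Outliers (topic -1) are skipped.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)

    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    result = {}
    for topic in np.unique(topics):
        if topic == -1:
            continue
        idx = np.flatnonzero(topics == topic)
        centroid = unit[idx].mean(axis=0)
        # Ranking by dot product with the centroid matches ranking by cosine
        # similarity, because the centroid's norm is constant within a topic.
        sims = unit[idx] @ centroid
        best = idx[np.argsort(-sims)[:top_n]]
        result[int(topic)] = [docs[i] for i in best]
    return result
```

With the variables from the example above, representative_docs_by_centroid(docs, embeddings, topics)[1] would return three candidate documents for topic 1. These will generally differ from what the CPU version of .get_representative_docs returns, since the selection criterion is different.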

srashtchi commented 2 years ago

Thanks Maarten for the clarification, it was really helpful.