MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Differences between KeyBERT and BERTopic #60

Closed shoegazerstella closed 3 years ago

shoegazerstella commented 3 years ago

Hi, thanks for sharing these projects, super neat work!

I just wanted to ask what the main differences between KeyBERT and BERTopic are. The two approaches look similar, and it seems that one part of BERTopic could perhaps be applied to recreate KeyBERT exactly.

In which case should I use one instead of the other in your opinion? Thanks!

MaartenGr commented 3 years ago

Although the approaches may look similar, their implementations are actually quite different. In practice, you will not be able to recreate KeyBERT with BERTopic and vice versa. To make this clear, I'll go through the models individually and then compare them.

BERTopic

The procedure of BERTopic is demonstrated below:

[Image: diagram of the BERTopic pipeline]

Here, you can see that there are three distinct steps:

  1. Embedding documents
  2. Clustering documents
  3. Creating a topic representation

The main output of BERTopic is a set of words per topic. Thus, multiple documents have the same topic representation.
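The three steps above can be sketched with a toy, stdlib-only pipeline. This is a hypothetical illustration, not BERTopic's actual code: step 2 (clustering document embeddings with UMAP + HDBSCAN) is assumed already done, and only the class-based TF-IDF (c-TF-IDF) idea behind step 3 is mimicked.

```python
import math
from collections import Counter

# Toy corpus already "clustered" into two topics (in BERTopic, this grouping
# would come from embedding the documents and clustering the embeddings).
clusters = {
    0: ["the cat sat on the mat", "a cat chased a mouse"],
    1: ["stocks fell on market news", "the market rallied on earnings"],
}

def c_tf_idf(clusters, top_n=3):
    """Class-based TF-IDF sketch: treat each cluster as one big document,
    then score words by in-cluster frequency weighted against how common
    the word is across all clusters."""
    class_words = {c: " ".join(docs).split() for c, docs in clusters.items()}
    total_freq = Counter(w for ws in class_words.values() for w in ws)
    avg_len = sum(len(ws) for ws in class_words.values()) / len(class_words)
    topics = {}
    for c, words in class_words.items():
        tf = Counter(words)
        scores = {
            w: (n / len(words)) * math.log(1 + avg_len / total_freq[w])
            for w, n in tf.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        topics[c] = [w for w, _ in ranked[:top_n]]
    return topics

print(c_tf_idf(clusters))
```

The key point is that the scoring happens per cluster, so every document in a cluster shares the same topic words.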

KeyBERT

KeyBERT can roughly be divided into the following steps:

  1. Embedding documents
  2. Creating candidate keywords
  3. Calculating best keywords through either MMR, Max Sum Similarity, or Cosine Similarity

The main output of KeyBERT is a set of words per document. Thus, each document is expected to have different keywords.
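The per-document mechanics can be sketched in a stdlib-only toy as well. Note the heavy simplifications: real KeyBERT embeds text with sentence-transformer models, while this sketch substitutes character-trigram counts just to make the embed/candidates/rank pipeline runnable, and the stopword list is made up for the example.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "on", "of", "and", "is", "in"}  # toy list

def embed(text):
    """Toy 'embedding': character-trigram counts. A stand-in for the
    sentence-transformer vectors KeyBERT actually uses."""
    t = f" {text.lower()} "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def extract_keywords(doc, top_n=3):
    doc_vec = embed(doc)                                   # 1. embed the document
    candidates = {w for w in doc.lower().split()
                  if w not in STOPWORDS}                   # 2. candidate keywords
    ranked = sorted(candidates,                            # 3. rank candidates by
                    key=lambda w: cosine(embed(w), doc_vec),  # similarity to the doc
                    reverse=True)
    return ranked[:top_n]

print(extract_keywords("supervised machine learning trains models on labeled data"))
```

Because the candidates and the ranking are computed per document, two documents generally end up with different keywords, which is exactly the contrast with BERTopic's shared topic words.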

BERTopic vs. KeyBERT

The main similarities between the two methods are that they embed documents and can leverage MMR (although in both models its use is optional). To me, that is essentially where the similarities end. The main difference is everything that happens between embedding documents and, in some cases, leveraging MMR. For example, BERTopic aims to cluster documents and create a broad representation of multiple documents, whereas KeyBERT does not. Moreover, when it comes down to algorithmic implementation, the UMAP/HDBSCAN/c-TF-IDF route is quite different from generating candidate keywords and comparing them to the individual documents.

When to use BERTopic vs. KeyBERT

As you might have already noticed from the descriptions above, both the purpose and output of the methods differ. BERTopic, and in that sense most topic modeling techniques, are meant to explore the data to create an understanding of the perhaps millions of documents that you have collected. KeyBERT, in contrast, is not able to do this as it creates a completely different set of words per document. An example of using KeyBERT, and in that sense most keyword extraction algorithms, is automatically creating relevant keywords for content (blogs, articles, etc.) that businesses post on their website.

P.S. I kinda went overboard with this explanation, but seeing as several people liked your question, it seemed to be important to others as well. If I wasn't clear or if you have any follow-up questions, don't hesitate to ask!

shoegazerstella commented 3 years ago

Hello @MaartenGr and thanks a lot for the clear clarification!

voxmenthe commented 1 year ago

Great explanation, really appreciated being able to find this. Thanks!