MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Zero-Shot #1982

Open sucduit opened 6 months ago

sucduit commented 6 months ago

I have a question about zero-shot. I used zero-shot BERTopic to do topic mining for my dissertation, and I need to explain the process in more detail. Are zero-shot classification and HDBSCAN clustering initiated concurrently, or does zero-shot classification precede HDBSCAN clustering? I asked GPT-4; at first it said to run HDBSCAN first and then use zero-shot to label the documents. Then I gave the flowchart to GPT-4, and it said to run zero-shot first and then HDBSCAN. After a few more questions, GPT-4 said it looked like "simultaneous processing paths": zero-shot and HDBSCAN as two separate paths. If you could provide a more detailed explanation of the process, I would appreciate it very much, since the committee may ask such questions. Thanks again.

MaartenGr commented 6 months ago

As a general tip: GPT-4, albeit an amazing LLM, is not necessarily the best tool for fact-based information, even if you supply it with the source material. As you noticed, there is a risk that GPT-4 gives a wrong answer without realizing it. When it comes to facts, I would advise always checking the source material first; it is important to be able to read the docs as well as the underlying code.

Having said that, you can find more about the technique in the documentation (https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html):

This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with the predefined labels using cosine similarity. If the similarity passes a user-defined threshold, the zero-shot topic is assigned to the document. If it does not, then that document, along with the other unassigned documents, is put through a regular BERTopic model.
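
To make the assignment step concrete, here is a minimal sketch of what "compare embeddings against the labels and apply a threshold" looks like. This is illustrative only, not BERTopic's actual internals; the model name, documents, labels, and threshold are placeholders:

```python
# Illustrative sketch of the zero-shot assignment step; not BERTopic's internals.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Transformers changed NLP.", "My cat sleeps all day."]
zeroshot_labels = ["machine learning", "pets"]
threshold = 0.80  # user-defined minimum cosine similarity

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(docs)
label_embeddings = model.encode(zeroshot_labels)

similarities = cosine_similarity(doc_embeddings, label_embeddings)
assigned, remaining = [], []
for doc, sims in zip(docs, similarities):
    best = sims.argmax()
    if sims[best] >= threshold:
        assigned.append((doc, zeroshot_labels[best]))  # gets the zero-shot topic
    else:
        remaining.append(doc)  # later clustered by the regular BERTopic pipeline (HDBSCAN)
```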

In other words, the zero-shot topics are assigned first and precede the HDBSCAN clustering. Then, both models are merged.
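
For completeness, this is roughly how the zero-shot flow is invoked per the linked documentation page; the topic list, embedding model, and threshold below are placeholder examples, and `docs` is assumed to be a list of strings:

```python
from bertopic import BERTopic

# Predefined topics to match against before any clustering happens.
zeroshot_topic_list = ["Clustering", "Topic Modeling", "Large Language Models"]

topic_model = BERTopic(
    embedding_model="thenlper/gte-small",
    min_topic_size=15,
    zeroshot_topic_list=zeroshot_topic_list,
    zeroshot_min_similarity=0.85,  # cosine-similarity threshold for zero-shot assignment
)
topics, probs = topic_model.fit_transform(docs)
```

Documents that never reach `zeroshot_min_similarity` are clustered as usual, and the resulting topics are merged with the zero-shot ones into a single model.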

sucduit commented 6 months ago

Thank you for explaining. That is a very helpful explanation.
