MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.09k stars 758 forks

Llama 2: Representation Model #1713

Open Keamww2021 opened 10 months ago

Keamww2021 commented 10 months ago

The documentation says:

By default, four of the most representative documents will be passed to [DOCUMENTS]. These documents are selected by calculating their similarity (through c-TF-IDF representations) with the main c-TF-IDF representation of the topics. The four best matching documents per topic are selected. To increase the number of documents passed to [DOCUMENTS], we can use the nr_docs parameter, which is accessible in all LLMs on this page.

I am using Llama 2 to generate a label for each cluster, but I don't want to pass only the four most representative documents; I want to pass the top 10 or 20, or maybe all of the documents.

I tried to use the nr_docs parameter, but when I check the number of representative documents, it shows 3.

llama2 = TextGeneration(generator, nr_docs=20, prompt=prompt)

How can we ensure that Llama 2 provides an accurate and appropriate name if we don't have all of the cluster's documents?

Keamww2021 commented 10 months ago

I have a question: how many documents should I expect to see when I print the representative documents?

representative_docs_

MaartenGr commented 9 months ago

I tried to use the nr_docs parameter, but when I check the number of representative documents, it shows 3.

That is because the documents selected through nr_docs are only passed to Llama 2; they are not saved within BERTopic. By default, only the 4 most representative documents are saved. If you want to check which ones are passed to Llama 2, you can inspect llama2.prompts_.
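To make the distinction concrete, here is a minimal pure-Python sketch of the idea. It is not BERTopic's implementation: the toy two-dimensional vectors stand in for c-TF-IDF representations, and the helper names (cosine, top_documents, fill_prompt) and the "- " bullet format are assumptions made for illustration. It only shows how a short saved list and a longer list placed into [DOCUMENTS] can come from the same ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_documents(doc_vectors, topic_vector, docs, n):
    """Return the n documents most similar to the topic vector."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: cosine(doc_vectors[i], topic_vector),
        reverse=True,
    )
    return [docs[i] for i in ranked[:n]]


def fill_prompt(prompt, docs):
    """Replace the [DOCUMENTS] tag with one bulleted line per document."""
    bullet_list = "".join(f"- {doc}\n" for doc in docs)
    return prompt.replace("[DOCUMENTS]", bullet_list)


# Toy data: hypothetical stand-ins for documents and their c-TF-IDF vectors.
docs = [f"document {i}" for i in range(6)]
doc_vectors = [[1.0, float(i)] for i in range(6)]
topic_vector = [1.0, 0.0]

# A short list kept for inspection versus a longer list sent to the LLM.
saved = top_documents(doc_vectors, topic_vector, docs, n=3)
sent = top_documents(doc_vectors, topic_vector, docs, n=5)
prompt = fill_prompt("Documents:\n[DOCUMENTS]Label:", sent)
```

Here saved holds three documents while the filled prompt contains five, which mirrors why the number of saved representative documents says nothing about how many documents the LLM actually received.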

Keamww2021 commented 9 months ago

Thank you for your response.

How can I pass 20 documents per topic to Llama 2, so that those 20 documents are used to generate a label?

MaartenGr commented 9 months ago

Setting nr_docs=20 in TextGeneration will pass 20 representative documents per topic to Llama 2.
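For anyone who wants to verify this, the following is a rough sketch rather than an official recipe. It assumes generator is an already constructed Hugging Face text-generation pipeline and prompt is a string containing the [DOCUMENTS] tag, as in the question above. The helper names build_topic_model and count_prompt_documents are hypothetical, and the counting helper assumes each document is inserted into the prompt on its own line with a leading "- ", which may differ between BERTopic versions.

```python
def build_topic_model(generator, prompt, nr_docs=20):
    """Create a BERTopic model whose LLM prompt receives nr_docs documents per topic."""
    # Imported lazily so the counting helper below works without BERTopic installed.
    from bertopic import BERTopic
    from bertopic.representation import TextGeneration

    llama2 = TextGeneration(generator, prompt=prompt, nr_docs=nr_docs)
    topic_model = BERTopic(representation_model=llama2)
    return topic_model, llama2


def count_prompt_documents(prompt_text):
    """Count lines that look like inserted documents (assumed '- ' prefix)."""
    return sum(1 for line in prompt_text.splitlines() if line.startswith("- "))
```

After calling topic_model.fit_transform(docs), applying count_prompt_documents to each entry of llama2.prompts_ should give a count per topic. Documents that themselves contain lines starting with "- " would inflate the count, so treat it as a quick sanity check only.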