bthapa94 commented 1 year ago

Hello,

I am trying to print the top 10 key phrases from DatasetA['Description'], a column with about 4k text entries. However, when I print the key phrases I get a list of all 3- to 6-gram phrases, in no specific order. How do I ensure that only the top 10 are printed? Furthermore, how can I print only dissimilar phrases (diversity)? Thoughts?

from keybert import KeyBERT
doc = DatasetA['Description']
model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)
from keyphrase_vectorizers import KeyphraseCountVectorizer

kw_model.extract_keywords(docs=doc, vectorizer=KeyphraseCountVectorizer())

model.extract_keyphrases(doc, keyphrase_ngram_range=(3, 6), stop_words=None, use_mmr=True, top_n=10)

keyphrases = model.extract_keywords(doc, keyphrase_ngram_range=(3, 6), stop_words='english', use_maxsum=True, top_n=10)
for keyphrase in keyphrases:
    print(keyphrase)

MaartenGr commented 1 year ago

To get diverse output, set use_mmr=True together with diversity=0.5 or something higher. Furthermore, the model returns the top_n keyphrases only if the document contains at least top_n keyphrases. If not, fewer are returned.
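To make the effect of the diversity value concrete, here is a minimal pure-Python sketch of maximal marginal relevance (MMR) selection, the technique that use_mmr=True refers to. It is not KeyBERT's own code: the names cosine and mmr_select and the two-dimensional toy vectors are assumptions made up for illustration, standing in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(doc_vec, cand_vecs, candidates, top_n=10, diversity=0.5):
    """Pick up to top_n candidates, trading relevance against redundancy.

    Hypothetical helper: diversity=0 ranks purely by similarity to the
    document; values closer to 1 penalise candidates that resemble
    phrases that were already selected.
    """
    if not candidates:
        return []
    relevance = [cosine(doc_vec, v) for v in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(candidates)), key=lambda i: relevance[i])]
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        def score(i):
            # Redundancy: similarity to the closest already-selected phrase.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]


# Toy data (made up): two near-duplicate phrases and one different phrase.
doc_vec = [1.0, 1.0]
candidates = ["machine learning", "machine learning model", "data pipeline"]
cand_vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]

print(mmr_select(doc_vec, cand_vecs, candidates, top_n=2, diversity=0.0))
print(mmr_select(doc_vec, cand_vecs, candidates, top_n=2, diversity=0.7))
```

With diversity=0.0 the two near-duplicate phrases are picked; with diversity=0.7 the second slot goes to the dissimilar phrase instead. Note also that fewer than top_n items come back when there are fewer candidates.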

bthapa94 commented 1 year ago

Please see the output below. It is printing almost everything, unsorted. Thoughts?

MaartenGr commented 1 year ago

Based on your warning, did you make sure that you are using the most recent version of KeyBERT? The most current version is v0.7.
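A quick way to check which version is installed is to query the package metadata from Python. The sketch below uses only the standard library; the helper name installed_version is an assumption for illustration, and "keybert" is assumed to be the distribution name used at install time.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if the package is not installed.
print(installed_version("keybert"))
```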

bthapa94 commented 1 year ago

So, if you use .tolist(), it prints the top 10 for every row, whereas ' '.join(...) yields the top 10 for the entire document.

text = ' '.join(DatasetA['Description']) vs. DatasetA['Description'].tolist()
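The difference between the two inputs can be sketched as follows. Passing a list of strings to extract_keywords gives one list of keyphrases per document, while passing a single string gives one list overall. The helper names prepare_inputs and top_keyphrases are hypothetical, a plain Python list stands in for the DatasetA['Description'] column, and the KeyBERT import is deferred into the function so that the input preparation can be tried without the model installed.

```python
def prepare_inputs(descriptions):
    """Build both input shapes from an iterable of description values.

    Non-string entries (for example missing values) are dropped.
    """
    rows = [d for d in descriptions if isinstance(d, str)]
    per_row = rows                 # list of str: keyphrases per row
    whole_text = " ".join(rows)    # single str: keyphrases for everything
    return per_row, whole_text


def top_keyphrases(docs, top_n=10, diversity=0.5):
    """Hypothetical wrapper around KeyBERT.extract_keywords.

    Assumes the keybert package is installed; the default model is used.
    """
    from keybert import KeyBERT  # imported lazily on purpose

    kw_model = KeyBERT()
    return kw_model.extract_keywords(
        docs,
        keyphrase_ngram_range=(3, 6),
        stop_words="english",
        use_mmr=True,
        diversity=diversity,
        top_n=top_n,
    )


# Stand-in for DatasetA['Description'] (made-up rows).
sample = ["first ticket description", None, "second ticket description"]
per_row, whole_text = prepare_inputs(sample)
print(len(per_row), repr(whole_text))
```

Calling top_keyphrases(whole_text) would then give a single top-10 list, while top_keyphrases(per_row) would give one list per row.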

Another question: how do I get the bottom 10? Do you recommend setting diversity to 1, or close to 1?

MaartenGr commented 1 year ago

You cannot get the bottom 10, as only the top keywords are returned. With diversity=1 there is a chance that lower-ranked keywords move up, but there is no guarantee that you will get all of the bottom 10. Most likely, you will still get many high-scoring keywords, since that is the typical use case for keyword extraction. If you want the bottom 10, those are typically stop words like "the", "and", "I", etc.

MaartenGr / KeyBERT

KeyPhrases Not Printing top 10. #167
