andrewtavis / kwx

BERT, LDA, and TFIDF based keyword extraction in Python
BSD 3-Clause "New" or "Revised" License

Keyword extraction for BERT does not work for small numbers of samples #26

Closed: AbhiPawar5 closed this issue 3 years ago

AbhiPawar5 commented 3 years ago

Hi Andrew, I tried the keyword extraction API for just 5 samples in a dataframe.

```python
bert_kws = extract_kws(
    method="BERT",  # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=3,
)
```

This returns `ValueError: n_samples=5 should be >= n_clusters=10` for the given `batch_size`. I wonder why that's happening? Thanks!

andrewtavis commented 3 years ago

Hi Abhishek, and thanks for pointing this out!

I tested it and was able to reproduce. The issue is that you're trying to derive `num_topics = 10` from a corpus that only has five entries (BERT is honestly a bit of overkill for this, and you should be fine with just LDA). If you set `num_topics = 5`, then it will work :)
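As a quick illustrative workaround on very small corpora (this is just a sketch using the variable names from your snippet, not the fix that went into kwx), the topic count can be capped at the corpus size before the call above:

```python
# Illustrative workaround (not kwx's internal fix): make sure the number of
# topics never exceeds the number of documents, since the downstream k-means
# step cannot form more clusters than there are samples.
num_topics = min(num_topics, len(corpus_no_ngrams))
```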

`n_clusters` is an argument to scikit-learn's k-means clustering that's used for fitting models, and it inherits its value from `kwx.extract_kws` via the `num_topics` argument. I hadn't realized that this inheritance was happening and would cause this problem. I've just added an assertion to `kwx.topic_model` via #28 that will raise the following `ValueError` if such a situation arises in the future:

`num_topics` cannot be larger than the size of `text_corpus` - consider lowering the desired number of topics
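For context, the original error comes straight from scikit-learn's k-means fit when more clusters are requested than there are samples. A minimal standalone reproduction, with random vectors standing in for the sentence-transformer embeddings kwx would compute:

```python
import numpy as np
from sklearn.cluster import KMeans

# Five "documents" represented by random embedding vectors,
# standing in for the sentence embeddings kwx would produce.
embeddings = np.random.rand(5, 384)

# Asking for 10 clusters from 5 samples raises:
# ValueError: n_samples=5 should be >= n_clusters=10.
KMeans(n_clusters=10, n_init=10).fit(embeddings)
```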

Let me know if you think anything else is needed on this, and thanks again for pointing this out!

AbhiPawar5 commented 3 years ago

Hi Andrew, yes, with a mere 5 samples BERT is definitely overkill!

I was just playing with all the methods/algorithms available in this amazing package and came across this. Thanks for clarifying. Thank you and take care.

andrewtavis commented 3 years ago

Take care as well!