Closed AbhiPawar5 closed 3 years ago
Hi Abhishek, and thanks for pointing this out!
I tested it and was able to reproduce it. The issue is that you're trying to derive `num_topics = 10` from a corpus that only has five entries (BERT is honestly a bit overkill for this, and you should be fine with just LDA). If you set `num_topics = 5`, then it will work :)
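As a rough sketch of that workaround, the requested topic count can be capped at the corpus size before it is passed on as `num_topics`. The helper name `cap_num_topics` and the placeholder corpus are made up for illustration and are not part of kwx:

```python
def cap_num_topics(num_topics, text_corpus):
    """Return num_topics limited to the number of entries in text_corpus.

    Hypothetical helper for illustration only; kwx does not provide it.
    """
    return min(num_topics, len(text_corpus))


# Placeholder five-entry corpus (not the data from this issue).
corpus = ["doc one", "doc two", "doc three", "doc four", "doc five"]

# Asking for 10 topics on 5 entries gets reduced to the corpus size.
num_topics = cap_num_topics(10, corpus)
print(num_topics)
```

The capped value could then be passed as `num_topics=num_topics` in the `extract_kws` call.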
`n_clusters` is an argument from scikit-learn's k-means clustering that's used for fitting models, and it inherits its value from `kwx.extract_kws` via the `num_topics` argument. I hadn't realized that this inheritance was happening and would cause this problem. I just added an assertion to `kwx.topic_model` via #28 that will raise the following `ValueError` if such a situation arises in the future:
`num_topics` cannot be larger than the size of `text_corpus` - consider lowering the desired number of topics
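For anyone curious what a check like this could look like, here is a minimal sketch. The function name `check_num_topics` is hypothetical, and the actual change in #28 may be written differently; only the error message is taken from above:

```python
def check_num_topics(num_topics, text_corpus):
    """Raise a ValueError if more topics are requested than there are entries.

    Hypothetical stand-in for the check described above, not kwx's own code.
    """
    if num_topics > len(text_corpus):
        raise ValueError(
            "`num_topics` cannot be larger than the size of `text_corpus`"
            " - consider lowering the desired number of topics"
        )


# Placeholder five-entry corpus (not the data from this issue).
corpus = ["a", "b", "c", "d", "e"]

try:
    check_num_topics(10, corpus)  # 10 topics from 5 entries: rejected
except ValueError as err:
    print(err)

check_num_topics(5, corpus)  # 5 topics from 5 entries: passes silently
```

Checking up front like this means the problem surfaces in terms of `num_topics` and `text_corpus` rather than as a k-means error about `n_clusters`.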
Let me know if you think anything else is needed on this, and thanks again for pointing this out!
Hi Andrew, yes, for a mere 5 samples BERT is OVERKILL!
I was just playing with all the methods/algos available in this amazing package and found this. Thanks for clarifying it. Thank you and take care.
Take care as well!
Hi Andrew, I tried the keyword extraction API for just 5 samples in a dataframe.

    bert_kws = extract_kws(
        method="BERT",  # "BERT", "LDA", "TFIDF", "frequency"
        bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
        text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
        input_language=input_language,
        output_language=None,  # allows the output to be translated
        num_keywords=num_keywords,
        num_topics=num_topics,
        corpuses_to_compare=None,  # for TFIDF
        ignore_words=ignore_words,
        prompt_remove_words=True,  # check words with user
        show_progress_bar=True,
        batch_size=3,
    )

This returns `ValueError: n_samples=5 should be >= n_clusters=10 for batch_size`. I wonder why that's happening? Thanks!