Closed adamwawrzynski closed 3 years ago
In preliminary tests of the solution, I found that the cosine similarity between the vectors of two long strings is typically higher than the cosine similarity between a short string and a long one. For that reason I decided to fix the n_gram_range; adding the calculation of single words also requires significantly more computational power.
However, seeing your question and realizing that MMR is likely to also select lower n_grams if supplied, I agree that it would be nice to add the entire range instead of a single value.
I'll look into it!
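For reference, the MMR behaviour mentioned above can be sketched in a few lines of plain Python. This is a generic illustration of Maximal Marginal Relevance, not KeyBERT's actual implementation; the function names `cosine` and `mmr` and the toy vectors in the usage note are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5):
    """Return the indices of the candidates picked by MMR (hypothetical helper)."""
    if not cand_vecs:
        return []
    # Relevance: similarity of each candidate to the document.
    relevance = [cosine(doc_vec, c) for c in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    while len(selected) < min(top_n, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]

        def score(i):
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected
```

With toy vectors where candidates 0 and 1 are near-duplicates and candidate 2 is less relevant but different, `mmr([1, 1, 0], [[1, 0.9, 0], [1, 0.8, 0], [0, 1, 0]], diversity=0.5)` skips the near-duplicate and picks candidates 0 and 2, while `diversity=0.0` ranks purely by relevance and picks 0 and 1. This is why a less similar, shorter candidate can still be selected once it is supplied.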
Hi Marteen
+1 for the suggestion of providing an `n_gram_range` instead of just a `keyphrase_length`.
Moreover, I think it would be a good idea to allow passing a custom CountVectorizer, as you did for the amazing BERTopic package!
Thx in advance and take care
Olivier Terrier
Added both `keyphrase_ngram_range` and a custom count vectorizer to KeyBERT. Update KeyBERT to 0.1.3 to use the changes.
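A rough sketch of how the two options might be used, assuming `extract_keywords` accepts `keyphrase_ngram_range` and `vectorizer` keyword arguments as described above. The wrapper names `extract_with_range` and `extract_with_vectorizer` are hypothetical and only there to keep the example tidy; check the KeyBERT documentation for the exact signature in your installed version.

```python
def extract_with_range(doc, ngram_range=(1, 3), top_n=5):
    """Extract keyphrases of 1 to 3 words via keyphrase_ngram_range."""
    # Imported lazily: keybert loads a sentence-embedding model on first use.
    from keybert import KeyBERT

    model = KeyBERT()
    return model.extract_keywords(
        doc,
        keyphrase_ngram_range=ngram_range,
        stop_words="english",
        top_n=top_n,
    )


def extract_with_vectorizer(doc, top_n=5):
    """Same idea, but candidate generation is delegated to a custom CountVectorizer."""
    from keybert import KeyBERT
    from sklearn.feature_extraction.text import CountVectorizer

    # The n-gram range and stop words are configured on the vectorizer itself.
    vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
    return KeyBERT().extract_keywords(doc, vectorizer=vectorizer, top_n=top_n)
```

When a custom vectorizer is passed, its own `ngram_range` is presumably what determines the candidates, so it is safest to set the range there rather than rely on `keyphrase_ngram_range` as well.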
I am using this repository to generate keywords from documents and meeting transcriptions. I found that it is sometimes better to accept keywords of varying n-gram length, e.g. `ngram_range=(1, 3)` rather than only 3-grams, because sometimes a single word is a good keyword and the whole phrase is not. Rather than creating my own modified repository, I would like to propose a modification to your codebase.
Instead of passing the parameter:
you could provide the parameter as follows:
for both methods `_extract_keywords_single_doc` and `_extract_keywords_multiple_docs`.
Let me know what you think about this.
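To make the difference concrete, here is a minimal pure-Python sketch of candidate generation with a fixed length versus a range. The helper `ngram_candidates` is hypothetical and uses naive whitespace tokenization; the library itself relies on scikit-learn's CountVectorizer for this step.

```python
def ngram_candidates(text, ngram_range=(1, 3)):
    """Return unique word n-grams whose length lies within ngram_range (inclusive)."""
    tokens = text.lower().split()  # naive tokenization, for illustration only
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates
```

For the text "weekly budget review meeting", `ngram_range=(3, 3)` yields only "weekly budget review" and "budget review meeting", whereas `ngram_range=(1, 3)` also offers single words such as "budget" and two-word phrases such as "review meeting" as keyword candidates.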