N-gram range as parameter in extract_keywords method

adamwawrzynski commented 3 years ago

I am using this repository to generate keywords from documents and meeting transcriptions. I found that it is sometimes better to accept keywords of varying n-gram length, e.g. ngram_range=(1, 3) rather than only 3-grams, because sometimes a single word is a good keyword and the whole phrase is not. Rather than maintaining my own modified copy of the repository, I would like to propose a modification to your codebase.

Instead of passing the parameter:

    def _extract_keywords_single_doc(self,
                                     doc: str,
                                     keyphrase_length: int = 1,
                                     ...

You could provide the parameter as follows:

    def _extract_keywords_single_doc(self,
                                     doc: str,
                                     keyphrase_ngram_range: Tuple[int, int] = (1,1),
                                     ...

for both methods _extract_keywords_single_doc and _extract_keywords_multiple_docs.
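To make the difference concrete, here is a minimal plain-Python sketch of what a fixed length versus a range means for the candidate phrases. It is not KeyBERT's code: the helper `ngram_candidates`, the whitespace tokenizer, and the sample sentence are all assumptions made for illustration.

```python
from typing import List, Tuple


def ngram_candidates(doc: str, ngram_range: Tuple[int, int] = (1, 1)) -> List[str]:
    """Return the unique word n-grams of `doc` for every n in the inclusive range.

    Hypothetical helper for illustration only; it uses a naive
    lowercase/whitespace tokenizer and no stop-word filtering.
    """
    min_n, max_n = ngram_range
    if min_n < 1 or max_n < min_n:
        raise ValueError("ngram_range must satisfy 1 <= min_n <= max_n")

    tokens = doc.lower().split()
    seen = set()
    candidates = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the document.
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase not in seen:
                seen.add(phrase)
                candidates.append(phrase)
    return candidates


doc = "supervised learning maps inputs to outputs"

# Fixed length: only 3-word phrases are considered.
fixed = ngram_candidates(doc, (3, 3))

# Range: single words and 2-word phrases become candidates as well.
ranged = ngram_candidates(doc, (1, 3))
```

With the range, a single word such as "learning" is among the candidates, while with a fixed length of 3 it can never be returned. The price is a larger candidate set to embed and compare.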

Let me know what you think about this.

MaartenGr commented 3 years ago

In preliminary testing I found that the cosine similarity between the vectors of two long strings is typically higher than the cosine similarity between a short string and a long string. For that reason, and because adding single words to the calculation requires significantly more computational power, I decided to fix the n_gram_range.

However, seeing your question, and realizing that MMR is likely to select shorter n-grams as well if they are supplied, I agree that it would be nice to accept the entire range instead of a single value.

I'll look into it!
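As a rough illustration of why MMR (maximal marginal relevance) can surface a shorter n-gram, here is a plain-Python sketch. It is not KeyBERT's implementation: the function names are hypothetical and the 2-D vectors are made up to mimic the observation above (long phrases score close to the document and close to each other), so they are not real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(doc_vec: Sequence[float],
        cand_vecs: Dict[str, Sequence[float]],
        top_n: int = 2,
        diversity: float = 0.5) -> List[str]:
    """Greedy MMR: trade relevance to the document against redundancy
    with the candidates that were already selected."""
    remaining = dict(cand_vecs)
    selected: List[str] = []
    while remaining and len(selected) < top_n:
        def score(name: str) -> float:
            relevance = cosine(remaining[name], doc_vec)
            redundancy = max(
                (cosine(remaining[name], cand_vecs[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy vectors (assumed, not produced by any model).
doc_vec = [1.0, 1.0]
cand_vecs = {
    "supervised learning algorithm": [1.0, 0.9],
    "learning algorithm maps": [1.0, 0.8],
    "learning": [0.3, 1.0],
}

by_relevance = mmr(doc_vec, cand_vecs, top_n=2, diversity=0.0)
with_diversity = mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5)
```

With `diversity=0.0` the two near-duplicate long phrases win on relevance alone. With `diversity=0.5` the second pick becomes the single word, because it overlaps less with the phrase that was already chosen.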

oterrier commented 3 years ago

Hi Maarten, +1 for the suggestion of providing an n_gram_range instead of just a keyphrase_length. Moreover, I think it would be a good idea to allow passing a custom CountVectorizer, as you did for the amazing BERTopic package! Thanks in advance and take care.

Olivier Terrier

MaartenGr commented 3 years ago

Added both keyphrase_ngram_range and support for a custom CountVectorizer to KeyBERT. Update KeyBERT to 0.1.3 to use the changes.

MaartenGr / KeyBERT

N-gram range as parameter in extract_keywords method #13