Closed Unco3892 closed 3 years ago
Starting with your first question, there are no plans to make top_n a percentage of the input size. How many keywords are useful to return depends very much on the use case, so it is up to the user to decide what is important. For you, that might be a percentage of the input. This brings me to the second question.
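Since the percentage has to be computed on the user's side, a minimal sketch of that step might look like the following. The helper name `top_n_from_fraction` and its parameters are hypothetical and not part of KeyBERT; it only converts a fraction of the available candidates into an absolute number.

```python
def top_n_from_fraction(n_candidates, fraction):
    """Turn a fraction of the candidate count into an absolute top_n.

    Hypothetical helper, not part of KeyBERT.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if n_candidates <= 0:
        return 0
    # Round to the nearest integer, but always ask for at least one keyword.
    return max(1, round(n_candidates * fraction))
```

The result could then be passed as top_n to extract_keywords, with n_candidates taken from the number of n-grams that can actually be formed from the text rather than from a raw word count.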
It seems that you assume that the individual tokens (1-grams) can be combined into 2-grams or 3-grams without any limitations. In practice, this is not the case if there are separators between words, such as commas and full stops. The same n-gram range should also be passed to CountVectorizer, as shown below. Then, you get the list of logically possible n-grams.
To give you an example, below I create a list of keywords with KeyBERT and a list of possible n-grams through CountVectorizer. The result is that they have the same size. In other words, your total_len() procedure does not work since you assume all 1-gram tokens can be combined regardless of the position in the sentence.
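from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

model = KeyBERT()
nr_possible_words = len(CountVectorizer(ngram_range=(1, 3)).fit([text]).get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
keywords = model.extract_keywords(text, top_n=500, keyphrase_ngram_range=(1, 3), stop_words=None)  # same n-gram range as above
assert nr_possible_words == len(keywords)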
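For readers without scikit-learn at hand, the gap between a naive count and the number of distinct candidates can be illustrated in plain Python. This is only a rough sketch under stated assumptions: the token regex is modeled on CountVectorizer's default token pattern, the naive formula is a guess at what a count based on combining every word looks like, and `distinct_ngrams` and `naive_ngram_count` are hypothetical helpers that belong to neither library. In this sketch the shortfall comes from repeated n-grams being counted once and single-character tokens being dropped.

```python
import re

# Assumption: a token is a run of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def distinct_ngrams(text, max_n):
    """Return the set of distinct 1..max_n-grams found in the text."""
    tokens = TOKEN_RE.findall(text.lower())
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def naive_ngram_count(text, max_n):
    """Count n-gram positions as if every whitespace-separated word combined freely."""
    n_words = len(text.split())
    return sum(max(n_words - n + 1, 0) for n in range(1, max_n + 1))


sample = "A cat sat. A cat ran, a dog sat."
print(naive_ngram_count(sample, 3))     # 24
print(len(distinct_ngrams(sample, 3)))  # 13
```

Requesting more keywords than there are distinct candidates cannot return more than the smaller number, which is why a percentage should be taken of the distinct count.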
Thanks a lot for your answer. Indeed, I did not take into account that a word that appears before a comma or some other separator does not form a bi-gram with the word after it. This is precisely what I was trying to find. Cheers again.
I think anyone else who wants the number of keywords as a percentage in the future could also find this useful.
Hi,
Thank you for providing this great package. I had a question regarding the top_n argument and it's loosely related to some other issues on varying input lengths. Is it possible to set this number as a percentage of the input length rather than an absolute number? I have already attempted that with some input text but for some reason the maximum number of words the model extracts is always lower than the total number of keyword combinations. Here is an example of running this on an abstract from a journal: