MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Setting `top_n` as a percentage of the input length #73

Closed: Unco3892 closed this issue 3 years ago

Unco3892 commented 3 years ago

Hi,

Thank you for providing this great package. I have a question about the `top_n` argument; it is loosely related to some other issues about varying input lengths. Is it possible to set this number as a percentage of the input length rather than as an absolute number? I have already tried that with some input text, but for some reason the maximum number of keywords the model extracts is always lower than the total number of keyword combinations. Here is an example of running this on an abstract from a journal:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from keybert import KeyBERT
import re

max_gram = 3
text = 'Two analytical methods, high performance liquid chromatography and spectrofluorimetry, were studied to determine the content of coumarins (umbelliferone, scopoletin and 4-methylumbelliferone, in distilled beverages). Hydro-alcohol standard solutions of known coumarin concentration and commercial white rum samples were used to compare them. After determining the coumarin content with both methods and performing a statistical analysis of the results obtained, the conclusion was reached that although both techniques are valid for this purpose, the spectrofluorimetric method is more accurate than high performance liquid chromatography.'

# ---
# manual calculation for tri-grams
# to my understanding, `CountVectorizer` seems to be removing parentheses and
# dots (and, for some reason, also the digits)
manual_tokens = re.sub(r'\.|[()]|\d+', '', text)
manual_tokens = re.findall(r'\w+', manual_tokens)
# manual_tokens = [word for word in manual_tokens if word not in ENGLISH_STOP_WORDS]
len_unigram_manual = len(manual_tokens)

# same calculation using `sklearn`
cv = CountVectorizer().fit([text])
tokenizer = cv.build_tokenizer()
sklearn_tokens = tokenizer(text)
len_unigram_sklearn = len(sklearn_tokens)

# `sklearn` removes 'a' when tokenizing but my manual calculation
# does not; however, this is not so important
# [word for word in manual_tokens if word not in sklearn_tokens]

# then, using either one, we can count the possible n-grams up to tri-grams
def total_len(max_grams, ref_unigram):
  final_count = []
  for i in range(max_grams):
    # a sequence of `ref_unigram` tokens has `ref_unigram - n + 1` n-grams (here n = i + 1)
    final_count.append(ref_unigram - i)
  return sum(final_count)

count_manual = total_len(max_gram, len_unigram_manual)
count_sklearn = total_len(max_gram, len_unigram_sklearn)

# ---
# same calculation with `KeyBERT`
model = KeyBERT(model="LaBSE")
keywords = model.extract_keywords(
  text,
  top_n=500,  # setting an arbitrarily large number
  stop_words=None,
  # stop_words='english',  # you can also try it with stop words
  keyphrase_ngram_range=(1, max_gram),
)

count_keybert = len(keywords)

# ---
print(count_manual)
print(count_sklearn)
print(count_keybert)
MaartenGr commented 3 years ago

Starting with your first question: there are no plans to make `top_n` a percentage of the input size. How many of the `top_n` keywords are actually useful depends very much on the use case, so it is up to the user to decide what is important. For you, that might be a percentage of the input (see the sketch at the end of this comment). This brings me to your second question.

It seems that you assume that the individual tokens (1-grams) can be combined into 2-grams or 3-grams without any limitations. In practice, this is not the case when there are separators between words, such as commas and full stops. The same n-gram range should also be passed to the `CountVectorizer`, as shown below. Then you get the list of logically possible n-grams.

To give you an example, below I create a list of keywords with KeyBERT and a list of possible n-grams through `CountVectorizer`. The result is that they have the same size. In other words, your `total_len()` procedure does not work, since it assumes that all 1-gram tokens can be combined regardless of their position in the sentence.

nr_possible_words = len(CountVectorizer(ngram_range=(1, 3)).fit([text]).get_feature_names())
keywords = model.extract_keywords(text, top_n=500, keyphrase_ngram_range=(1, max_gram), stop_words=None)

assert nr_possible_words == len(keywords)
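
Coming back to your first question: if you do want the number of keywords to scale with the input, a rough workaround (just a sketch, not a built-in KeyBERT feature; the `ratio` value and the variable names below are only illustrative) is to derive `top_n` from this candidate count yourself:

# reusing `nr_possible_words`, `model`, `text`, and `max_gram` from the snippets above
ratio = 0.2  # keep roughly 20% of the candidate phrases; an arbitrary choice
top_n_dynamic = max(1, int(ratio * nr_possible_words))

keywords = model.extract_keywords(
  text,
  top_n=top_n_dynamic,
  keyphrase_ngram_range=(1, max_gram),
  stop_words=None,
)

Alternatively, since `extract_keywords` returns a ranked list of (keyword, score) tuples, you could request a large `top_n` once and slice off whatever fraction you need afterwards.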
Unco3892 commented 3 years ago

Thanks a lot for your answer. Indeed, I did not take into account that a word which appears before a comma or some other separator does not form a bi-gram with the word that follows. This is precisely what I was trying to find out. Cheers again.

I think anyone else who wants to set the number of keywords as a percentage of the input could also find this useful in the future.