csurfer / rake-nltk

Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
https://csurfer.github.io/rake-nltk
MIT License

Bug - phrase_list shouldn't be a set at the beginning #22

Closed: LetianFeng closed this issue 3 years ago

LetianFeng commented 6 years ago

This implementation ignores phrases that occur more than once. For example: text = 'Red apples, are good in flavour. Where are my red apples? Apples!'

According to the paper, we should get the phrase list ['red apples', 'good', 'flavour', 'red apples', 'apples'], repeats included, and the word weights in the table below.
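As a rough illustration of how such a list arises, the sketch below splits the example text at stopwords and punctuation and keeps every occurrence of a phrase. The hand-picked `STOPWORDS` set and the `candidate_phrases` helper are assumptions made for this example only; the library itself relies on NLTK for tokenization.

```python
import re

# Hypothetical, hand-picked stopword set that covers only this example.
STOPWORDS = {"are", "in", "where", "my"}


def candidate_phrases(text):
    """Return candidate phrases in order of appearance, keeping repeats."""
    phrases = []
    current = []
    # Tokens are either runs of word characters or single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token in STOPWORDS or not token.isalnum():
            # A stopword or punctuation mark ends the current phrase.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


text = "Red apples, are good in flavour. Where are my red apples? Apples!"
print(candidate_phrases(text))
```

Because the result is a list rather than a set, 'red apples' appears twice, which is what the frequency and degree counts depend on.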

| word | good | flavour | apples | red |
| --- | --- | --- | --- | --- |
| degree | 1 | 1 | 5 | 4 |
| frequency | 1 | 1 | 3 | 2 |
| ratio | 1 | 1 | 1.67 | 2 |
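These numbers come from counting, for every word, how often it occurs (frequency) and how many words share a phrase with it, itself included (degree). A minimal sketch of that bookkeeping, assuming the phrase list with repeats; `word_stats` is a name made up for this example:

```python
from collections import Counter


def word_stats(phrases):
    """Return {word: (degree, frequency, ratio)} for a list of phrases."""
    degree = Counter()
    frequency = Counter()
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            frequency[word] += 1
            # Each occurrence co-occurs with every word in its phrase,
            # itself included, so it adds the phrase length to the degree.
            degree[word] += len(words)
    return {w: (degree[w], frequency[w], degree[w] / frequency[w]) for w in frequency}


phrases = ["red apples", "good", "flavour", "red apples", "apples"]
for word, (deg, freq, ratio) in word_stats(phrases).items():
    print(f"{word}: degree={deg} frequency={freq} ratio={ratio:.2f}")
```

A phrase's score is then the sum of the ratios of its words, so 'red apples' scores 2 + 1.67 = 3.67.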

So the correct ranked phrases should be:

(3.67, 'red apples')
(1.67, 'apples')
(1.0, 'good')
(1.0, 'flavour')

However, in the current implementation, the extracted phrase list is: ['red apples', 'good', 'flavour', 'apples']

The second 'red apples' is dropped, so the degree and frequency of 'red' and 'apples' are undercounted and the ranked phrases get wrong scores:

(3.5, 'red apples')
(1.5, 'apples')
(1.0, 'good')
(1.0, 'flavour')

The fix is simple: change extract_keywords_from_sentences and _generate_phrases as shown below, so that _generate_phrases returns a list and only _build_ranklist receives the deduplicated set:

    def extract_keywords_from_sentences(self, sentences):
        """Method to extract keywords from the list of sentences provided.

        :param sentences: Text to extract keywords from, provided as a list
                          of strings, where each string is a sentence.
        """
        phrase_list = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurance_graph(phrase_list)
        # Rank each distinct phrase once; the counts above use every occurrence.
        self._build_ranklist(set(phrase_list))

    def _generate_phrases(self, sentences):
        """Method to generate contender phrases given the sentences of the text
        document.

        :param sentences: List of strings where each string represents a
                          sentence which forms the text.
        :return: List of string tuples where each tuple is a collection
                 of words forming a contender phrase; repeats are kept.
        """
        phrase_list = []
        # Create contender phrases from sentences.
        for sentence in sentences:
            word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
            phrase_list += self._get_phrase_list_from_words(word_list)
        return phrase_list

csurfer commented 3 years ago

Didn't see that this was already reported. I caught this bug myself while adding type hints, though later than you did. As of v1.0.5 this is exposed as a feature: a flag lets the user choose whether the phrase list keeps repeated phrases or only unique ones, as indicated here
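Independent of the linked change, the effect of such a flag can be sketched roughly as follows. `rank_phrases` and its `include_repeated_phrases` parameter are illustrative names chosen for this example and may not match the library's actual API; the scoring is the degree-to-frequency ratio used in the issue above.

```python
from collections import Counter


def rank_phrases(phrases, include_repeated_phrases=True):
    """Score phrases by the summed degree/frequency ratio of their words.

    `include_repeated_phrases` is a hypothetical parameter name; it controls
    whether duplicate phrases are dropped before the counts are built.
    """
    if not include_repeated_phrases:
        # dict.fromkeys drops duplicates while preserving order.
        phrases = list(dict.fromkeys(phrases))
    degree, frequency = Counter(), Counter()
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            frequency[word] += 1
            degree[word] += len(words)
    scores = {
        phrase: round(sum(degree[w] / frequency[w] for w in phrase.split()), 2)
        for phrase in phrases
    }
    # Highest score first, as (score, phrase) pairs.
    return sorted(((s, p) for p, s in scores.items()), reverse=True)


phrases = ["red apples", "good", "flavour", "red apples", "apples"]
print(rank_phrases(phrases, include_repeated_phrases=True))
print(rank_phrases(phrases, include_repeated_phrases=False))
```

With repeats kept, 'red apples' scores 3.67 and 'apples' 1.67; with unique phrases only, the scores fall to 3.5 and 1.5, matching the two rankings in the original report.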