Bug - phrase_list shouldn't be a set at the beginning

This implementation ignores phrases that occur more than once. For example: text = 'Red apples, are good in flavour. Where are my red apples? Apples!'

According to the paper, we should get a list of phrases like: ['red apples', 'good', 'flavour', 'red apples', 'apples'], which gives the following word scores:

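| word | good | flavour | apples | red |
| --- | --- | --- | --- | --- |
| degree | 1 | 1 | 5 | 4 |
| frequency | 1 | 1 | 3 | 2 |
| ratio (degree / frequency) | 1 | 1 | 1.67 | 2 |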

So the correct ranked phrases should be:

(3.67, 'red apples')
(1.67, 'apples')
(1.0, 'good')
(1.0, 'flavour')

However, in the current implementation, the extracted phrase list is: ['red apples', 'good', 'flavour', 'apples']

The second 'red apples' is dropped, so the frequency and degree counts are too low and the ranked phrases get the wrong scores:

(3.5, 'red apples')
(1.5, 'apples')
(1.0, 'good')
(1.0, 'flavour')
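
To double-check the arithmetic, here is a minimal standalone sketch that reproduces both sets of scores. It is not the rake-nltk code: `rake_scores` is a hypothetical helper written for this issue, and the phrases are hard-coded instead of being tokenized from the text. It assumes a word's degree grows by the length of every phrase it appears in (the word itself included), which matches the table above.

```python
from collections import Counter, defaultdict


def rake_scores(phrases):
    """Score each distinct phrase as the sum of degree/frequency of its words.

    Hypothetical helper for illustration only, not part of rake-nltk.
    """
    frequency = Counter()
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Each occurrence adds the phrase length to the word's degree,
            # so the word co-occurring with itself is counted too.
            degree[word] += len(phrase)
    return {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in set(phrases)
    }


# Phrases from the example text, duplicates kept.
phrases = [("red", "apples"), ("good",), ("flavour",), ("red", "apples"), ("apples",)]

with_duplicates = rake_scores(phrases)       # counts built from the full list
deduplicated = rake_scores(set(phrases))     # counts built from a set

for name, scores in (("list", with_duplicates), ("set", deduplicated)):
    ranked = sorted(((round(s, 2), p) for p, s in scores.items()), reverse=True)
    print(name, ranked)
```

Building the counts from the full list and deduplicating only at ranking time is the same split the proposed fix makes.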

This bug can be fixed easily: change the methods extract_keywords_from_sentences and _generate_phrases as shown below, so that the phrase list keeps duplicates while the counts are built and is only converted to a set for ranking:

    def extract_keywords_from_sentences(self, sentences):
        """Method to extract keywords from the list of sentences provided.

        :param sentences: Text to extract keywords from, provided as a list
                          of strings, where each string is a sentence.
        """
        phrase_list = self._generate_phrases(sentences)
        self._build_frequency_dist(phrase_list)
        self._build_word_co_occurance_graph(phrase_list)
        self._build_ranklist(set(phrase_list))

    def _generate_phrases(self, sentences):
        """Method to generate contender phrases given the sentences of the text
        document.

        :param sentences: List of strings where each string represents a
                          sentence which forms the text.
        :return: List of string tuples where each tuple is a collection
                 of words forming a contender phrase.
        """
        phrase_list = []
        # Create contender phrases from sentences.
        for sentence in sentences:
            word_list = [word.lower() for word in wordpunct_tokenize(sentence)]
            phrase_list += self._get_phrase_list_from_words(word_list)
        return phrase_list
