csurfer / rake-nltk

Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
https://csurfer.github.io/rake-nltk
MIT License
1.06k stars 150 forks source link

Add a feature to limit min and max phrase length + resolve issue #4 #21

Closed LetianFeng closed 6 years ago

LetianFeng commented 6 years ago

This is a pull request to resolve the issue #4

The length limit should be applied before building the co-occurence matrix. Otherwise, we count meaningless phrases in the co-occurence matrix, and this could lead to a wrong ranked phrase list. Not only scores, order could also be influenced.

LetianFeng commented 6 years ago

Hi @csurfer could you please take a look at this pull request? Thx!

csurfer commented 6 years ago

Thank you for the pull request! 😄

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.4%) to 89.773% when pulling 6eb828caa166f886a0939af01200885c06158d38 on LetianFeng:min_max_word_length into 0f6988223bed9fc1d91020f7691617edee36ca07 on csurfer:master.

LetianFeng commented 6 years ago

Hi @csurfer , I pushed a new commit according to your suggestions.

Since return [filter(lambda item: self.min_length <= len(item) <= self.max_length, phrases)] will return a list of filter object like [<filter object at 0x7f1990d12e10>], I used list(filter(lambda x: self.min_length <= len(x) <= self.max_length, phrases)) instead.

Personally, I don't have any preference between generator expression and lambda expression, but I feel a little bit weird when I put lambda expression in the second line:

        return list(filter(
            lambda x: self.min_length <= len(x) <= self.max_length, phrases))

Maybe you can read this disscussion list comprehension vs. lambda + filter if you have enough time.

Note: I know nothing about the README.rst, so please take a look at it before merging.

LetianFeng commented 6 years ago

Oops, already merged :sweat_smile: , then have a nice weekend and good luck with this project!

csurfer commented 6 years ago

I meant list(filter()) but mistyped it as [filtrt()]. I will go over the documentation and code and correct any minor changes I think are needed. No worries.