codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.1k stars 2.11k forks source link

nlp.py keywords() return wrong values #311

Open lasgu opened 7 years ago

lasgu commented 7 years ago

As specified by the comment it should return 10 keywords, it returns more depending on the article. The returned keywords are not sorted by frequency.

thundergolfer commented 7 years ago

Can you provide the steps to reproduce the bug? I can't see how the function can return more than 10 keywords.

You are right about the keywords not being sorted though. The code is using a regular dict() when it should be returning an OrderedDict as dict() is only ordered in Python 3.6.

lasgu commented 7 years ago

The following script

from newspaper import Article

a = Article("http://edition.cnn.com/2016/12/16/politics/obama-final-2016-press-conference/index.html")

a.download()
a.parse()
a.nlp()
print(a.keywords)

outputs this list of 13 keywords ['hacking', 'russia', 'political', 'trump', 'president', 'intelligence', 'obama', 'press', 'names', 'conference', 'team', 'russian', 'putin']

while the same script for this article http://edition.cnn.com/2016/12/18/politics/bob-gates-rex-tillerson-trump-russia/index.html outputs those 12 keywords ['relationship', 'russia', 'business', 'tillerson', 'president', 'secretary', 'friendly', 'russian', 'gates', 'rice', 'state', 'friends']

thundergolfer commented 7 years ago

Here's the answer. Look at the following code snippet from a.nlp():

text_keyws = list(nlp.keywords(self.text).keys())
title_keyws = list(nlp.keywords(self.title).keys())
keyws = list(set(title_keyws + text_keyws))
self.set_keywords(keyws)

As a.keywords is a union of the keywords in self.text and self.title, it can return more than 10.

arw2019 commented 4 years ago

One way to resolve the bug is to replace:

keyws = list(set(title_keyws + text_keyws))

with

keyws = set(title_keyws)
while len(keyws) < nlp.NUM_KEYWORDS and text_keyws:
     keyws.add(text_keyws.pop(0))
keyws = list(keyws)

nlp.NUM_KEYWORS is currently a local variable in the keywords function defined in nlp.py. Since I need it in this snippet I would promote it to a global variable and set it at the top of nlp.py.

What do you think?