Open lasgu opened 7 years ago
Can you provide the steps to reproduce the bug? I can't see how the function can return more than 10 keywords.
You are right about the keywords not being sorted though. The code is using a regular dict()
when it should be returning an OrderedDict
as dict()
is only ordered in Python 3.6.
The following script
from newspaper import Article
a = Article("http://edition.cnn.com/2016/12/16/politics/obama-final-2016-press-conference/index.html")
a.download()
a.parse()
a.nlp()
print(a.keywords)
outputs this list of 13 keywords
['hacking', 'russia', 'political', 'trump', 'president', 'intelligence', 'obama', 'press', 'names', 'conference', 'team', 'russian', 'putin']
while the same script for this article http://edition.cnn.com/2016/12/18/politics/bob-gates-rex-tillerson-trump-russia/index.html outputs those 12 keywords
['relationship', 'russia', 'business', 'tillerson', 'president', 'secretary', 'friendly', 'russian', 'gates', 'rice', 'state', 'friends']
Here's the answer. Look at the following code snippet from a.nlp()
:
text_keyws = list(nlp.keywords(self.text).keys())
title_keyws = list(nlp.keywords(self.title).keys())
keyws = list(set(title_keyws + text_keyws))
self.set_keywords(keyws)
As a.keywords
is a union of the keywords in self.text
and self.title
, it can return more than 10.
One way to resolve the bug is to replace:
keyws = list(set(title_keyws + text_keyws))
with
keyws = set(title_keyws)
while len(keyws) < nlp.NUM_KEYWORDS and text_keyws:
keyws.add(text_keyws.pop(0))
keyws = list(keyws)
nlp.NUM_KEYWORS is currently a local variable in the keywords function defined in nlp.py. Since I need it in this snippet I would promote it to a global variable and set it at the top of nlp.py.
What do you think?
As specified by the comment it should return 10 keywords, it returns more depending on the article. The returned keywords are not sorted by frequency.