codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.19k stars 2.12k forks

Config object documentation doesn't seem to be accurate #332

Open christinac opened 7 years ago

christinac commented 7 years ago

Following the instructions here, I tried to introduce a Config object into my code, but found that no articles were downloaded:

>>> import newspaper
>>> from newspaper import Config
>>> config = Config()
>>> config
<newspaper.configuration.Configuration object at 0x107da0358>
>>> config.memoize_articles = False
>>> config.language = 'en'
>>> config.MIN_WORD_COUNT = 400
>>> config
<newspaper.configuration.Configuration object at 0x107da0358>

>>> print(config.language)
en
>>> print(config)
<newspaper.configuration.Configuration object at 0x107da0358>
>>> nyt = newspaper.build('http://nytimes.com', config)
>>> nyt
<newspaper.source.Source object at 0x107da02e8>
>>> nyt.articles
[]

Passing the config parameters directly to the build function did produce a list of articles:

>>> bb = newspaper.build('http://breitbart.com', memoize_articles=False, language='en', MIN_WORD_COUNT=400)
>>> bb.articles
[<newspaper.article.Article object at 0x10bf76b00>, <newspaper.article.Article object at 0x10bf76a90>, <newspaper.article.Article object at 0x10bf76d30>, <newspaper.article.Article object at 0x10bf972b0>, <newspaper.article.Article object at 0x10bf976a0>, <newspaper.article.Article object at 0x10bf97a90>, <newspaper.article.Article object at 0x10bf97cf8>, <newspaper.article.Article object at 0x10c014860>, <newspaper.article.Article object at 0x10c014c50>, ...

It seems like there might be a command (download()?) missing from the documentation.

sixhobbits commented 7 years ago

This isn't a problem with config. You'll notice if you build New York Times without config, you'll get a list of articles, but if you download them and look at the text, all the texts will be empty.

Therefore the "MIN_WORD_COUNT=400" setting is filtering out all the articles, which is why you get an empty list when you build with the config.

MoritzLaurer commented 5 years ago

EDIT: I stumbled on the solution: pass the object as a keyword argument ("config=config") instead of positionally (just "config"). The documentation seems to be wrong or outdated here: https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html#parameters-and-configurations

import newspaper
from newspaper import Config

config = Config()
paper = newspaper.build('http://cnn.com', config=config)

I'm having the same experience as @christinac. When I pass the config object positionally, the build function returns an object whose article list (and other lists) are empty, even if I leave every default value in the config object unchanged. So in my case it does not depend on "MIN_WORD_COUNT" as @sixhobbits suspected (I also tried passing very low MIN_WORD_COUNT values):

import newspaper
from newspaper import Config

config = Config()
# config passed positionally: the returned source has no articles
paper = newspaper.build('http://cnn.com', config)
paper = newspaper.build('https://www.foxnews.com/', config)

It works fine, however, if I pass the parameters to the build function individually:

paper = newspaper.build('http://cnn.com', language='en', memoize_articles=False, http_success_only=False, MIN_SENT_COUNT=7, MIN_WORD_COUNT=300, MAX_TITLE=300, keep_article_html=True, fetch_images=False)
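
A possible explanation for the difference, which nobody in this thread has confirmed: if build() is declared roughly as build(url='', dry=False, config=None, **kwargs), then a config passed as the second positional argument lands in the dry parameter. Any object is truthy, so the build step would be skipped and a fresh default config used instead. The sketch below is a stand-in written for illustration only. FakeConfig and build_sketch are hypothetical names, and the signature is an assumption, so it does not import or call newspaper itself.

```python
# Simplified stand-in, NOT the library's code.
# Assumed signature: build(url='', dry=False, config=None, **kwargs)


class FakeConfig:
    """Hypothetical placeholder for newspaper's Config object."""

    def __init__(self):
        self.memoize_articles = True


def build_sketch(url="", dry=False, config=None, **kwargs):
    """Report what a build call with these arguments would do."""
    config = config or FakeConfig()
    # Settings passed individually are copied onto the config object.
    for key, value in kwargs.items():
        setattr(config, key, value)
    # Any truthy value in `dry` means the source is never actually built.
    return {"built": not dry, "config": config}


cfg = FakeConfig()
positional = build_sketch("http://cnn.com", cfg)         # cfg lands in `dry`
keyword = build_sketch("http://cnn.com", config=cfg)     # cfg lands in `config`
individual = build_sketch("http://cnn.com", memoize_articles=False)

print(positional["built"], positional["config"] is cfg)   # False False
print(keyword["built"], keyword["config"] is cfg)         # True True
print(individual["built"], individual["config"].memoize_articles)  # True False
```

Under that assumption, all three observations above fit together: the positional call returns an unbuilt source with no articles, while config=config and individually passed parameters both go through the normal build path.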