Open christinac opened 7 years ago
This isn't a problem with config. You'll notice that if you build the New York Times source without a config, you get a list of articles, but if you download them and look at the text, every text is empty.
The "MIN_WORD_COUNT=400" setting is therefore filtering out all the articles, which is why you get an empty list when you build with the config.
EDIT: just randomly found the solution: pass "config=config" as a keyword argument instead of only "config". The documentation seems to be wrong or outdated here: https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html#parameters-and-configurations
import newspaper
from newspaper import Config
config = Config()
paper = newspaper.build('http://cnn.com', config=config)
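A possible explanation for why the keyword form matters, assuming the signature is build(url='', dry=False, config=None, **kwargs) as it appears in the newspaper3k source I have seen: a config passed positionally would bind to dry, and a truthy dry skips the build step. The sketch below uses hypothetical stand-ins (FakeConfig, build_like) rather than the real library, so it only illustrates the argument binding and makes no network calls:

```python
class FakeConfig:
    """Hypothetical stand-in for newspaper.Config (not the real class)."""


def build_like(url='', dry=False, config=None, **kwargs):
    """Hypothetical stand-in mirroring the assumed signature of newspaper.build."""
    config = config or FakeConfig()
    # Assumption: the real function only builds the source when dry is falsy.
    built = not dry
    return {'url': url, 'config': config, 'built': built}


cfg = FakeConfig()

# Positional: cfg binds to `dry`, so nothing is built and cfg is never used as the config.
positional = build_like('http://cnn.com', cfg)

# Keyword: cfg binds to `config`, and `dry` keeps its default of False.
keyword = build_like('http://cnn.com', config=cfg)

print(positional['built'], positional['config'] is cfg)
print(keyword['built'], keyword['config'] is cfg)
```

If the real signature matches this assumption, it would account for both symptoms reported in this thread: an empty result with a positional config, and a normal result with config=config or with individual keyword parameters.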
I'm having the same experience as @christinac. If I pass the config object, the build function only returns an object with empty lists for articles and so on, even if I don't change the default values in the config object. So in my case it does not depend on "MIN_WORD_COUNT" as @sixhobbits suspected (I also tried passing very low MIN_WORD_COUNT values):
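import newspaper
from newspaper import Config
config = Config()
# config is passed positionally here; both calls return an empty source for me
paper = newspaper.build('http://cnn.com', config)
paper = newspaper.build('https://www.foxnews.com/', config)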
It works fine, however, if I just pass the parameters in the build function individually.
paper = newspaper.build('http://cnn.com', language='en', memoize_articles=False, http_success_only=False, MIN_SENT_COUNT=7, MIN_WORD_COUNT=300, MAX_TITLE=300, keep_article_html=True, fetch_images=False)
Following the instructions here, I tried to introduce a Config object into my code but found that articles weren't downloaded:
Passing the config parameters to the build function downloaded the articles:
It seems like there might be a command (download()?) missing from the documentation.