codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.14k stars 2.12k forks

Iterating over multiple runs - no new articles in spite of memoize=False #605

Open tomthebuzz opened 6 years ago

tomthebuzz commented 6 years ago

I am getting towards the end of my wisdom: whenever I manually start a new run over a portfolio of 10 sources and process the articles, I get the expected number of articles. However, if I execute the same code inside an iterating "while True:" loop with a 30-60 minute wait, deleting the original np.build() variables and setting memoize_articles=False (in both the build() and Article() calls), I only ever get the articles from the initial run, regardless of whether the source has published new articles during the waiting time.
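
For reference, a minimal sketch of the kind of loop described above (the source URLs and wait time are illustrative placeholders, not taken from this report):

import time
import newspaper

# placeholder portfolio of sources; swap in the real URLs
sources = ['https://example-news-site.com', 'https://another-news-site.com']

while True:
    for url in sources:
        paper = newspaper.build(url, memoize_articles=False)  # explicitly ask for no caching
        print(url, len(paper.articles), 'articles found')
        for article in paper.articles:
            article.download()
            article.parse()
        del paper  # drop the build result before the next pass
    time.sleep(30 * 60)  # 30-60 minute wait between runs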

Has anyone had a similar experience and found a workable solution?

naivelogic commented 6 years ago

Yes, I am currently having the same problem. The only workable solution I could find was to manually go to the feed cache location, ~/.newspaper_scraper/feed_category_cache, and remove the files. I have yet to develop a way to do this from within the Python function itself. Hope this helps.

codelucas commented 6 years ago

Thanks for filing this @tomthebuzz and also @naivelogic.

If what you are both reporting is accurate, then this is a serious bug. I will try to reproduce it, but can you two also share the precise commands you ran to hit this issue so I can verify, and which OS you are using?

tomthebuzz commented 6 years ago

Hi Lucas,

unfortunately it does not reproduce consistently. While iterating at 10-minute intervals, 7-8 out of 10 runs show this behavior and 2-3 work as expected. It has improved somewhat since I started catching download() errors via try/except. I will continue to monitor and report back as soon as I have something more enlightening.

Cheers -Tom

codelucas commented 6 years ago

The memoization behavior is becoming a real problem: a lot of users are reporting issues with the API out of confusion, which indicates the API design isn't ideal.

Since the start, newspaper has handled memoization by caching previously scraped articles on disk and not re-scraping them, mostly because a few newspaper.build() calls against the same website will get you rate limited or banned due to the heavy request load. We could let callers do the caching themselves, but the library is well past its design phase and it is late for a change that big.

I still think memoizing content should be the default, but maybe we can add logging.info statements whenever memoization happens so it is very clear when articles are cached or not cached.
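
A hypothetical illustration of what such a log line could look like (this is not newspaper's actual code; the function and its arguments are made up for the sketch):

import logging

log = logging.getLogger('newspaper')

def memoize_articles(source_url, found_urls, cached_urls):
    # report how many URLs the on-disk cache filtered out before returning only the new ones
    fresh = [u for u in found_urls if u not in cached_urls]
    log.info('%s: %d urls found, %d already cached, %d new',
             source_url, len(found_urls), len(found_urls) - len(fresh), len(fresh))
    return fresh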

naivelogic commented 6 years ago

Hey Lucas, pardon my response delay. I do like the memoization functionality because it limits the amount of processing required; I'm glad the feature is there, since otherwise I would have had to write such a function manually. However, the caching seems to be the root of the problem where we aren't able to iterate over a list of URLs.

To remediate this issue, similar to Tom's approach, the fix that has worked well enough for me is as follows:

import os

cache_to_remove = '/home/<insert user name>/.newspaper_scraper/feed_category_cache/f3b78688afc588cf439322fd84aca09a805e8a6f'

# remove the cached feed file so the next scraper run re-fetches the source
try:
    os.remove(cache_to_remove)
except OSError:
    pass

codelucas commented 6 years ago

Thanks for your thoughts @naivelogic

In newspaper/utils.py there is a function for clearing the cache per news source. Check it out and please suggest improvements to this cache-cleaning API:

https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py#L273-L280
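
Assuming the helper at those lines is the per-source cache clearer (something like clear_memo_cache(source); check the linked file for the exact name), calling it between runs might look like this:

import newspaper
from newspaper.utils import clear_memo_cache  # name assumed from the linked utils.py

paper = newspaper.build('https://example-news-site.com', memoize_articles=False)
# ... process paper.articles ...
clear_memo_cache(paper)  # drop the on-disk article list for this source before the next run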

Judging by the reports from you and @tomthebuzz, perhaps there is a bug where, even when memoize_articles is False, things are still being cached when they shouldn't be.

Alternatively, since none of this is deterministic (the HTML scraping can return a 404 or 500 error, or even hit a rate limit if the news site decides you are scraping too much), we don't know whether the 7-out-of-10 failure rate is due to a bug in the memoizing behavior or the remote news site returning different data.

agnelvishal commented 5 years ago

Since the error is not deterministic, could multi-threading be causing this problem?

ghost commented 3 years ago

When this happened for me it was due to rate limiting and being blocked by the sites.