tomthebuzz opened this issue 6 years ago
Yes, I am currently having the same problem. The only workable solution I could find was to manually go to the cache location of the feeds, ~/.newspaper_scraper/feed_category_cache, and remove the files. I have yet to develop a solution that does this from within the Python function. Hope this helps.
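A minimal sketch of that manual cleanup, assuming the cache lives under ~/.newspaper_scraper/feed_category_cache (the exact location can differ by newspaper version and platform):

```python
import glob
import os

# Assumed cache location; some newspaper versions place this under the
# system temp directory instead of the home directory.
cache_dir = os.path.expanduser('~/.newspaper_scraper/feed_category_cache')

# Delete every cached feed/category file so the next build() starts fresh.
for path in glob.glob(os.path.join(cache_dir, '*')):
    try:
        os.remove(path)
    except OSError:
        pass  # already gone or not removable; skip it
```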
Thanks for filing this @tomthebuzz and also @naivelogic.
If what you two are reporting is true, then this seems to be a serious bug. I will try to reproduce it, but can you both also share the precise commands you ran to hit this issue so I can verify, and which OS you are using?
Hi Lucas,
unfortunately it does not reproduce consistently. While iterating in 10-minute intervals, I see this behavior in 7-8 out of 10 runs, while 2-3 work as expected. It has improved somewhat since I started catching download() errors via try/except. I will continue to monitor and report back as soon as I have something more enlightening.
Cheers -Tom
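For context, a minimal sketch of the try/except guard around download() described above; in newspaper3k a failed download typically surfaces as an ArticleException once parse() is called, so both calls are wrapped (the URL is a placeholder):

```python
from newspaper import Article, ArticleException

url = 'https://example.com/some-article'  # placeholder URL
article = Article(url, memoize_articles=False)

try:
    article.download()
    article.parse()
except ArticleException as exc:
    # A 404/500 or rate-limited download lands here instead of aborting
    # the whole iteration over the URL list.
    print('skipping %s: %s' % (url, exc))
```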
The memoization behavior is beginning to get very annoying, since a lot of users are reporting issues with the API out of confusion, which indicates the API is not perfect.
The way newspaper has handled memoizing content since the start is to cache previously scraped articles on disk and not re-scrape them, mostly because a few newspaper.build() calls on the same website will get you rate limited or banned due to the heavy load of requests. Sure, we could let the users/callers do the caching themselves, but the library is well past its design phase and it is too late for a big change like that.
I still think memoizing content should be the default, but maybe we can add logging.info statements whenever memoization happens so it is very clear when articles are or are not cached.
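As a stopgap on the caller side, whatever the library already logs around memoization can be surfaced by configuring the standard logging module; a rough sketch (logger names follow the library's module paths and may differ by version):

```python
import logging
import newspaper

# newspaper logs through the stdlib logging module; raising the level
# makes any existing cache/memoization messages visible on the console.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('newspaper').setLevel(logging.DEBUG)

paper = newspaper.build('https://example.com', memoize_articles=True)
print(len(paper.articles))
```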
Hey Lucas, pardon my response delay. I do like the memoization functionality because it limits the amount of processing required. I'm glad the feature is there, because otherwise I would have had to create such a function manually. However, the caching seems to be the root of the problem where we aren't able to iterate over a list of URLs.
To remediate this issue, similar to Tom's approach, the fix that has worked sufficiently well for me is as follows:
```python
import os

cache_to_remove = '/home/<insert user name>/.newspaper_scraper/feed_category_cache/f3b78688afc588cf439322fd84aca09a805e8a6f'

# Remove the cached feed/article entry that the scraper function keeps reusing
try:
    os.remove(cache_to_remove)
except OSError:
    pass
```
Thanks for your thoughts @naivelogic
In newspaper/utils.py we have a function available for clearing the cache per news source. Check it out and please suggest improvements to this cache-cleaning API:
https://github.com/codelucas/newspaper/blob/master/newspaper/utils.py#L273-L280
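For reference, the linked lines appear to correspond to utils.clear_memo_cache(source); assuming that is the function in question, calling it between builds would look roughly like this:

```python
import newspaper
from newspaper import utils

# Build the source once, then clear its memoization cache so a later
# build() re-scrapes everything instead of skipping "seen" articles.
paper = newspaper.build('https://example.com', memoize_articles=True)
utils.clear_memo_cache(paper)
```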
Judging by the reports from you and @tomthebuzz, perhaps there is a bug where, even when memoize_articles is False, things are still getting cached when they shouldn't be.
Alternatively, since none of this is deterministic (the HTML scraping can return a 404 or 500, or even hit a rate limit if the news site decides you are scraping too much), we don't know whether the 7 out of 10 failures are due to a bug in the memoizing behavior or simply the remote news site returning different data.
Since the error is not deterministic, could multi-threading be causing this problem?
When this happened for me it was due to rate limiting and being blocked by the sites.
I'm nearly at my wit's end: whenever I manually start a new run over a portfolio of 10 sources and process the articles, I seem to get the correct number of articles. If, however, I execute the same code in an iterating while True: loop with a 30-60 minute wait, deleting the original np.build() variables and passing memoize_articles=False (in both the build() and Article() calls), I always seem to get only the articles from the initial run, regardless of whether the source has published new articles during the wait.
Has anyone had similar experiences and found a workable solution?
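For anyone trying to reproduce this, a minimal sketch of the loop described above (source URLs and wait time are placeholders):

```python
import time
import newspaper

SOURCES = ['https://example-news-site.com']  # placeholder portfolio

while True:
    for url in SOURCES:
        # memoize_articles=False should force a full re-scrape each pass,
        # yet per the reports above only the first pass returns articles.
        paper = newspaper.build(url, memoize_articles=False)
        print(url, len(paper.articles))
        del paper  # drop the build() result before the next iteration
    time.sleep(30 * 60)  # wait 30 minutes between passes
```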