So I know that I can building a news site crawls over all available news of the website:
cnn_paper = newspaper.build('https://cnn.com')
But how about when I want to get only newest news? In my case default caching architecture of Newspaper3k does not work because I run a never-ending crawler and my cache will finally end up overflowing.
One other problem is I can obtain publish_date of articles only after downloading&parsing each news.
Is there a workaround for this case? I think it will be very useful for many users too.
So I know that I can building a news site crawls over all available news of the website:
cnn_paper = newspaper.build('https://cnn.com')
But how about when I want to get only newest news? In my case default caching architecture of Newspaper3k does not work because I run a never-ending crawler and my cache will finally end up overflowing.
One other problem is I can obtain publish_date of articles only after downloading&parsing each news.
Is there a workaround for this case? I think it will be very useful for many users too.