codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Where to find and delete all articles? #959

Open steeljardas opened 1 year ago

steeljardas commented 1 year ago

I am using Newspaper3k on around 20k articles, where would I need to go to delete all these articles that Newspaper3k is downloading?

johnbumgarner commented 1 year ago

If memoize_articles is not set to False, Newspaper caches article URLs and associated data in your system's temp directory. My Newspaper3k Overview Document has some details on this cache.

NiravJoshi33 commented 6 months ago

I believe what @steeljardas is asking is how to delete the cache?

AndyTheFactory commented 6 months ago

The cache folder is ANCHOR_DIRECTORY: https://github.com/codelucas/newspaper/blob/f622011177f6c2e95e48d6076561e21c016f08c3/newspaper/settings.py#L48

Normally it would be:

/tmp/.newspaper_scraper/feed_category_cache

NiravJoshi33 commented 6 months ago

Thanks @AndyTheFactory

johnbumgarner commented 6 months ago

@AndyTheFactory Yes, I agree that @steeljardas was looking for a way to delete all the memoized articles. The document that I mentioned contains information on the cache's location.
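
Putting the answers in this thread together, clearing the cache could look roughly like the sketch below. It is not part of Newspaper3k: `clear_newspaper_cache`, `CACHE_ROOT`, and `CACHE_SUBDIRS` are hypothetical names, the `.newspaper_scraper/feed_category_cache` location is the one quoted above, and the `memoized` folder name is an assumption that should be checked against `newspaper/settings.py` for the installed version. The sketch empties the folders rather than deleting them, on the assumption that the library expects them to exist.

```python
import shutil
import tempfile
from pathlib import Path

# ".newspaper_scraper" and "feed_category_cache" come from the path quoted in
# this thread. "memoized" is an assumption about where memoized article URLs
# are kept; confirm it against newspaper/settings.py for your version.
CACHE_ROOT = Path(tempfile.gettempdir()) / ".newspaper_scraper"
CACHE_SUBDIRS = ("feed_category_cache", "memoized")


def clear_newspaper_cache(root=CACHE_ROOT, subdirs=CACHE_SUBDIRS):
    """Empty the given cache subdirectories.

    The subdirectories themselves are kept. Returns the number of
    top-level entries (files or folders) that were removed.
    """
    removed = 0
    for name in subdirs:
        folder = Path(root) / name
        if not folder.is_dir():
            continue  # nothing has been cached in this folder yet
        for entry in folder.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # nested folder: remove it recursively
            else:
                entry.unlink()  # regular file or symlink
            removed += 1
    return removed
```

Calling `clear_newspaper_cache()` with no arguments targets the system temp directory, which is typically `/tmp` on Linux. Passing a different `root` makes it possible to try the function on a scratch folder first.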

AndyTheFactory commented 6 months ago

@johnbumgarner I have read your very good documentation!

Your great work inspired me to keep this software alive as a new package: https://github.com/AndyTheFactory/newspaper4k

There were a lot of problems and bugs, but I have the sense it's moving in the right direction. I will release a new version pretty soon with a lot of fixes and improvements.

Have a very good new year, and many thanks for your great work!

johnbumgarner commented 6 months ago

@AndyTheFactory Thanks. I will reference your fork in my document. You state that newspaper3k was last updated in September 2020. The correct date is September 2018, which is the date of the last code push to PyPI. And you are correct that there are a lot of bugs in the current code base. I started a new project called NewsHound, but never pushed the code, because someone wanted to use it commercially. They lost their funding and now I have to revisit the code. The issue that I have found with open-source projects is that everyone wants to use them, but few people will put in the effort to help someone maintain a project. Good luck with your fork...