codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs: https://goo.gl/VX41yK
MIT License
14.05k stars 2.11k forks

[Feature Request] Add CLI support #842

Open jtara1 opened 4 years ago

jtara1 commented 4 years ago

It would be nice to be able to print the article text to stdout. It could look like:

# prints article text, we can redirect to file with sh
newspaper https://www.cnn.com/2020/09/08/health/coronavirus-vaccine-astrazeneca-pause/index.html

Or it could have a --format json option that emits all of the article's parsed metadata as a JSON object. A shell tool like jq would make it easy to pull whichever parts are needed out of stdout anyway.
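A rough sketch of what such a wrapper might look like, built on the library's existing `Article` class (`download()`, `parse()`, and the `title`, `authors`, `publish_date`, `top_image`, and `text` attributes). This is not an existing newspaper3k command: the `newspaper` program name, the `--format` flag, the function names, and the choice of JSON fields are all assumptions made for illustration.

```python
import argparse
import json


def build_parser():
    # Hypothetical CLI surface: one positional URL plus a --format switch.
    parser = argparse.ArgumentParser(
        prog="newspaper",
        description="Print an article's text or parsed metadata to stdout.",
    )
    parser.add_argument("url", help="URL of the article to extract")
    parser.add_argument(
        "--format",
        dest="fmt",
        choices=["text", "json"],
        default="text",
        help="output plain article text (default) or a JSON object",
    )
    return parser


def fetch_fields(url):
    # Imported lazily so the parsing and rendering helpers can be used
    # (and tested) without newspaper3k installed or network access.
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    # The selection of fields here is an assumption, not a defined schema.
    return {
        "url": url,
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "top_image": article.top_image,
        "text": article.text,
    }


def render(fields, fmt):
    if fmt == "json":
        # default=str turns a datetime publish_date into a string.
        return json.dumps(fields, ensure_ascii=False, indent=2, default=str)
    return fields["text"]


def main(argv=None):
    args = build_parser().parse_args(argv)
    print(render(fetch_fields(args.url), args.fmt))
    return 0
```

To run it as a script, call `sys.exit(main())` at the bottom of the file or register `main` as a console-script entry point. With the JSON output, something like `newspaper <url> --format json | jq -r .title` would then print just the title.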

johnbumgarner commented 3 years ago

I'm unsure why you need the data elements written to stdout. Can you explain this need?

In the meantime, I recently started putting together a detailed Newspaper3k usage document that I'm publicly sharing. This document is available here: https://github.com/johnbumgarner/newspaper3_usage_overview. It contains details on how to write article elements to DataFrames, CSV and JSON files.

P.S. This document is a work in progress, so more information will be added.

jtara1 commented 3 years ago

My personal use of this right now is in another scraping project.

One example where a CLI would be useful is downloading and saving the HTML or text of an article to read offline later. For small use cases like this, I could just write a Python script anyway.
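Such a script could be as small as the sketch below. It assumes an already downloaded and parsed `Article` (which exposes `html`, `text`, and `title`); the helper names `slugify` and `save_article` and the file-naming scheme are made up for this example.

```python
import re
from pathlib import Path


def slugify(title, fallback="article"):
    # Reduce a title to a filesystem-friendly stem; fall back if nothing is left.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or fallback


def save_article(article, out_dir="."):
    # `article` is any object with .title, .html and .text attributes,
    # e.g. a newspaper Article after download() and parse().
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = slugify(article.title or "")
    html_path = out / f"{stem}.html"
    text_path = out / f"{stem}.txt"
    html_path.write_text(article.html, encoding="utf-8")
    text_path.write_text(article.text, encoding="utf-8")
    return html_path, text_path
```

Usage would be along the lines of creating `Article(url)`, calling `download()` and `parse()` on it, and then passing it to `save_article(article, "offline")`.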

johnbumgarner commented 3 years ago

Thanks for the info. Did you see the examples in my GitHub repo that save data elements in multiple formats?

jtara1 commented 3 years ago

Yes, it can help give me a jump-start, thanks.

johnbumgarner commented 3 years ago

Great. LMK if you need any help, because I'm interested in improving my usage document.