fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Get only the recursive list of URLs using the Library mode #236

Closed bakrianoo closed 1 year ago

bakrianoo commented 1 year ago

Describe your question

If I am only interested in the recursively discovered list of URLs of a website, how can I get it in library mode, so that I feed in only the website URL and get back the list of news URLs discovered there?

I need to perform this from Python code (Library mode)

Versions (please complete the following information):

Intent [*] academic

krishna-perugupalli commented 1 year ago

The easiest way is to use the newspaper3k library to get all the URLs. NewsPlease is tightly integrated with it and also uses it for some functionality, e.g. URLs and some news article metadata.

Hope this helps you to explore.

fhamborg commented 1 year ago

Retrieving all links in library mode is not currently supported. Feel free to open a PR for this functionality :)
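
Since library mode does not expose link discovery, one possible workaround is to collect the links yourself. The following is a minimal sketch using only the Python standard library; it is not part of news-please or newspaper3k. The names `LinkParser`, `fetch_html`, `discover_urls`, and the `max_pages` limit are hypothetical and chosen for illustration. It assumes that "recursive" means following same-host links breadth-first, and it returns every same-host link it finds, not only news articles, so further filtering would still be needed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Hypothetical default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def discover_urls(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl returning same-host URLs reachable from start_url."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            # Keep only links on the starting host that are not yet known.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

Calling `discover_urls("https://example.com/")` (a placeholder URL) would fetch at most `max_pages` pages and return the sorted list of discovered URLs. The `fetch` parameter is there so the download step can be swapped out, for example to add rate limiting or to read from a cache.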