fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Scrape by Domain #242

Closed: firmai closed this issue 7 months ago

firmai commented 1 year ago

I want to scrape entire domains, like https://www.globenewswire.com. I have been browsing the documentation for 15 minutes and ran a few tests, but I am still not sure how to do it. I understand there is a CLI method and a Python method, but what do these methods look like? I just can't find the examples.

And is that even the right way to go about it? Are the Common Crawl news datasets able to provide data on a domain-by-domain basis?

Thanks a lot.

pax commented 1 year ago

It's in the readme, see #run-the-crawler-via-the-cli. You need to add your domain to the base_urls list in your sitelist.hjson file, inside the config directory (which might be located at ~/news-please or ~/news-please-repo on Linux/macOS). A sketch of what that looks like is below.
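For illustration only, a minimal sitelist.hjson could look roughly like this (the domain is just the one from this issue; compare with the sitelist.hjson that news-please ships, as the exact structure in your version may differ):

```hjson
{
  # base_urls: the root URLs the crawler starts from, one entry per domain
  "base_urls": [
    "https://www.globenewswire.com/"
  ]
}
```

After that, running the news-please command from the CLI should pick up the sitelist and start crawling the listed domains.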

Before that you'd have to edit config.cfg with your MySQL, Postgres, or Elasticsearch settings (user/password/host/db). I thought you would only need one storage option, but the CLI didn't seem to work without all three set up (or I couldn't figure out how to disable the others).
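For orientation, the storage credentials live in config.cfg next to the sitelist. The section and key names below are only a sketch from memory and may differ in your version, so compare against the config.cfg that news-please generates; if I recall correctly, the active storage pipelines are listed under the [Scrapy] section's ITEM_PIPELINES setting, and commenting out the unused ones there is, in principle, how you'd run with a single backend.

```ini
[MySQL]
# Key names are a sketch from memory; check the generated config.cfg.
host = localhost
port = 3306
db = news-please
username = root
password = your-password
```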

You can also try using the Common Crawl datasets, if your domain is there. For that, you could run newsplease/examples/commoncrawl.py, adding your domain to the my_filter_valid_hosts list. Note the first comment in that script about relative imports: you'd need to download/clone the repo and run it with python -m newsplease.examples.commoncrawl from inside the news-please directory. You might need some patience with Common Crawl, as (at least at the time of writing) it throws some access errors. See discussion here: 1, 2, 3
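Concretely, the only edit needed in the example script is the host filter; the variable name is the one mentioned above, and the domain is just the one from this issue:

```python
# In newsplease/examples/commoncrawl.py:
# only articles whose host matches an entry in this list are kept
my_filter_valid_hosts = ['globenewswire.com']
```

Then run it from the repo root with python -m newsplease.examples.commoncrawl, as noted above.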