fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Scrape by Domain #242

Closed: firmai closed this issue 7 months ago

firmai commented 1 year ago

I want to scrape entire domains, like https://www.globenewswire.com. I have been browsing the documentation for 15 minutes and ran a few tests, but I am still not sure how to do it. I understand there is a CLI method and a Python method, but what do these methods look like? I just can't find the examples.

And is that even the right way to go about it? Are the Common Crawl news datasets able to provide data on a domain-by-domain basis?

Thanks a lot.

pax commented 1 year ago

It's in the readme, see #run-the-crawler-via-the-cli. You need to add your domain to the base_urls list in your sitelist.hjson file, inside the config directory (which might be located at ~/news-please or ~/news-please-repo on Linux/macOS). A sketch of what that looks like is below.
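For illustration only, a minimal sitelist.hjson could look roughly like this (the domain is just the one from this issue; compare with the sitelist.hjson that news-please ships, as the exact structure in your version may differ):

```hjson
{
  # base_urls: the root URLs the crawler starts from, one entry per domain
  "base_urls": [
    "https://www.globenewswire.com/"
  ]
}
```

After that, running the news-please command from the CLI should pick up the sitelist and start crawling the listed domains.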

Before that you'd have to edit config.cfg with your MySQL, Postgres, or Elasticsearch settings (user/password/host/db). I thought you would only need one storage option, but the CLI didn't seem to work without all three set up (or I couldn't figure out how to disable the others).
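For orientation, the storage credentials live in config.cfg next to the sitelist. The section and key names below are only a sketch from memory and may differ in your version, so compare against the config.cfg that news-please generates; if I recall correctly, the active storage pipelines are listed under the [Scrapy] section's ITEM_PIPELINES setting, and commenting out the unused ones there is, in principle, how you'd run with a single backend.

```ini
[MySQL]
# Key names are a sketch from memory; check the generated config.cfg.
host = localhost
port = 3306
db = news-please
username = root
password = your-password
```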

You can also try using the Common Crawl datasets, if your domain is there. For that, you could run newsplease/examples/commoncrawl.py, adding your domain to the my_filter_valid_hosts list. Note the first comment in that script about relative imports: you'd need to download/clone the repo and run it with python -m newsplease.examples.commoncrawl from inside the news-please directory. You might need some patience with Common Crawl, as (at least at the time of writing) it throws some access errors. See discussion here: 1, 2, 3
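Concretely, the only edit needed in the example script is the host filter; the variable name is the one mentioned above, and the domain is just the one from this issue:

```python
# In newsplease/examples/commoncrawl.py:
# only articles whose host matches an entry in this list are kept
my_filter_valid_hosts = ['globenewswire.com']
```

Then run it from the repo root with python -m newsplease.examples.commoncrawl, as noted above.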