Closed: firmai closed this 7 months ago
It's in the readme, see #run-the-crawler-via-the-cli: you'd need to add your domain to the base_urls list in your sitelist.hjson file, inside the config directory (which might be located at ~/news-please or ~/news-please-repo on Linux/macOS).
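A sitelist.hjson entry would look roughly like this (a sketch only; the exact keys and layout are defined by the default file news-please generates, so compare against that):

```hjson
{
  base_urls: [
    {
      # your target domain; other optional keys (crawler, heuristics
      # overrides, etc.) are documented in the generated default file
      url: "www.globenewswire.com"
    }
  ]
}
```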
Before that, you'd have to edit config.cfg with your MySQL, Postgres, or Elasticsearch settings (user/password/host/database). I thought you would only need one storage option, but the CLI didn't seem to work without all three set up (or I couldn't figure out a better way).
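For example, the MySQL section of config.cfg would look something like the fragment below (the section and key names here are illustrative; check the config.cfg that news-please generates for the exact names it expects):

```ini
; illustrative fragment -- compare key names against the generated config.cfg
[MySQL]
host = localhost
port = 3306
db = news_please
username = your_user
password = your_password
```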
You can also try using the Common Crawl datasets, if your domain is there. For that, you could run newsplease/examples/commoncrawl.py, adding your domain to the my_filter_valid_hosts list. Note the first comment in that script about relative imports: you'd need to download/clone the repo and run it with python -m newsplease.examples.commoncrawl from inside the news-please directory.
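The edit amounts to roughly the following sketch (only the my_filter_valid_hosts name comes from the script itself; the helper function below is illustrative, not the script's actual code -- see the file for how the list is really consumed):

```python
# Sketch of the domain filter in newsplease/examples/commoncrawl.py.
# my_filter_valid_hosts is the list named in the script; the helper
# below is a hypothetical illustration of how such a filter works.
my_filter_valid_hosts = ["globenewswire.com"]  # add your domains here


def is_wanted_article(url):
    """Keep only articles whose URL contains one of the wanted hosts."""
    return any(host in url for host in my_filter_valid_hosts)
```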
You might need some patience with Common Crawl, as (at least at the time of writing) it throws some access errors. See the discussion here: 1, 2, 3
I want to scrape entire domains, like https://www.globenewswire.com. I've been browsing the documentation for 15 minutes and have run a few tests, but I'm still not sure how to do it. I understand there is a CLI method and a Python method, but what do these methods look like? I literally just can't find the examples.
And is that even the right way to go about it? Can the Common Crawl news datasets provide data on a domain-by-domain basis?
Thanks a lot.