HTTP JSON API to deploy scrapers and schedule jobs.
Provides a minimal (read: ugly; see attached screenshots) web UI for viewing running scrapers and their output.
Using scrapyd-client, scrapers are packaged in the ancient Python "egg" format and uploaded to scrapyd.
After uploading, jobs are scheduled by making POST requests to scrapyd.
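A rough sketch of what those two HTTP calls look like with requests, assuming scrapyd is running on the default localhost:6800 and using placeholder project/spider names and a hypothetical egg filename:

```python
import requests

SCRAPYD = "http://localhost:6800"  # assumed default host/port

# Upload the egg built by scrapyd-client (addversion.json is scrapyd's deploy endpoint).
with open("myproject-1.0-py3.egg", "rb") as egg:  # hypothetical egg filename
    resp = requests.post(
        f"{SCRAPYD}/addversion.json",
        data={"project": "myproject", "version": "1.0"},
        files={"egg": egg},
    )
    print(resp.json())  # e.g. {"status": "ok", "spiders": 3}

# Schedule a job for one of the project's spiders.
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```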
Python 3 support is only available in the current pre-release version, but it seems to work fine.
We'd have to install dependencies (requests, raven) alongside scrapyd; I was not able to auto-install them from the uploaded scraper's package metadata.
Options to inject configuration (API_TOKEN, SENTRY_DSN):

- environment variables of the scrapyd process
- hardcode them in settings.py of the uploaded scraper package
- set them when scheduling a job (sketched after this list)
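For the third option, a minimal sketch of passing the values per job via schedule.json's repeatable `setting` parameter (the host, project/spider names, and token values are placeholders):

```python
import requests

# Inject settings per job; scrapyd forwards each "setting" value as a
# Scrapy setting override (equivalent to `scrapy crawl -s KEY=VALUE`).
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data=[
        ("project", "myproject"),
        ("spider", "myspider"),
        ("setting", "API_TOKEN=secret-token"),                   # placeholder value
        ("setting", "SENTRY_DSN=https://example@sentry.io/1"),   # placeholder value
    ],
)
print(resp.json())
```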
Can configure the number of concurrent jobs.
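For reference, the relevant knobs live in scrapyd.conf; a sketch assuming the default option names max_proc and max_proc_per_cpu:

```ini
[scrapyd]
# Hard cap on concurrent spider processes (0 means "derive from CPU count").
max_proc = 4
# Used when max_proc is 0: number of concurrent processes per CPU.
max_proc_per_cpu = 2
```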
Each scraper's log output is saved as a text file, accessible via HTTP.
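Sketch of pulling a job's log over HTTP, assuming the default log layout (/logs/&lt;project&gt;/&lt;spider&gt;/&lt;jobid&gt;.log) and using the job id that schedule.json returned:

```python
import requests

job_id = "78391cc0fcaf11e1b0090800272a6d06"  # placeholder: jobid returned by schedule.json
log_url = f"http://localhost:6800/logs/myproject/myspider/{job_id}.log"
print(requests.get(log_url).text)
```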
Scrapyd is an application for deploying and running Scrapy spiders. It enables one to deploy scrapers and control their spiders using a JSON API.
It typically runs as a daemon that listens to requests for spiders to run and spawns a process for each one.
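To see what the daemon is doing at any point, listjobs.json (and, in newer versions, daemonstatus.json) can be polled; a minimal sketch with placeholder names:

```python
import requests

# Pending/running/finished jobs for a project.
jobs = requests.get(
    "http://localhost:6800/listjobs.json", params={"project": "myproject"}
).json()
for job in jobs.get("running", []):
    print(job["spider"], job["id"], job.get("start_time"))
```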
This issue is about investigating whether or not Scrapyd could be useful in the deployment model.