I'm wondering how hard it would be to allow setting base_url through the Crawly.Spider.init/1 options. I have an SQS queue that a recrawl scheduler fills with web crawling jobs, and I want to run the crawlers on different machines. The SQS messages look like this: {"start_urls": ["http://example.org/"], "base_url": "http://example.org/"}. It would be straightforward to read these options from the queue in the init function.