elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Set base_url in init options instead of callback #296

Closed by adonig 2 months ago

adonig commented 2 months ago

I'm wondering how hard it would be to allow setting the base_url through the Crawly.Spider.init/1 options. I have an SQS queue that a recrawl scheduler fills with web crawling jobs, and I want to run crawlers on different machines. The SQS messages look like this: {"start_urls": ["http://example.org/"], "base_url": "http://example.org/"}. It would be straightforward to read the options from the queue in the init function.

adonig commented 2 months ago

Oh I see, I can just use the SameDomainFilter middleware, because it doesn't need the base_url.
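
For readers landing here with the same setup, here is a minimal, dependency-free sketch of the idea in this thread: turn an already-decoded queue message into a keyword list of options that carries only start_urls, and compare hosts directly instead of relying on a configured base_url. The module name JobOptions and both function names are hypothetical and are not part of Crawly. The host check only illustrates the general idea of same-domain filtering and makes no claim about how SameDomainFilter is implemented. Decoding the JSON body and handing the options to the spider are assumed to happen elsewhere.

```elixir
defmodule JobOptions do
  @moduledoc """
  Hypothetical helper (not part of Crawly) for jobs shaped like
  %{"start_urls" => [...], "base_url" => "..."} after JSON decoding.
  """

  # Build a keyword list of options from the decoded job map.
  # The "base_url" key is deliberately ignored: the assumption here is that
  # a same-domain check makes it unnecessary.
  def to_init_opts(%{"start_urls" => urls}) when is_list(urls) do
    [start_urls: urls]
  end

  # Return true when both URLs have the same, non-nil host.
  # Illustration only; this is not Crawly's filtering code.
  def same_host?(url_a, url_b) do
    host_a = URI.parse(url_a).host
    host_a != nil and host_a == URI.parse(url_b).host
  end
end
```

With the message from the first comment, JobOptions.to_init_opts/1 keeps only the start_urls entry, and JobOptions.same_host?/2 accepts links under example.org while rejecting links to other hosts.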