A high-performance web crawler / scraper in Elixir, with worker pooling and rate limiting via OPQ.
See Hex documentation.
Below is a very high level architecture diagram demonstrating how Crawler works.
Basic usage:
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
There are several ways to access the crawled page data:
- Crawler.Store
- Crawler.Store.DB
- If the :save_to option is set, pages will be saved to disk in addition to the places mentioned above.

Option | Type | Default Value | Description |
---|---|---|---|
:assets | list | [] | Asset file types to fetch. Available options: "css", "js", "images". |
:save_to | string | nil | When provided, the path for saving crawled pages. |
:workers | integer | 10 | Maximum number of concurrent workers for crawling. |
:interval | integer | 0 | Rate limit control: number of milliseconds before crawling more pages. Defaults to 0, which is effectively no rate limit. |
:max_depths | integer | 3 | Maximum nested depth of pages to crawl. |
:max_pages | integer | :infinity | Maximum number of pages to crawl. |
:timeout | integer | 5000 | Timeout value for fetching a page, in ms. Can also be set to :infinity, which is useful when combined with Crawler.pause/1. |
:retries | integer | 2 | Number of times to retry a fetch. |
:store | module | nil | Module for storing the crawled page data and crawling metadata. You can set it to Crawler.Store or use your own module; see Crawler.Store.add_page_data/3 for implementation details. |
:force | boolean | false | Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data. |
:scope | term | nil | Similar to :force, but you can pass a custom :scope to determine how Crawler should treat links it has already seen. |
:user_agent | string | Crawler/x.x.x (...) | User-Agent value sent by the fetch requests. |
:url_filter | module | Crawler.Fetcher.UrlFilter | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
:retrier | module | Crawler.Fetcher.Retrier | Custom fetch retrier, useful for retrying failed crawls; nullifies the :retries option. |
:modifier | module | Crawler.Fetcher.Modifier | Custom modifier, useful for adding custom request headers or options. |
:scraper | module | Crawler.Scraper | Custom scraper, useful for scraping content as soon as the parser parses it. |
:parser | module | Crawler.Parser | Custom parser, useful for handling parsing differently or for adding extra functionality. |
:encode_uri | boolean | false | When set to true, applies URI.encode to the URL to be crawled. |
:queue | pid | nil | You can pass in an OPQ pid so that multiple crawlers can share the same queue. |
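As a rough illustration of how several of these options combine, the sketch below builds an options list for a slow, shallow crawl. The specific values and the /tmp/crawled path are illustrative assumptions rather than recommendations. The final call assumes the :crawler dependency is installed and is skipped otherwise.

```elixir
# Illustrative values only; tune them for the site being crawled.
crawl_opts = [
  # at most 5 concurrent workers
  workers: 5,
  # wait 1000 ms before crawling more pages
  interval: 1_000,
  # follow links at most 2 levels deep
  max_depths: 2,
  # stop after 100 pages
  max_pages: 100,
  # give each fetch 10 seconds
  timeout: 10_000,
  # retry a failed fetch up to 3 times
  retries: 3,
  # also fetch stylesheets and scripts
  assets: ["css", "js"],
  # hypothetical output directory
  save_to: "/tmp/crawled"
]

# Only start crawling when the library is actually available.
if Code.ensure_loaded?(Crawler) do
  Crawler.crawl("https://elixir-lang.org", crawl_opts)
end
```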
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
Crawler uses ElixirRetry's exponential backoff strategy by default.
defmodule CustomRetrier do
@behaviour Crawler.Fetcher.Retrier.Spec
end
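A minimal sketch of what a filled-in retrier might look like, replacing exponential backoff with a fixed delay. It assumes the behaviour expects a perform/2 callback that receives a zero-arity fetch function plus the crawl options, and that a failed fetch is reported as {:error, reason}; check Crawler.Fetcher.Retrier for the actual contract. The module name, attempt count and delay are hypothetical.

```elixir
defmodule FixedDelayRetrier do
  # Assumed callback shape: perform(fetch_fun, opts). Without the :crawler
  # dependency this line only produces a compile-time warning.
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Hypothetical policy: at most 3 attempts, 100 ms apart.
  @max_attempts 3
  @delay_ms 100

  def perform(fetch_url, _opts), do: attempt(fetch_url, 1)

  # Last attempt: return whatever the fetch produced, success or error.
  defp attempt(fetch_url, @max_attempts), do: fetch_url.()

  defp attempt(fetch_url, n) do
    case fetch_url.() do
      {:error, _reason} ->
        # Pause for a fixed time, then try again.
        Process.sleep(@delay_ms)
        attempt(fetch_url, n + 1)

      result ->
        result
    end
  end
end
```

Under those assumptions it could be plugged in with the :retrier option, for example Crawler.crawl(url, retrier: FixedDelayRetrier).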
See Crawler.Fetcher.UrlFilter.
defmodule CustomUrlFilter do
@behaviour Crawler.Fetcher.UrlFilter.Spec
end
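As a hedged example, a filter that keeps the crawl on a single host could look roughly like this. It assumes the behaviour expects a filter/2 callback that receives the URL and the crawl options and returns {:ok, boolean}; the module name and the allowed host are hypothetical.

```elixir
defmodule SameHostUrlFilter do
  # Assumed callback shape: filter(url, opts) returning {:ok, boolean}.
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Hypothetical host to stay on.
  @allowed_host "elixir-lang.org"

  def filter(url, _opts) do
    # URI.parse/1 returns host: nil for relative URLs,
    # which simply fails the comparison below.
    {:ok, URI.parse(url).host == @allowed_host}
  end
end
```

It would then be passed in through the :url_filter option.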
See Crawler.Scraper.
defmodule CustomScraper do
@behaviour Crawler.Scraper.Spec
end
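A minimal sketch of a scraper that prints each page's title. It assumes the behaviour expects a scrape/1 callback that receives a page carrying :url and :body fields and returns {:ok, page}; the module name and the regex-based title extraction are illustrative only.

```elixir
defmodule TitleScraper do
  # Assumed callback shape: scrape(page) returning {:ok, page}, where the
  # page carries :url and :body fields.
  @behaviour Crawler.Scraper.Spec

  def scrape(%{url: url, body: body} = page) do
    IO.puts("#{url}: #{title(body)}")
    # Hand the page back unchanged so the rest of the pipeline can continue.
    {:ok, page}
  end

  # Naive <title> extraction; a real scraper would use an HTML parser.
  def title(body) do
    case Regex.run(~r/<title[^>]*>(.*?)<\/title>/is, body) do
      [_, text] -> String.trim(text)
      nil -> "(no title)"
    end
  end
end
```

It would then be passed in through the :scraper option.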
See Crawler.Parser.
defmodule CustomParser do
@behaviour Crawler.Parser.Spec
end
See Crawler.Fetcher.Modifier.
defmodule CustomModifier do
@behaviour Crawler.Fetcher.Modifier.Spec
end
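As a sketch under stated assumptions, a modifier that adds request headers might look like the following. It assumes the behaviour expects a headers/1 callback returning a list of header tuples and an opts/1 callback returning extra request options; the module name and header values are hypothetical.

```elixir
defmodule CustomHeadersModifier do
  # Assumed callback shapes: headers(opts) returning a list of header tuples
  # and opts(opts) returning extra request options.
  @behaviour Crawler.Fetcher.Modifier.Spec

  # Hypothetical headers sent with every fetch request.
  def headers(_opts) do
    [{"Accept-Language", "en"}, {"Referer", "https://elixir-lang.org"}]
  end

  # No extra request options in this sketch.
  def opts(_opts), do: []
end
```

It would then be passed in through the :modifier option.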
Crawler provides pause/1, resume/1 and stop/1, as shown below.
{:ok, opts} = Crawler.crawl("https://elixir-lang.org")
Crawler.running?(opts) # => true
Crawler.pause(opts)
Crawler.running?(opts) # => false
Crawler.resume(opts)
Crawler.running?(opts) # => true
Crawler.stop(opts)
Crawler.running?(opts) # => false
Please note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity), otherwise the parser will time out due to unprocessed links.
It is possible to start multiple crawlers sharing the same queue.
{:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)
Crawler.crawl("https://elixir-lang.org", queue: queue)
Crawler.crawl("https://github.com", queue: queue)
To list all the URLs that have been scraped:
Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]
This example performs a Google search, then scrapes the results to find GitHub projects and outputs their names and descriptions.
See the source code.
You can run the example by cloning the repo and running the command:
mix run -e "Crawler.Example.GoogleSearch.run()"
For the API reference, please see https://hexdocs.pm/crawler.
For the list of changes, please see CHANGELOG.md.
Copyright (c) 2016 Fred Wu
This work is free. You can redistribute it and/or modify it under the terms of the MIT License.