Crawly is an application framework for crawling websites and extracting structured data that can be used for a wide range of useful applications, like data mining, information processing, or historical archival.
Requirements: Elixir ~> 1.14
Create a new project: `mix new quickstart --sup`
Add Crawly and Floki as dependencies:
```elixir
# mix.exs
defp deps do
  [
    {:crawly, "~> 0.17.2"},
    {:floki, "~> 0.33.0"}
  ]
end
```
Fetch dependencies: `mix deps.get`
Create a spider:

```elixir
# lib/crawly_example/books_to_scrape.ex
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse response body to document
    {:ok, document} = Floki.parse_document(response.body)

    # Create items (on pages where items exist)
    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    # Extract requests to follow from the pagination links
    next_requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %Crawly.ParsedItem{items: items, requests: next_requests}
  end
end
```
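To sanity-check the spider before running a full crawl, you can fetch a page and call the callback by hand from iex. A quick sketch (`Crawly.fetch/1` fetches a URL with the configured fetcher; the expected counts assume the live books.toscrape.com catalogue):

```elixir
# In an `iex -S mix` session: fetch the page and run parse_item/1 manually.
response = Crawly.fetch("https://books.toscrape.com/")
%Crawly.ParsedItem{items: items, requests: requests} = BooksToScrape.parse_item(response)
length(items)    # should be 20 on the first catalogue page
length(requests) # one request, for the "next" pagination link
```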
New in 0.15.0: you can use a generator task to speed up spider creation; it produces a file with all the needed callbacks:

```
mix crawly.gen.spider --filepath ./lib/crawly_example/books_to_scrape.ex --spidername BooksToScrape
```
Configure Crawly
By default, Crawly does not require any configuration, but you will need one to fine-tune the crawls (in file: config/config.exs):
```elixir
import Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 100,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title, :price]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```
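Middlewares pre-process each request before it is fetched; pipelines process each scraped item in order. If the built-in pipelines are not enough, a custom one is just a module implementing the `Crawly.Pipeline` behaviour. A minimal sketch (the module name and the price-trimming logic are hypothetical):

```elixir
defmodule MyApp.NormalizePrice do
  @behaviour Crawly.Pipeline

  # run/3 receives the item, the shared pipeline state, and the options
  # from the config tuple. Return {item, state} to pass the item on,
  # or {false, state} to drop it.
  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    {Map.update(item, :price, nil, &String.trim/1), state}
  end
end
```

Enable it by adding `MyApp.NormalizePrice` to the `pipelines` list above.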
New in 0.15.0: you can generate an example config with the following command:

```
mix crawly.gen.config
```
Start the crawl:

```
iex -S mix run -e "Crawly.Engine.start_spider(BooksToScrape)"
```
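The spider can later be stopped from the same iex session:

```elixir
Crawly.Engine.stop_spider(BooksToScrape)
```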
Results can be seen with:

```
$ cat /tmp/BooksToScrape_<timestamp>.jl
```
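The output is in JSON Lines format (one JSON object per line), so it is easy to load back into Elixir. A sketch, assuming the Jason library is available as a dependency:

```elixir
# Replace <timestamp> with the actual value from the file name.
"/tmp/BooksToScrape_<timestamp>.jl"
|> File.stream!()
|> Stream.map(&Jason.decode!/1)
|> Enum.take(3)
```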
It's possible to run Crawly in standalone mode, where Crawly runs as a tiny Docker container and spiders are just YAML files or Elixir modules mounted inside it.
Please read more about it in the documentation.
Please use GitHub Discussions for all conversations related to the project.
Crawly can be configured so that all fetched pages are browser-rendered, which can be very useful if you need to extract data from pages that have lots of asynchronous elements (for example, parts loaded by AJAX).
You can read more in the documentation.
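As an illustration, browser rendering is enabled by swapping the fetcher. A sketch using the bundled Splash fetcher, assuming a Splash instance listens on localhost:8050:

```elixir
# config/config.exs — route all fetches through a local Splash instance
config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```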
Crawly provides a simple management UI, available by default on localhost:4001. It allows you to schedule and stop spiders and to view extracted items and logs.
NOTE: It's possible to disable the simple management UI (and REST API) with the `start_http_api?: false` option in the Crawly configuration.
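For example:

```elixir
# config/config.exs
config :crawly,
  start_http_api?: false
```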
You can choose to run the management UI as a plug in your application:

```elixir
defmodule MyApp.Router do
  use Plug.Router

  # ...

  forward "/admin", Crawly.API.Router

  # ...
end
```
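With this in place, the management UI (and REST API) is served under the /admin prefix of your application.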
We currently don't have the capacity to work on the experimental UI built with Phoenix and LiveView, and we keep it here mainly for demo purposes. The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders. Check out the code on GitHub.
To be discussed
We would gladly accept your contributions!
Please find the documentation on HexDocs.
Using Crawly on production? Please let us know about your case!
Copyright (c) 2019 Oleg Tarasenko
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Producing a new release:

```
git commit && git tag 0.xx.0 && git push origin master --follow-tags
mix docs
mix hex.publish
```