I am writing an Elixir web scraper using the "Crawly" library. I have created two modules. The first, "ManufacturersToScrape",
collects a list of URLs that it passes to the second, "ModelsToScrape". The source code for both modules is shown
below under headings that match the module source files, to show that the module names are consistent with
the source file names. Both modules compile without errors, and I know that the "ManufacturersToScrape" module collects
well-formed URLs as expected. However, at runtime I see multiple pipeline errors similar to this:
"[error] Pipeline crash by call: Crawly.Middlewares.UniqueRequest.run(%Crawly.Request{url: [url: "https://www.anchorvans.co.uk/specifications/vauxhall/"
The errors show that the URLs themselves are being created correctly. The application appears to fail when the variable "next_requests"
is passed to the "ModelsToScrape" module. This module doesn't do much; I just wanted to prove that the URLs are being passed in correctly,
and clearly they are not.
I have provided an execution log below under heading "execution log".
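One thing visible in the error is that the `url` field of the request holds a keyword list (`[url: ..., spider: ModelsToScrape]`) rather than a plain string. Below is a minimal sketch of that difference. `DemoRequest` and `RequestShapes` are hypothetical stand-ins introduced only for illustration (they are not part of Crawly or of my modules), so the snippet runs without Crawly installed:

```elixir
# Sketch only: DemoRequest is a hypothetical stand-in for %Crawly.Request{}.
defmodule DemoRequest do
  defstruct url: nil, headers: [], options: []
end

defmodule RequestShapes do
  # Shape seen in the error log: the whole keyword list ends up in :url.
  def malformed(url, spider), do: %DemoRequest{url: [url: url, spider: spider]}

  # Shape with :url as a plain binary.
  def well_formed(url), do: %DemoRequest{url: url}

  # True only when the :url field is a string.
  def valid_url?(%DemoRequest{url: url}), do: is_binary(url)
end

urls = ["https://www.anchorvans.co.uk/specifications/vauxhall/"]

bad = Enum.map(urls, &RequestShapes.malformed(&1, ModelsToScrape))
good = Enum.map(urls, &RequestShapes.well_formed/1)

IO.inspect(Enum.map(bad, &RequestShapes.valid_url?/1), label: "keyword list in :url")
IO.inspect(Enum.map(good, &RequestShapes.valid_url?/1), label: "string in :url")
```

As far as I can tell, `Crawly.Utils.request_from_url/1` takes only the URL string, so passing it a keyword list may be what produces the first shape, but that is an assumption on my part.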
@impl Crawly.Spider
@doc """
Extract items and requests to follow from the given response
"""
def parse_item(response) do
{:ok, document} = Floki.parse_document(response.body)
Firstly, please can you confirm whether the call:
is the correct approach. In iex, passing a list to the Enum call above returns the following data structure, which looks good to me:
models_to_scrape.ex
defmodule ModelsToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://www.anchorvans.co.uk/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://www.anchorvans.co.uk/"]]
  end

  @impl Crawly.Spider
  @doc """
  Extract items and requests to follow from the given response
  """
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)
  end
end
execution log
. . . 09:55:57.027 [error] Pipeline crash by call: Crawly.Middlewares.UniqueRequest.run(%Crawly.Request{url: [url: "https://www.anchorvans.co.uk/specifications/volkswagen/", spider: ModelsToScrape], headers: [], prev_response: %HTTPoison.Response{status_code: 200, body: "<!DOCTYPE html>\n\n\n\n\n <html lang=\"en\" prefix=\"fb: http://www.facebook.com/2008/fbml\" class=\"no-js\"> \n <head prefix=\"og: http://ogp.me/ns# object: http://ogp.me/ns/object#\">\n <meta charset=\"utf-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, maximum-scale=1\">\n