elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

js render problem #188

Closed ziyouchutuwenwu closed 3 years ago

ziyouchutuwenwu commented 3 years ago

Hi, I made a sample spider as a demo. The code looks like this:

defmodule EslSpider do
  use Crawly.Spider

  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)
    hrefs = document |> Floki.find("a.btn-link") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = document |> Floki.find("h1.page-title-sm") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end

config/config.exs

use Mix.Config

config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},
  closespider_timeout: 3600,
  concurrent_requests_per_domain: 8,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url, :title]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]

I start Splash with this command:

docker run -it --rm -d -p 8050:8050 scrapinghub/splash --max-timeout 3600

The iex shell output is:

iex(2)> Crawly.Engine.start_spider(EslSpider)

09:14:47.911 [debug] Starting the manager for Elixir.EslSpider

09:14:47.915 [debug] Starting requests storage worker for Elixir.EslSpider...

09:14:47.917 [debug] Started 8 workers for Elixir.EslSpider
:ok
iex(3)> 
09:14:53.926 [debug] Request to https://www.erlang-solutions.com/blog/, is scheduled for retry

09:14:53.927 [debug] Dropping request: https://www.erlang-solutions.com/blog/, as it's already processed

09:14:53.927 [debug] Crawly worker could not process the request to "https://www.erlang-solutions.com/blog/"
                  reason: %HTTPoison.Error{id: nil, reason: :timeout}

If I remove the Splash fetcher from the config, I do get the response, although a bit slowly.

I ran docker exec -it xxx bash to get into the Splash container, then executed curl --head https://www.erlang-solutions.com/blog. The response arrived after about 15 seconds.
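The roughly 6 seconds between starting the workers and the :timeout in the log above is consistent with HTTPoison's default receive timeout of 5 seconds, while the page itself took about 15 seconds from inside the container. A minimal sketch of one way to raise that limit, assuming the installed Crawly version ships Crawly.Middlewares.RequestOptions and that it forwards its options to HTTPoison. The timeout values are illustrative guesses, not tested settings:

```elixir
# config/config.exs (fragment): only the middlewares key is shown.
import Config

# Assumption: Crawly.Middlewares.RequestOptions exists in your Crawly version
# and attaches these options to every request before it reaches the fetcher.
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]},
    # timeout: connection timeout in ms; recv_timeout: time to wait for the
    # response in ms (HTTPoison's default recv_timeout is 5000).
    {Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 60_000]}
  ]
```

If the timeout error disappears with a larger recv_timeout, the slow target site rather than Splash itself was probably the cause.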

Then I modified config.exs:

fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html", wait: 60]},

and I got:

Started ElixirLS debugger v0.7.0
Elixir version: "1.12.1 (compiled with Erlang/OTP 23)"
Erlang version: "23"
ElixirLS compiled with Elixir 1.8.2 and erlang 21
Compiling 2 files (.ex)
Generated demo app

13:46:15.003 [debug] Starting data storage

13:46:17.569 [debug] Starting the manager for Elixir.EslSpider

13:46:17.570 [debug] Starting requests storage worker for Elixir.EslSpider...

13:46:17.571 [debug] Started 8 workers for Elixir.EslSpider

13:46:18.603 [info]  Dropping item: %{title: "", url: "https://www.erlang-solutions.com/blog/"}. Reason: missing required fields

13:47:17.574 [info]  Current crawl speed is: 0 items/min

13:47:17.575 [info]  Stopping EslSpider, itemcount timeout achieved

Many thanks.
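A note on the wait: 60 attempt: as far as I understand the Splash fetcher, every option other than base_url is appended to the render.html query string, so wait: 60 ends up as a Splash argument. Splash documents both wait and timeout as render.html arguments, with timeout defaulting to 30 seconds, so a 60 second wait without a larger timeout may be rejected or cut short. That would fit the empty title returned after about one second, but it is a guess. The sketch below only builds the URL that would be requested, so it can be pasted into a browser or curl to see what Splash really answers. SplashUrl is a hypothetical helper, not part of Crawly:

```elixir
defmodule SplashUrl do
  @moduledoc """
  Hypothetical helper (not part of Crawly): builds a Splash render.html URL
  from a base URL, the target page, and extra Splash arguments.
  """

  # opts is a keyword list of Splash arguments such as wait: 2, timeout: 60.
  def build(base_url, target_url, opts \\ []) do
    # Put the target page first, then append the remaining arguments.
    query =
      opts
      |> Keyword.put(:url, target_url)
      |> URI.encode_query()

    base_url <> "?" <> query
  end
end

url =
  SplashUrl.build(
    "http://localhost:8050/render.html",
    "https://www.erlang-solutions.com/blog/",
    wait: 2,
    timeout: 60
  )

IO.puts(url)
```

Keeping wait well below timeout (and timeout below the --max-timeout passed to the container) is the combination I would try first.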

oltarasenko commented 3 years ago

But why do you need Splash for that page at all? E.g. I can easily get all the data with:

response = Crawly.fetch("https://www.erlang-solutions.com/blog/")
%HTTPoison.Response{
  body: "<!DOCTYPE html>\n<html lang=\"en-US\" class=\"no-js\">\n\n<head>\n  <meta charset=\"UTF-8\">\n  <meta content=\"width=device-width, initial-scale=1.0, maximum-scale=1\" name=\"viewport\">\n  <link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"https://www.erlang-solutions.com/wp-content/themes/nopio_master_theme/assets/images/favicon/apple-touch-icon.png\">\n  <link rel=\"icon\" type=\"image/png\" sizes=\"32x32\" href=\"https://www.erlang-solutions.com/wp-content/themes/nopio_master_theme/assets/images/favicon/favicon-32x32.png\">\n  <link rel=\"icon\" type=\"image/png\" sizes=\"16x16\" href=\"https://www.erlang-solutions.com/wp-content/themes/nopio_master_theme/assets/images/favicon/favicon-16x16.png\">\n  <meta name=\"msapplication-TileColor\" content=\"#ffffff\">\n  <meta name=\"theme-color\" content=\"#ffffff\">\n\t\n  \t<!-- Google Optimize --> \t\n\t<script src=\"https://www.googleoptimize.com/optimize.js?id=OPT-5TG4NK6\"></script>\n\t<!-- Google Optimize --> \t\n\t\n  <!-- Google Tag Manager -->\n  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':\n  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],\n  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=\n  'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);\n  })(window,document,'script','dataLayer','GTM-KTN9QLQ');</script>\n  <!-- End Google Tag Manager -->\n  <meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />\n\n\t<!-- This site is optimized with the Yoast SEO plugin v16.2 - https://yoast.com/wordpress/plugins/seo/ -->\n\t<link media=\"all\" href=\"https://www.erlang-solutions.com/wp-content/cache/autoptimize/css/autoptimize_56f2446d64420c1896236903a8d9bd90.css\" rel=\"stylesheet\" /><title>Erlang, Elixir &amp; RabbitMQ resources - Erlang Solutions Blog</title>\n\t<meta name=\"description\" content=\"As supporters of open source tech and Erlang, 
Elixir, OTP on the BEAM, we share our knowledge and insights with the Community so that we can all grow.\" />\n\t<link rel=\"canonical\" href=\"https://www.erlang-solutions.com/blog/\" />\n\t<link rel=\"next\" href=\"https://www.erlang-solutions.com/blog/page/2/\" />\n\t<meta property=\"og:locale\" content=\"en_US\" />\n\t<meta property=\"og:type\" content=\"article\" />\n\t<meta property=\"og:title\" content=\"Erlang, Elixir &amp; RabbitMQ resources - Erlang Solutions Blog\" />\n\t<meta property=\"og:description\" content=\"As supporters of open source tech and Erlang, Elixir, OTP on the BEAM, we share our knowledge and insights with the Community so that we can all grow.\" />\n\t<meta property=\"og:url\" content=\"https://www.erlang-solutions.com/blog/\" />\n\t<meta property=\"og:site_name\" content=\"Erlang Solutions\" />\n\t<meta property=\"og:image\" content=\"https://www.erlang-solutions.com/wp-content/uploads/2021/02/Transparent-Erlang-Solutions-Logo-Black.png\" />\n\t<meta property=\"og:image:width\" content=\"8001\" />\n\t<meta property=\"og:image:height\" content=\"4500\" />\n\t<meta name=\"twitter:card\" content=\"summary_large_image\" />\n\t<meta name=\"twitter:site\" content=\"@ErlangSolutions\" />\n\t<script type=\"application/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https://schema.org\",\"@graph\":[{\"@type\":\"Organization\",\"@id\":\"https://www.erlang-solutions.com/#organization\",\"name\":\"Erlang 
Solutions\",\"url\":\"https://www.erlang-solutions.com/\",\"sameAs\":[\"https://www.facebook.com/ErlangSolutions/\",\"https://www.linkedin.com/company/erlangsolutions\",\"https://www.youtube.com/c/ErlangSolutions\",\"https://twitter.com/ErlangSolutions\"],\"logo\":{\"@type\":\"ImageObject\",\"@id\":\"https://www.erlang-solutions.com/#logo\",\"inLanguage\":\"en-US\",\"url\":\"https://www.erlang-solutions.com/wp-content/uploads/2021/02/Transparent-Erlang-Solutions-Logo-Black.png\",\"contentUrl\":\"https://www.erlang-solutions.com/wp-content/uploads/2021/02/Transparent-Erlang-Solutions-Logo-Black.png\",\"width\":8001,\"height\":4500,\"caption\":\"Erlang Solutions\"},\"image\":{\"@id\":\"https://www.erlang-solutions.com/#logo\"}},{\"@type\":\"WebSite\",\"@id\":\"https://www.erlang-solutions.com/#website\",\"url\":\"https://www.erlang-solutions.com/\",\"name\":\"Erlang Solutions\",\"description\":\"Buildin" <> ...,

{:ok, document} = Floki.parse_document(response.body)
{:ok,
 [
   {"html", [{"lang", "en-US"}, {"class", "no-js"}],
    [
      {"head", [],
       [
         {"meta", [{"charset", "UTF-8"}], []},
         {"meta",
          [
            {"content",
             "width=

iex(8)> hrefs = document |> Floki.find("a.btn-link") |> Floki.attribute("href")
["https://www.erlang-solutions.com/blog/the-future-for-erlang-solutions/",
 "https://www.erlang-solutions.com/blog/fintech-matters-newsletter-1-july-2021/",
 "https://www.erlang-solutions.com/blog/blockchain-fintech-and-the-beam/",
 "https://www.erlang-solutions.com/blog/5-erlang-and-elixir-use-cases-in-fintech/",
 "https://www.erlang-solutions.com/blog/lessons-fintech-can-learn-from-telecom-part-two/",
 "https://www.erlang-solutions.com/blog/how-to-use-rabbitmq-in-service-integration/",
 "https://www.erlang-solutions.com/blog/lessons-fintech-can-learn-from-telecom-part-one/",
 "https://www.erlang-solutions.com/blog/erlang-solutions-partners-with-cockroach-labs/",
 "https://www.erlang-solutions.com/blog/how-to-ensure-your-instant-messaging-solution-offers-users-privacy-and-security/",
 "https://www.erlang-solutions.com/blog/fintech-client-case-studies-erlang-solutions-and-trifork/"]

ziyouchutuwenwu commented 3 years ago

Thanks for your quick help. This was just an example; indeed, this URL does not need Splash.

I just wanted to try Splash, but found that it does not work.

oltarasenko commented 3 years ago

Sorry, I have not used Splash for some time, so it is a bit hard for me to advise here.

ziyouchutuwenwu commented 3 years ago

Thanks. So for JS rendering, is there a newer solution, or should I stick with the current one?

oltarasenko commented 3 years ago

I am looking at headless Chrome, as it seems to be much more reliable. I suggest having a look at: https://oltarasenko.medium.com/building-a-chrome-based-fetcher-for-crawly-a779e9a8d9d0?sk=2dbb4d39cdf319f01d0fa7c05f9dc9ec

I have not had time to add it to Crawly yet, as I had to work on another commercial product and have not had a chance to contribute :(

ziyouchutuwenwu commented 3 years ago

Thank you very much!