fredwu / crawler

A high performance web crawler / scraper in Elixir.

Example Project for basic config #3

Closed: cdesch closed this 7 years ago

cdesch commented 7 years ago

I keep getting errors when dropping it into a blank project. What is the basic config for getting it up and running? An example might help, or instructions in the README on how to integrate it into a project. If I figure it out, I'll make a pull request.
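
For context, all I've added to the fresh project is the dependency (a minimal sketch; the version constraint is a guess, use whatever is current on Hex):

  # mix.exs (sketch)
  defp deps do
    [
      {:crawler, "~> 1.0"}
    ]
  end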

Error:


iex(1)> Scraper.crawl
:ok
iex(2)> 
17:31:26.677 [error] GenServer #PID<0.318.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.677 [error] GenServer #PID<0.317.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.677 [error] GenServer #PID<0.315.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.678 [error] GenServer #PID<0.319.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/scraper/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/learning.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/learning.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

17:31:26.678 [info]  Application crawler exited: shutdown
fredwu commented 7 years ago

Hi @cdesch, this is kind of "expected" for now because Crawler is under active development and has not yet reached a stable, usable stage. Over the next few weeks I will focus more on the connection handling part. :)

cdesch commented 7 years ago

@fredwu Does Crawler need to be started in OTP as its own worker or application?

like this:

 # Type "mix help compile.app" for more information
  def application do
    [applications: [
      :logger,
      :ecto,
      :postgrex,
     :crawler
      ]]
  end

or like this, supervised in my application:


defmodule MyApp.Application do
  @moduledoc false

  use Application

  # See http://elixir-lang.org/docs/stable/elixir/Application.html
  # for more information on OTP Applications
  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    # Define workers and child supervisors to be supervised
    children = [
      supervisor(MyApp.Repo, []),
      supervisor(Crawler, [])
    ]

    # See http://elixir-lang.org/docs/stable/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
fredwu commented 7 years ago

As an application, like in your first example.
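
A side note: on Elixir 1.4+ the applications list is inferred from your deps, so depending on :crawler is enough and the explicit list is only needed on older Elixir. A minimal sketch:

  # mix.exs on Elixir 1.4+ (sketch): started applications are inferred
  # from deps, so :crawler does not need to be listed explicitly
  def application do
    [extra_applications: [:logger]]
  end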

cdesch commented 7 years ago

Trying it out... still hitting some tough spots. I'll have to come back later to troubleshoot.

Here is the bare-bones progress I've slapped together so far: https://github.com/kickinespresso/crawler_example
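
The gist of it is just this, assuming Crawler.crawl/2 as the entry point (a sketch reconstructed from the log below; the start URL is guessed from the crawled paths, and the opts are taken from the "Last message" entries):

  defmodule CrawlerExample do
    # kick off a crawl; opts mirror the worker state in the log below
    def crawl do
      Crawler.crawl("http://elixir-lang.org", max_levels: 3, max_depths: 2)
    end
  end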

Issue:


iex(1)> CrawlerExample.crawl
:ok
iex(2)> 
13:26:40.766 [error] GenServer #PID<0.226.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/", level: 1, max_levels: 3, max_depths: 2, level: 0]

13:26:40.766 [error] GenServer #PID<0.228.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/install.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

13:26:40.766 [error] GenServer #PID<0.229.0> terminating
** (ArgumentError) argument error
    (kernel) gen_tcp.erl:149: :gen_tcp.connect/4
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_connect.erl:246: :hackney_connect.do_connect/5
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_connect.erl:37: :hackney_connect.connect/5
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney.erl:315: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
Last message: {:"$gen_cast", [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "/getting-started/introduction.html", level: 1, max_levels: 3, max_depths: 2, level: 0]

13:26:40.766 [error] GenServer #PID<0.255.0> terminating
** (ArgumentError) argument error
    :erlang.list_to_integer([])
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney_url.erl:204: :hackney_url.parse_netloc/2
    (hackney) /Users/cj/elixir_projects/crawler_example/deps/hackney/src/hackney.erl:331: :hackney.request/5
    (httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
    (crawler) lib/crawler/worker/fetcher.ex:47: Crawler.Worker.Fetcher.fetch_url/1
    (crawler) lib/crawler/worker.ex:12: Crawler.Worker.handle_cast/2
    (stdlib) gen_server.erl:601: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:667: :gen_server.handle_msg/5
Last message: {:"$gen_cast", [url: "irc://irc.freenode.net/elixir-lang", level: 1, max_levels: 3, max_depths: 2, level: 0]}
State: [url: "irc://irc.freenode.net/elixir-lang", level: 1, max_levels: 3, max_depths: 2, level: 0]

13:26:40.766 [info]  Application crawler exited: shutdown
fredwu commented 7 years ago

Your use of Crawler is fine; the errors are normal at this stage, as I haven't had a chance to address them yet.
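
To spell out what the traces show: relative links such as "/install.html" are handed to HTTPoison before being resolved against the referrer host (hence the :gen_tcp.connect/4 argument error), and non-HTTP schemes such as irc:// trip :hackney_url.parse_netloc/2. The fix direction is roughly this (a hypothetical helper for illustration, not Crawler's actual code):

  # resolve a relative link against its referrer before fetching;
  # URI.merge/2 leaves already-absolute links unchanged
  defp resolve_url(link, referrer_url) do
    referrer_url
    |> URI.merge(link)
    |> URI.to_string()
  end

  # URI.merge("http://elixir-lang.org/", "/install.html") |> to_string()
  # #=> "http://elixir-lang.org/install.html"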

cdesch commented 7 years ago

Ah, I'll see if I can make a pull request later if I figure it out.

If you want to push your active work to a develop branch, maybe I can help out.


fredwu commented 7 years ago

Hi @cdesch, I've pushed a few updates; these errors should be gone now.

cdesch commented 7 years ago

Thanks @fredwu - it works!

I made a PR with some README.md updates: pull request #4.

Some errors were thrown when running the test suite, although all the tests passed. Might be worth looking at.

cjsMBP15:crawler cj$ mix test
===> Compiling mimerl
===> Compiling metrics
===> Compiling unicode_util_compat
===> Compiling idna
==> gen_stage
Compiling 7 files (.ex)
Generated gen_stage app
===> Compiling ranch
==> ssl_verify_fun (compile)
Compiled src/ssl_verify_util.erl
Compiled src/ssl_verify_fingerprint.erl
Compiled src/ssl_verify_pk.erl
Compiled src/ssl_verify_hostname.erl
==> jsx
Compiling 9 files (.erl)
Generated jsx app
==> exjsx
Compiling 1 file (.ex)
Generated exjsx app
===> Compiling certifi
===> Compiling hackney
==> excoveralls
Compiling 22 files (.ex)
warning: String.to_char_list/1 is deprecated, use String.to_charlist/1
  lib/excoveralls/cover.ex:12

warning: String.to_char_list/1 is deprecated, use String.to_charlist/1
  lib/excoveralls/stats.ex:106

Generated excoveralls app
==> httpoison
Compiling 2 files (.ex)
Generated httpoison app
==> opq
Compiling 6 files (.ex)
Generated opq app
===> Compiling cowlib
src/cow_multipart.erl:392: Warning: crypto:rand_bytes/1 is deprecated and will be removed in a future release; use crypto:strong_rand_bytes/1

===> Compiling cowboy
===> Compiling mochiweb
src/mochiweb_session.erl:144: Warning: crypto:rand_bytes/1 is deprecated and will be removed in a future release; use crypto:strong_rand_bytes/1

src/mochiweb_multipart.erl:59: Warning: crypto:rand_bytes/1 is deprecated and will be removed in a future release; use crypto:strong_rand_bytes/1

==> floki
Compiling 1 file (.xrl)
Compiling 1 file (.erl)
Compiling 19 files (.ex)
Generated floki app
==> mime
Compiling 1 file (.ex)
warning: String.strip/1 is deprecated, use String.trim/1
  lib/mime.ex:28

Generated mime app
==> plug
Compiling 1 file (.erl)
Compiling 44 files (.ex)
Generated plug app
==> bypass
Compiling 5 files (.ex)
Generated bypass app
==> crawler
Compiling 20 files (.ex)
Generated crawler app
..................
12:14:19.580 [debug] Fetch failed 'not_fetched_yet?', with opts: [referrer_url: "http://localhost:64326/dir/page2.html", depth: 2, workers: 10, interval: 0, timeout: 5000, parser: Crawler.Parser, save_to: "/Users/cj/elixir_projects/crawler/test/tmp/integration", max_depths: 4, queue: {#PID<0.1648.0>, [worker: Crawler.Dispatcher.Worker, workers: 10, interval: 0, timeout: 5000, name: #PID<0.1648.0>, rate_limiter: #PID<0.1649.0>]}, url: "http://localhost:64328/page3.html"].

12:14:19.581 [debug] Fetch failed 'not_fetched_yet?', with opts: [referrer_url: "http://localhost:64328/page3.html", depth: 2, workers: 10, interval: 0, timeout: 5000, parser: Crawler.Parser, save_to: "/Users/cj/elixir_projects/crawler/test/tmp/integration", max_depths: 4, queue: {#PID<0.1648.0>, [worker: Crawler.Dispatcher.Worker, workers: 10, interval: 0, timeout: 5000, name: #PID<0.1648.0>, rate_limiter: #PID<0.1649.0>]}, url: "http://localhost:64328/dir/page4"].
.
12:14:19.586 [debug] Fetch failed 'within_fetch_depth?', with opts: [referrer_url: "http://localhost:64328/page5.html", depth: 4, workers: 10, interval: 0, timeout: 5000, parser: Crawler.Parser, save_to: "/Users/cj/elixir_projects/crawler/test/tmp/integration", max_depths: 4, queue: {#PID<0.1648.0>, [worker: Crawler.Dispatcher.Worker, workers: 10, interval: 0, timeout: 5000, name: #PID<0.1648.0>, rate_limiter: #PID<0.1649.0>]}, url: "http://localhost:64328/page6"].
.
12:14:19.610 [error] GenServer #PID<0.1671.0> terminating
** (stop) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (stdlib) gen.erl:261: :gen.do_for_proc/2
    (bypass) lib/bypass/instance.ex:298: Bypass.Instance.dispatch_awaiting_callers/1
    (bypass) lib/bypass/instance.ex:173: Bypass.Instance.do_handle_call/3
    (stdlib) gen_server.erl:615: :gen_server.try_handle_call/4
    (stdlib) gen_server.erl:647: :gen_server.handle_msg/5
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:put_expect_result, {"GET", "/timeout"}, #Reference<0.0.3.4393>, :ok_call}
State: %{callers_awaiting_down: [], callers_awaiting_exit: [{#PID<0.1786.0>, #Reference<0.0.6.5031>}], expectations: %{{"GET", "/"} => %{expected: :once, fun: #Function<6.19711896/1 in Crawler.FetcherTest."test success"/1>, request_count: 1, results: [:ok_call], retained_plugs: %{}}, {"GET", "/500"} => %{expected: :once, fun: #Function<0.19711896/1 in Crawler.FetcherTest."test failure: 500"/1>, request_count: 1, results: [:ok_call], retained_plugs: %{}}, {"GET", "/fail.html"} => %{expected: :once, fun: #Function<3.19711896/1 in Crawler.FetcherTest."test failure: unable to write"/1>, request_count: 1, results: [:ok_call], retained_plugs: %{}}, {"GET", "/page.html"} => %{expected: :once, fun: #Function<4.19711896/1 in Crawler.FetcherTest."test snap /page.html"/1>, request_count: 1, results: [:ok_call], retained_plugs: %{}}, {"GET", "/timeout"} => %{expected: :once, fun: #Function<1.19711896/1 in Crawler.FetcherTest."test failure: timeout"/1>, request_count: 1, results: [], retained_plugs: %{#Reference<0.0.3.4393> => #PID<0.1761.0>}}}, pass: false, port: 64332, ref: #Reference<0.0.7.3771>, socket: #Port<0.45598>, unknown_route_error: nil}
..............................................................
12:14:19.866 [debug] Fetch failed 'not_fetched_yet?', with opts: [referrer_url: "http://localhost:64354/link1", depth: 2, timeout: 5000, save_to: nil, parser: Crawler.Parser, max_depths: 3, workers: 3, interval: 100, queue: {#PID<0.2027.0>, [worker: Crawler.Dispatcher.Worker, workers: 3, interval: 100, timeout: 5000, name: #PID<0.2027.0>, rate_limiter: #PID<0.2028.0>]}, url: "http://localhost:64354/link2"].

12:14:19.867 [debug] Fetch failed 'within_fetch_depth?', with opts: [referrer_url: "http://localhost:64354/link3", depth: 3, timeout: 5000, save_to: nil, parser: Crawler.Parser, max_depths: 3, workers: 3, interval: 100, queue: {#PID<0.2027.0>, [worker: Crawler.Dispatcher.Worker, workers: 3, interval: 100, timeout: 5000, name: #PID<0.2027.0>, rate_limiter: #PID<0.2028.0>]}, url: "http://localhost:64354/link4"].
.

Finished in 0.6 seconds
83 tests, 0 failures

Randomized with seed 241388