elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

GenServer timeout crash in long-running pipeline #230

Closed · starcraft66 closed this issue 1 year ago

starcraft66 commented 2 years ago

I wrote a crawler that crawls media galleries and downloads their files using HTTPoison in a custom pipeline. This works fine for small files, since they download quickly, but when I crawl a really big file such as a video, I almost always get this crash:

12:49:51.241 [error] GenServer #PID<0.471.0> terminating
** (stop) exited in: GenServer.call(Crawly.DataStorage, {:store, KemonoCrawler, %{...}}, 5000)
    ** (EXIT) exited in: GenServer.call(#PID<0.468.0>, :stats, 5000)
        ** (EXIT) time out
    (elixir 1.14.0) lib/gen_server.ex:1038: GenServer.call/3
    (elixir 1.14.0) lib/enum.ex:975: Enum."-each/2-lists^foreach/1-0-"/2
    (crawly 0.14.0) lib/crawly/worker.ex:173: Crawly.Worker.process_parsed_item/1
    (crawly 0.14.0) lib/crawly/worker.ex:50: Crawly.Worker.handle_info/2
    (stdlib 4.1) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.1) gen_server.erl:1200: :gen_server.handle_msg/6
    (stdlib 4.1) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: :work
State: %Crawly.Worker{backoff: 10000, spider_name: KemonoCrawler, crawl_id: "e8e7bb02-465f-11ed-85df-3a9adacac61c"}
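
For reference, the custom pipeline is shaped roughly like this. A simplified sketch: the module name and the item fields (`url`, `path`) are illustrative, not my exact code:

```elixir
defmodule KemonoCrawler.Pipelines.DownloadFile do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # The download runs synchronously inside run/3, so the storage
    # process is blocked for however long HTTPoison takes to fetch
    # the full response body.
    {:ok, %HTTPoison.Response{body: body}} =
      HTTPoison.get(item.url, [], recv_timeout: :infinity)

    File.write!(item.path, body)
    {item, state}
  end
end
```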

Is there a setting I can tweak to allow my pipeline to run for much longer? I didn't see anything easily tweakable in the config.

Ziinc commented 1 year ago

The 5000 in your trace is the default GenServer.call/3 timeout. Item pipelines run inside the data storage worker process, so a download that takes minutes blocks that process, and any call into it gives up after five seconds; I don't think that timeout is exposed in the config. I would suggest firing off a Task instead, or moving such heavy data processing to a separate process tree. Pipelines were designed for lightweight data processing.
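
Something in this direction, as a minimal sketch: the module and supervisor names are placeholders, and it assumes you add a Task.Supervisor to your application's supervision tree:

```elixir
defmodule KemonoCrawler.Pipelines.AsyncDownload do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Hand the slow download off to a supervised task so run/3 returns
    # immediately and the storage process is never blocked for minutes.
    Task.Supervisor.start_child(KemonoCrawler.TaskSupervisor, fn ->
      {:ok, %HTTPoison.Response{body: body}} =
        HTTPoison.get(item.url, [], recv_timeout: :infinity)

      File.write!(item.path, body)
    end)

    {item, state}
  end
end
```

Start the supervisor alongside your other children, e.g. `{Task.Supervisor, name: KemonoCrawler.TaskSupervisor}` in your `Application.start/2`. Note the task is fire-and-forget: a failed download won't crash the crawl, but it also won't surface in the pipeline, so log errors inside the task if you need visibility.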