fredwu / crawler

A high performance web crawler / scraper in Elixir.

Do not spawn a GenServer for every worker run #11

Closed · rhnonose closed 12 months ago

rhnonose commented 6 years ago

Worker is spawned by OPQ's WorkerSupervisor, which is a GenStage ConsumerSupervisor. Apparently it doesn't dispose of its children after the demand is met, so every GenServer started for an event keeps hanging around, inflating the process count.

This PR removes that and runs the worker directly.
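Conceptually the change looks like this (a minimal sketch with illustrative names, not the PR's actual diff):

    # Before: a GenServer was started for every queued URL under the
    # ConsumerSupervisor, and the finished process lingered:
    #
    #   {:ok, pid} = GenServer.start_link(Crawler.Worker, opts)
    #   GenServer.cast(pid, :crawl)
    #
    # After: the work runs directly in the consuming process, so nothing
    # is left behind once the demand is met.
    defmodule Worker do
      def run(opts) do
        opts
        |> fetch()
        |> parse()
      end

      # Placeholders for the real fetch/parse pipeline.
      defp fetch(opts), do: opts
      defp parse(opts), do: opts
    end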

Testing with this modification, memory consumption didn't surpass 500 MB, whereas before it would fill up my machine's memory.

Options used:

  workers: 10,
  interval: 500,
  timeout: 5000,
  max_depths: 5,

Solves #9

coveralls commented 6 years ago


Coverage decreased (-0.3%) to 98.837% when pulling e0ba256cb4f81e6edfa405febcbc5741ac007f27 on rhnonose:master into fedd5a0c27b345d1db5061304c9b7fdcdfea887c on fredwu:master.

fredwu commented 6 years ago

Hmm interesting, nice find! I do have one question though, what happens if a worker dies? Would the entire process stop?

rhnonose commented 6 years ago

@fredwu it's supervised by WorkerSupervisor, which is a ConsumerSupervisor with a one_for_one strategy. The restart type is temporary, so a crashed worker won't be restarted, but that's on OPQ's side.
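For context, the supervision shape being described looks roughly like this (a sketch with illustrative names, not OPQ's actual code):

    defmodule WorkerSupervisor do
      use ConsumerSupervisor

      def start_link(producer) do
        ConsumerSupervisor.start_link(__MODULE__, producer)
      end

      def init(producer) do
        # restart: :temporary means a crashed worker is dropped, not
        # restarted, so one dying worker doesn't stop the pipeline.
        children = [
          %{
            id: Worker,
            start: {Worker, :start_link, []},
            restart: :temporary
          }
        ]

        ConsumerSupervisor.init(children,
          strategy: :one_for_one,
          subscribe_to: [{producer, max_demand: 10}]
        )
      end
    end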

fredwu commented 6 years ago

Thanks @rhnonose! I'll need to fix up the test suite (broken due to newer version of Elixir, unrelated to this PR) and after that I'll merge this one in.

robvolk commented 6 years ago

Thanks so much @rhnonose!

I ran my crawler with your fork and I don't see it actually crawling pages. Is it working for you locally?

rhnonose commented 6 years ago

@robvolk it seems like it, but I'll take a closer look

rhnonose commented 6 years ago

@robvolk I had to increase the number of workers and the depth a little to get a similar number of results as before (workers: 30, max_depths: 5), but other than that it's working fine.

robvolk commented 6 years ago

When I run the crawler with your change, it starts crawling but then stops pretty quickly, and I can't figure out why. I feel like it's now actually calling the parser, but I don't know for sure why it's stopping.

rhnonose commented 6 years ago

@robvolk send me an e-mail with how you're running the crawler; maybe I can help.

fredwu commented 6 years ago

The test suite is now fixed on master. Any more info on @robvolk's issue on this branch?

rhnonose commented 6 years ago

@fredwu nothing so far. Merging with master, it still breaks 3 tests; I'll take a look and fix it.

robvolk commented 6 years ago

@rhnonose's changes work! Thanks so much for fixing the crawler!! Now I'm able to crawl deep into a site and my server doesn't run out of memory.

Had to use both of these to get it to work:

    {:crawler, github: "rhnonose/crawler"},
    {:opq, github: "rhnonose/opq", override: true},

mazz commented 6 years ago

@robvolk is master fixed yet or still have to use @rhnonose’s branch to not run out of memory on large sites?

rhnonose commented 6 years ago

@mazz you can try using the latest version of opq, which addresses some similar issues. The PR was merged, but I'm not sure if there's a patch release on Hex.

mazz commented 6 years ago

@rhnonose OK, I'll just try:

    {:crawler, github: "rhnonose/crawler"},
    {:opq, github: "rhnonose/opq", override: true},

robvolk commented 6 years ago

In addition to the latest crawler at master, I'm using the latest OPQ.

mazz commented 6 years ago

@robvolk what does your mix.exs look like? When I use

      {:crawler, "~> 1.0"},
      {:opq, "~> 3.1"}

I see a deps conflict:

    Failed to use "opq" (version 3.1.0) because
      crawler (version 1.0.0) requires ~> 2.0
      mix.exs specifies ~> 3.1
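Mix resolves a single opq version for the whole dependency tree, so the top-level ~> 3.1 clashes with crawler's declared ~> 2.0. An override (as used elsewhere in this thread) forces the top-level choice, though whether crawler 1.0 actually works against opq 3.x is a separate question:

    # In mix.exs: override: true makes Mix accept the top-level version
    # even though crawler declares a narrower requirement.
    defp deps do
      [
        {:crawler, "~> 1.0"},
        {:opq, "~> 3.1", override: true}
      ]
    end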

Another issue: when I was using

      {:crawler, github: "rhnonose/crawler"},
      {:opq, github: "rhnonose/opq", override: true}

and would call:

    iex(1)> Crawler.crawl("http://elixir-lang.org", max_depths: 2)

I would never get a successful fetch:

14:12:49.613 [debug] "Fetch failed 'not_fetched_yet?', with opts: %{assets: [], content_type: \"text/html\", depth: 1, encode_uri: false, headers: [{\"Server\", \"GitHub.com\"}, {\"Content-Type\", \"text/html; charset=utf-8\"}, {\"Last-Modified\", \"Tue, 14 Aug 2018 11:12:02 GMT\"}, {\"Access-Control-Allow-Origin\", \"*\"}, {\"Expires\", \"Thu, 16 Aug 2018 02:22:07 GMT\"}, {\"Cache-Control\", \"max-age=600\"}, {\"X-GitHub-Request-Id\", \"C724:518F:1E94E89:2891DED:5B74DD74\"}, {\"Content-Length\", \"20094\"}, {\"Accept-Ranges\", \"bytes\"}, {\"Date\", \"Thu, 16 Aug 2018 18:12:49 GMT\"}, {\"Via\", \"1.1 varnish\"}, {\"Age\", \"0\"}, {\"Connection\", \"keep-alive\"}, {\"X-Served-By\", \"cache-cmh8822-CMH\"}, {\"X-Cache\", \"HIT\"}, {\"X-Cache-Hits\", \"1\"}, {\"X-Timer\", \"S1534443169.476741,VS0,VE17\"}, {\"Vary\", \"Accept-Encoding\"}, {\"X-Fastly-Request-ID\", \"193a22bc2516efc319c0777a5146f158340e81f8\"}], html_tag: \"a\", interval: 0, max_depths: 2, modifier: Crawler.Fetcher.Modifier, parser: Crawler.Parser, queue: #PID<0.207.0>, referrer_url: \"http://elixir-lang.org\", retrier: Crawler.Fetcher.Retrier, save_to: nil, scraper: Crawler.Scraper, timeout: 5000, url: \"http://elixir-lang.org\", url_filter: Crawler.Fetcher.UrlFilter, user_agent: \"Crawler/1.0.0 (https://github.com/fredwu/crawler)\", workers: 10}."

robvolk commented 6 years ago

Here's my mix.exs @mazz. I'm not sure if I need to reference @rhnonose's OPQ directly anymore, but I did need it back when he created his branch for crawler.

      {:crawler, git: "https://github.com/fredwu/crawler.git"},
      {:opq, github: "rhnonose/opq", override: true},

Awlexus commented 6 years ago

I think worker.ex doesn't need `use GenServer` anymore.

robvolk commented 5 years ago

I did some more digging, and using this branch does indeed fix the memory issue. It introduces a new issue, though: URLs will be crawled over and over.

For example, URLs in a page's header get scraped and crawled repeatedly. I'm not sure why; it looks like they're no longer being marked as processed in the Store. Any ideas?
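The marking step being described can be sketched with ETS (illustrative only; Crawler's actual Store differs):

    # :ets.insert_new/2 is atomic and returns false when the key already
    # exists, so a URL can be checked and marked in one step.
    defmodule DedupStore do
      def init do
        :ets.new(:processed_urls, [:set, :public, :named_table])
      end

      # Returns true only the first time a URL is seen.
      def new?(url), do: :ets.insert_new(:processed_urls, {url})
    end

    # Usage: only enqueue a URL the first time it's encountered.
    # if DedupStore.new?(url), do: enqueue(url)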

fredwu commented 5 years ago

> I'm not sure if I need to reference @rhnonose's OPQ directly anymore, but I did need it back when he created his branch for crawler.

This is no longer needed, as the changes in @rhnonose's fork have been merged into the main repo.

> I did some more digging, and using this branch does indeed fix the memory issue. It introduces a new issue, though: URLs will be crawled over and over.
>
> For example, URLs in a page's header get scraped and crawled repeatedly. I'm not sure why; it looks like they're no longer being marked as processed in the Store. Any ideas?

I'd love others to help out on this issue; as I'm not using Crawler actively any more, it's difficult for me to spend enough time and energy debugging it. I'm happy to incorporate fixes and/or enhancements once they're proven to work. :)

robvolk commented 5 years ago

Thanks for the reply. If @rhnonose's changes were merged in, why am I seeing different behavior when I reference his fork versus yours? When I reference your main branch, memory climbs linearly, but when I reference his, it grows to a point and then stays flat (as resources are released).

rhnonose commented 5 years ago

There have been a bunch of changes since my fork, so it might be something new. Also, opq is not the latest.

robvolk commented 5 years ago

With your new changes, have you noticed pages are scraped multiple times?

rhnonose commented 5 years ago

I haven't used the crawler in a while, unfortunately.

adamdottv commented 5 years ago

Where did things settle here? I'm in need of an Elixir crawler and am willing to devote the bandwidth needed to resolve this issue, but want to make sure I'm starting with proper assumptions.

fredwu commented 5 years ago

@adamelmore Hmm, I don't believe there's been any progress on this...

adamdottv commented 5 years ago

Was there consensus on the approach taken in this PR, and we just need to iron out the kinks (pages being scraped multiple times)? Or should I approach #9 from scratch?

rhnonose commented 5 years ago

@adamelmore we ran our crawler without issues a couple of times. If you want to check the exact dependencies: https://github.com/emcasa/hammock. I had to use forks and, since we're not using it anymore, I didn't bother fixing it later when the PRs got merged.

adamdottv commented 5 years ago

Is that a private repo, maybe? Getting a 404.

rhnonose commented 5 years ago

My apologies @adamelmore, we retired it and made it private, I think. Here are the deps:

    {:crawler, github: "rhnonose/crawler"},
    {:opq, github: "rhnonose/opq", override: true},

fredwu commented 12 months ago

Hey folks, it's been a loooong time. I'm picking up Crawler again, and have made this commit to improve the memory usage: https://github.com/fredwu/crawler/commit/66969f8b58f6e121606ce22f4348723becbe830b