rhnonose closed this pull request 12 months ago
Hmm interesting, nice find! I do have one question though, what happens if a worker dies? Would the entire process stop?
@fredwu it's supervised by `WorkerSupervisor`, which is a `ConsumerSupervisor` with a `one_for_one` strategy. The restart is `temporary`, so it won't be restarted, but that's on OPQ's side.
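For context, a minimal sketch of the supervision shape being described; the module names, the `subscribe_to` producer, and the `max_demand` value are assumptions for illustration, not OPQ's actual source. The key points are the `ConsumerSupervisor` started with `strategy: :one_for_one` and the child template using `restart: :temporary`, so a crashed worker is simply dropped rather than restarted:

```elixir
defmodule MyApp.WorkerSupervisor do
  # A ConsumerSupervisor starts one child per event received from the producer.
  use ConsumerSupervisor

  def start_link(arg) do
    ConsumerSupervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  def init(_arg) do
    children = [
      %{
        id: MyApp.Worker,
        # The queued event is appended as the argument to start_link/1.
        start: {MyApp.Worker, :start_link, []},
        # :temporary - a worker that crashes is not restarted; only the
        # event it was handling is lost, and crawling continues.
        restart: :temporary
      }
    ]

    ConsumerSupervisor.init(children,
      strategy: :one_for_one,
      subscribe_to: [{MyApp.Queue, max_demand: 10}]
    )
  end
end
```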
Thanks @rhnonose! I'll need to fix up the test suite (broken due to a newer version of Elixir, unrelated to this PR) and after that I'll merge this one in.
thanks so much @rhnonose!
I ran my crawler with your fork and I don't see it actually crawling pages. Is it working for you locally?
@robvolk it seems like it, but I'll take a closer look
@robvolk I had to increase the number of workers and depth a little bit to get similar numbers of results as before (`workers: 30`, `max_depth: 5`), but other than that it's working fine.
When I run the crawler with your change, it starts crawling, but then stops pretty quickly. I can't figure out why. I feel like it's now actually calling the parser, but I don't know for sure why it's stopping.
@robvolk send me an e-mail with how you're running the crawler, maybe I can help.
The test suite is now fixed on master. Any more info on @robvolk's issue on this branch?
@fredwu nothing so far. Merging with master, it still breaks 3 tests; I'll take a look and fix them.
@rhnonose's changes work! Thanks so much for fixing the crawler!! Now I'm able to crawl a site deep and my server doesn't run out of memory.
Had to use both of these to get it to work:
{:crawler, github: "rhnonose/crawler"},
{:opq, github: "rhnonose/opq", override: true},
@robvolk is master fixed yet or still have to use @rhnonose’s branch to not run out of memory on large sites?
@rhnonose ok I'll just try
{:crawler, github: "rhnonose/crawler"},
{:opq, github: "rhnonose/opq", override: true},
In addition to the latest crawler at master, I'm using the latest OPQ.
@robvolk what does your `mix.exs` look like? When I use
{:crawler, "~> 1.0"},
{:opq, "~> 3.1"}
I see a deps conflict:
Failed to use "opq" (version 3.1.0) because
crawler (version 1.0.0) requires ~> 2.0
mix.exs specifies ~> 3.1
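For reference, this is the clash that the `override: true` lines earlier in the thread work around; a rough sketch of the deps block in `mix.exs`, assuming crawler itself still behaves against opq 3.x:

```elixir
defp deps do
  [
    {:crawler, "~> 1.0"},
    # override: true tells Mix to accept this opq version even though
    # crawler's own requirement (~> 2.0) would otherwise reject it.
    {:opq, "~> 3.1", override: true}
  ]
end
```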
Another issue: when I was using
{:crawler, github: "rhnonose/crawler"},
{:opq, github: "rhnonose/opq", override: true}
and would call
iex(1)> Crawler.crawl("http://elixir-lang.org", max_depths: 2)
I would never get a successful fetch:
14:12:49.613 [debug] "Fetch failed 'not_fetched_yet?', with opts: %{assets: [], content_type: \"text/html\", depth: 1, encode_uri: false, headers: [{\"Server\", \"GitHub.com\"}, {\"Content-Type\", \"text/html; charset=utf-8\"}, {\"Last-Modified\", \"Tue, 14 Aug 2018 11:12:02 GMT\"}, {\"Access-Control-Allow-Origin\", \"*\"}, {\"Expires\", \"Thu, 16 Aug 2018 02:22:07 GMT\"}, {\"Cache-Control\", \"max-age=600\"}, {\"X-GitHub-Request-Id\", \"C724:518F:1E94E89:2891DED:5B74DD74\"}, {\"Content-Length\", \"20094\"}, {\"Accept-Ranges\", \"bytes\"}, {\"Date\", \"Thu, 16 Aug 2018 18:12:49 GMT\"}, {\"Via\", \"1.1 varnish\"}, {\"Age\", \"0\"}, {\"Connection\", \"keep-alive\"}, {\"X-Served-By\", \"cache-cmh8822-CMH\"}, {\"X-Cache\", \"HIT\"}, {\"X-Cache-Hits\", \"1\"}, {\"X-Timer\", \"S1534443169.476741,VS0,VE17\"}, {\"Vary\", \"Accept-Encoding\"}, {\"X-Fastly-Request-ID\", \"193a22bc2516efc319c0777a5146f158340e81f8\"}], html_tag: \"a\", interval: 0, max_depths: 2, modifier: Crawler.Fetcher.Modifier, parser: Crawler.Parser, queue: #PID<0.207.0>, referrer_url: \"http://elixir-lang.org\", retrier: Crawler.Fetcher.Retrier, save_to: nil, scraper: Crawler.Scraper, timeout: 5000, url: \"http://elixir-lang.org\", url_filter: Crawler.Fetcher.UrlFilter, user_agent: \"Crawler/1.0.0 (https://github.com/fredwu/crawler)\", workers: 10}."
Here's my mix.exs
@mazz I'm not sure if I need to reference @rhnonose's OPQ directly anymore, but I did need it back when he created his branch for `crawler`.
{:crawler, git: "https://github.com/fredwu/crawler.git"},
{:opq, github: "rhnonose/opq", override: true},
I think `worker.ex` doesn't need `use GenServer` anymore.
I did some more digging, and if I use this branch it does indeed fix the memory issue. It introduces a new issue though: URLs will be crawled over and over.
For example, URLs in a page's header get scraped and crawled repeatedly. I'm not sure why; it looks like they might not be getting marked as `processed` in the Store anymore. Any ideas?
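To be concrete about what I mean by "marked as processed", roughly a check like this; the function names `Crawler.Store.processed?/1` and `Crawler.Store.mark_processed/1` are placeholders, not necessarily Crawler's actual Store API:

```elixir
defmodule DedupSketch do
  # If URLs are never marked as processed, this guard never triggers and
  # the same links keep getting queued and crawled.
  def maybe_crawl(url, opts) do
    if Crawler.Store.processed?(url) do
      :skipped
    else
      Crawler.Store.mark_processed(url)
      Crawler.crawl(url, opts)
    end
  end
end
```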
> I'm not sure if I need to reference @rhnonose's OPQ directly anymore, but I did need it back when he created his branch for `crawler`.
This is no longer needed, as the changes in @rhnonose's fork have been merged into the main repo.
> I did some more digging, and if I use this branch it does indeed fix the memory issue. It introduces a new issue though: URLs will be crawled over and over. For example, URLs in a page's header get scraped and crawled repeatedly. I'm not sure why; it looks like they might not be getting marked as `processed` in the Store anymore. Any ideas?
I'd love others to help out on this issue; as I'm not using Crawler actively anymore, it's difficult for me to spend enough time and energy debugging it. I'm happy to incorporate fixes and/or enhancements once they're proven to work. :)
Thanks for the reply. If @rhnonose's changes were merged in, why am I seeing different behavior when I reference his fork and yours? When I reference your main branch, the memory climbs linearly, but when I reference his, it grows to a point, then stays flat (as resources are released).
There have been a bunch of changes since my fork, so it might be something new. Also, `opq` is not the latest.
With your new changes, have you noticed pages are scraped multiple times?
I haven't used the crawler in a while, unfortunately.
Where did things settle here? I'm in need of an Elixir crawler and am willing to devote the bandwidth needed to resolve this issue, but want to make sure I'm starting with proper assumptions.
@adamelmore Hmm, I don't believe there's been any progress on this...
Was there consensus on the approach taken in this PR, and we just need to iron out the kinks (pages being scraped multiple times)? Or should I approach #9 from scratch?
@adamelmore we ran our crawler without issues a couple of times. If you want to check the exact dependencies: https://github.com/emcasa/hammock. I had to use forks and, since we're not using it anymore, I didn't bother updating it later when the PRs got merged.
Is that a private repo, maybe? Getting a 404.
My apologies @adamelmore, we retired it and made the repo private, I think. Here are the deps:
{:crawler, github: "rhnonose/crawler"},
{:opq, github: "rhnonose/opq", override: true},
Hey folks, it's been a loooong time. I'm picking up Crawler again, and have made this commit to improve the memory usage: https://github.com/fredwu/crawler/commit/66969f8b58f6e121606ce22f4348723becbe830b
`Worker` is being spawned by OPQ's `WorkerSupervisor`, which is a GenStage `ConsumerSupervisor`, and apparently it doesn't dispose of the children after the demand is met, so every `GenServer` started for every event still hangs there, consuming the process count. This PR removes that and runs the worker directly.
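A rough sketch of the shape of that change; the module and function names below are placeholders rather than the exact code in this PR. The idea is that the queue's event handler runs the worker logic in a short-lived process instead of starting a `GenServer` per event, so nothing lingers once the fetch completes:

```elixir
defmodule MyCrawler.Dispatcher do
  # Called once per dequeued URL. The Task exits as soon as the work is
  # done, so no per-event process sticks around afterwards.
  def handle_event(url, opts) do
    Task.start(fn ->
      MyCrawler.Worker.run(url, opts)
    end)
  end
end
```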
Testing with this modification, memory consumption didn't surpass 500 MB, whereas before it would fill up my machine's memory.
Options used:
Solves #9