fredwu / crawler

A high performance web crawler / scraper in Elixir.

Crawler killed from consuming all memory or processes #9

Closed: robvolk closed this issue 1 year ago

robvolk commented 6 years ago

I'm having trouble getting the crawler to successfully crawl with a depth of more than 2. It's able to filter many links and scrape pages, but after a couple of minutes the process gets killed. At the end it outputs:

erl_child_setup closed
Killed

I don't see any errors on the console and the erl_crash.dump doesn't really say much. I see a bunch of references to ('Elixir.Crawler.QueueHandler':enqueue/1 + 248), but that's all.
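(Side note: a dump like this is usually easier to browse with OTP's graphical crashdump viewer than by reading the file by hand. A minimal sketch, assuming the observer application is available and the dump sits in the current directory:)

```elixir
# Open the crash dump in OTP's crashdump_viewer (ships with observer).
:crashdump_viewer.start(~c"erl_crash.dump")
```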

Has anyone else seen this?

robvolk commented 6 years ago

I figured out the culprit, but I don't know how to fix it. I'm hitting the memory limit on my machine and the Erlang VM is killing my crawler process.

From what I can tell, the crawler first crawls all pages and loads the HTML into memory, then passes it to the scraper to process. How could we change the logic to allow the scrapers to process the content while the crawler runs? This way we can dispose of the page content that we've already processed, and allow the crawler to run on really really big sites.
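For illustration, a minimal sketch of that idea (not Crawler's actual internals): each worker hands the page body straight to the scraper and returns only the extracted links, so the full HTML can be garbage-collected as soon as the function returns. The HTTPoison call and the link-extraction helper below are assumptions made for the example.

```elixir
defmodule StreamingScrapeSketch do
  # Illustrative only: fetch a page, scrape it immediately, and keep just the
  # (small) list of links instead of accumulating page bodies in memory.
  def crawl_page(url, scraper_fun) do
    {:ok, %HTTPoison.Response{body: body}} = HTTPoison.get(url)

    scraper_fun.(body)
    extract_links(body)
  end

  # Hypothetical helper; a real implementation would likely use an HTML
  # parser such as Floki rather than a regex.
  defp extract_links(body) do
    ~r/href="([^"]+)"/
    |> Regex.scan(body, capture: :all_but_first)
    |> List.flatten()
  end
end
```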

fredwu commented 6 years ago

Hey @robvolk, how are you using Crawler, as in, with what options? I recommend you use the :interval option to rate limit, if you haven't already.
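For reference, a hedged example of passing those options (option names as documented in the README; defaults may differ between Crawler versions):

```elixir
Crawler.crawl("http://example.com",
  max_depths: 2,  # keep the crawl shallow while testing
  workers: 10,    # cap the number of concurrent workers
  interval: 500   # wait 500ms between requests to rate limit
)
```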

As for the second part of your question, I'm not sure if I understand it. At the moment a Crawler worker will crawl one single page, load that HTML in memory, then ask other workers (when they become available) to perform on the links it finds on that given page. The architecture diagram in the README reflects this approach. Would you be able to draw up your proposed approach?

robvolk commented 6 years ago

Thanks for replying so fast. I'm trying out different configuration options to get the crawler to work with the limited amount of memory on my box. Your design works well, and I don't think anything needs to change in the system itself; I just need to find the optimal configuration settings to get it to work right.

If I use a low interval and a high number of workers, then the scrapers start working while crawling happens, but the crawler rapidly consumes all memory because it's crawling so fast, e.g. interval: 10, workers: 50.

If I slow things down, then the scrapers don't even start working until all the memory is consumed, e.g. interval: 200, workers: 5.

I'm going to try out different config options, as well as a much, much larger box that I'll just create and destroy as I need it, instead of keeping it up all the time. I'll update this thread with what I find works best for my use case; it might help others as well.

robvolk commented 6 years ago

It looks like even after the crawler has run, the process is still consuming a ton of memory. Is it possible that some of the crawler tasks don't dispose of the pages they load into memory?
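One way to see where the memory sits is standard BEAM introspection (not Crawler-specific); a large :binary figure would suggest page bodies are still being referenced somewhere:

```elixir
# Totals in bytes for the whole node, processes, and refc binaries.
:erlang.memory() |> Keyword.take([:total, :processes, :binary])
```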

robvolk commented 6 years ago

I got the crawler to run using a nice big 64GB box. After a while I hit the Erlang process limit. I increased the limit ridiculously high, then hit another process limit. Have you run into this process limit?
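For anyone following along, the limit in question is the BEAM's process limit (262144 by default on recent OTP releases). It can be raised with the +P emulator flag and inspected at runtime; a sketch, assuming a Mix project:

```elixir
# Raise the limit when starting the node, e.g.:
#   elixir --erl "+P 5000000" -S mix run --no-halt
#
# Check the current limit and live process count:
:erlang.system_info(:process_limit)
:erlang.system_info(:process_count)
```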

fredwu commented 6 years ago

No, I haven't tried it on any large websites. Would you mind sharing or sending me (ifredwu at gmail) the site you're having trouble with? Crawler might be holding onto processes longer than necessary.

jurgenwerk commented 6 years ago

Hey @robvolk and @fredwu, any update on this issue? Were you able to fix it? We're looking for a scraper capable of scraping larger sites and I'm wondering if this library can handle it.

robvolk commented 6 years ago

I spoke with @fredwu about it, and the short answer is that in its current state it can't scrape large sites. It needs some way of throttling the workers, like a queue of sorts. Right now it launches a new process for every link it finds, recursively, so you can see how you'd get to millions of processes very quickly.

It probably wouldn't take a ton of effort to retool it though, especially with the new queue & task features that Elixir has added to the language. I don't have time to implement a fix, but would be more than happy to chat about the problem if you're able to invest the time to implement it. @matixmatix
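As a sketch of the general idea (not Crawler's implementation), the standard library's Task.async_stream already provides this kind of bounded concurrency; fetch_and_scrape/1 below is a hypothetical placeholder for the real per-page work:

```elixir
defmodule ThrottledCrawlSketch do
  # Only max_concurrency tasks are in flight at a time, so a deep crawl
  # cannot fan out into millions of simultaneous processes.
  def crawl(links) do
    links
    |> Task.async_stream(&fetch_and_scrape/1,
      max_concurrency: 50,
      timeout: 30_000,
      on_timeout: :kill_task
    )
    |> Stream.run()
  end

  defp fetch_and_scrape(_url), do: :ok
end
```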

rhnonose commented 6 years ago

I'm also interested in solving this issue. I'll have time this following week to look into it and come up with something.

@robvolk do you mind giving some direction? I'm using the crawler but just skimmed through the architecture, not too deep.

robvolk commented 6 years ago

Hi @rhnonose, I'm happy to help. I think that it needs some kind of task queuing system so the crawler doesn't try to process every single link it finds immediately. If a site has 30 links and you go 5 levels deep, it'll spin up 24M processes. That's why it's failing - the Erlang default is under 1M processes.
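The arithmetic behind that estimate, as a quick worked example:

```elixir
# Back-of-the-envelope for the fan-out described above:
# 30 links per page, followed recursively for 5 levels.
links_per_page = 30
depth = 5
trunc(:math.pow(links_per_page, depth))
#=> 24_300_000, well past the default BEAM process limit of 262_144
```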

Elixir has some new Task libraries to help, but I'm not sure if it can do queueing natively yet. It might be easier to hop on a call to hash it out - I'll shoot you an email.

rhnonose commented 6 years ago

Thanks @robvolk, looking forward to it.

fredwu commented 6 years ago

Hmm, we do use a queue already; crawling itself enqueues the task: https://github.com/fredwu/crawler/blob/1fb5bfdc67f973b3ff0118d9475bc50f1d6d7b71/lib/crawler.ex#L27-L33
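Roughly, the flow being referred to looks like the sketch below: a loose paraphrase for the discussion, not the exact source. crawl/2 does not fetch the page itself; it normalises its options and hands the URL to the queue handler, whose workers fetch it later.

```elixir
defmodule CrawlFlowSketch do
  # Paraphrase of the entry point: merge options, attach the URL, enqueue.
  def crawl(url, opts \\ []) do
    opts
    |> Enum.into(%{})
    |> Map.put(:url, url)
    |> Crawler.QueueHandler.enqueue()
  end
end
```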

The issue could be a bug somewhere in the system due to the crawling strategy - Crawler crawls pages based on what it sees, rather than based on page depth. Or it could be a bug in the queuing system.

Just some food for thought, as I haven't had much chance to look into this after I wrote Crawler initially.

rhnonose commented 6 years ago

@fredwu but from what I've read in the code, every newly found link calls Crawler.crawl, and OPQ.init starts a new process - am I wrong? So from what I understood, it's starting a new "queue" for every link.

rhnonose commented 6 years ago

My bad, it calls OPQ.enqueue.

fredwu commented 6 years ago

That's correct - it adds the link to the queue to be processed. The queue itself can be controlled/configured using :workers and :interval: https://github.com/fredwu/crawler#configurations
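For context, OPQ is the queue underneath those options. A standalone, hedged example of the same :workers / :interval knobs (check the OPQ docs for the exact API):

```elixir
# Two workers pulling from one queue, pausing one second between jobs;
# by default OPQ's worker simply runs the enqueued functions.
{:ok, opq} = OPQ.init(workers: 2, interval: 1_000)

OPQ.enqueue(opq, fn -> IO.puts("job 1") end)
OPQ.enqueue(opq, fn -> IO.puts("job 2") end)
```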

rhnonose commented 6 years ago

I might have a guess.

The crawler sets a worker in Crawler.QueueHandler, which is Crawler.Dispatcher.Worker.

Crawler.Dispatcher.Worker then calls Task.start_link, which ends up in Worker.run, where a GenServer is started for each run the worker handles.

Setting the interval really high and the workers to only two, then taking a look at the observer:

[observer screenshot showing many lingering worker processes]

Seems like there's a bunch of worker processes that should have died or been recycled.

That seems like the reason the process count is exploding. Thoughts, @fredwu @robvolk?
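For anyone trying to reproduce the observation, standard tooling is enough (the options shown are the ones discussed earlier in the thread):

```elixir
Crawler.crawl("http://example.com", workers: 2, interval: 60_000)

:erlang.system_info(:process_count)  # watch this figure keep growing
:observer.start()                    # inspect the lingering processes
```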

fredwu commented 1 year ago

Hey folks, it's been a loooong time. I'm picking up Crawler again, and have made this commit to improve the memory usage: https://github.com/fredwu/crawler/commit/66969f8b58f6e121606ce22f4348723becbe830b