fzaninotto / screenshot-as-a-service

Website screenshot service powered by node.js and phantomjs

Screenshot service continues to kill and restart phantom in many-request scenarios #47

Open jaydiablo opened 10 years ago

jaydiablo commented 10 years ago

We recently ran into a situation where we're sending thousands of "jobs" that hit this screenshot service one after another (in our tests we were actually hitting the service with two different workers, so a maximum of 2 simultaneous requests would be coming in, but they send a new request as soon as the previous one finishes).

Pretty reliably, phantom will start to consume too much memory and crash (we think we can easily resolve this by making more resources available to this process, but that ultimately won't fix the issue we're seeing). When phantom crashes, the screenshot service restarts it on the next request. However, if a request comes in before phantom is "ready" to accept new requests, the screenshot service receives an ECONNREFUSED error from phantom, which triggers the restart process all over again. This ends up causing phantom to be killed and restarted every time a request comes in after the initial crash. In our case, this basically clears out our job queue, failing every request because phantom isn't given enough time to start up before the screenshot service kills and restarts it.

I have been testing a small fix that will at least give phantom some time to restart before the kill/start process is initiated, but wanted to solicit some input from the community on how to best deal with this.

Ideally, it seems that the request that kicks off the restart of phantom should wait until phantom is ready before returning the response, and any requests that come in during this time should also block until phantom is ready. I didn't see an obvious way to tell if phantom is "ready", other than receiving a successful rasterize response from it. Is there some sort of callback we can use, or an event we can listen for, to determine if phantom is started and ready for requests? Should the restart request loop on an interval until it can get a successful response from phantom? (This raises the question: what if phantom never starts? Is this request stuck in a loop?)
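To make the "block until ready" idea concrete, here is a rough sketch of how it could look. It is only an illustration under assumptions: `waitUntilReady`, `whenReady`, and the `probe` callback are hypothetical names that do not exist in the service, and `probe` stands in for whatever cheap request would tell us phantom is answering. The attempt limit is there so a phantom that never starts produces an error instead of an endless loop.

```javascript
// Hypothetical helper: poll `probe` until it reports ready or attempts run out.
// `probe` is any function returning a Promise<boolean>; how it talks to
// phantom is deliberately left out of this sketch.
function waitUntilReady(probe, maxAttempts, delayMs) {
  let attempt = 0;
  function tryOnce() {
    attempt += 1;
    return Promise.resolve()
      .then(probe)
      .catch(() => false) // e.g. ECONNREFUSED just means "not ready yet"
      .then((ready) => {
        if (ready) return attempt; // resolve with the number of attempts used
        if (attempt >= maxAttempts) {
          throw new Error('phantom not ready after ' + maxAttempts + ' attempts');
        }
        // Wait, then probe again.
        return new Promise((resolve) => setTimeout(resolve, delayMs)).then(tryOnce);
      });
  }
  return tryOnce();
}

// Requests that arrive during a restart share one in-flight wait
// instead of each starting their own polling loop.
let pendingReady = null;
function whenReady(probe, maxAttempts, delayMs) {
  if (!pendingReady) {
    pendingReady = waitUntilReady(probe, maxAttempts, delayMs);
    const clear = () => { pendingReady = null; };
    pendingReady.then(clear, clear); // reset once the wait settles either way
  }
  return pendingReady;
}
```

A request handler would call `whenReady(...)` and only forward the screenshot request once the returned promise resolves. If it rejects, the handler can return an error response as the service does today.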

My "fix" that I'm currently testing takes a slightly different approach, more in tune with how the screenshot service acts now. The restart of phantom is kicked off, but the restarting request returns an error response (as it does now). However, it also sets a variable that prevents any subsequent requests from killing and restarting phantom again; while this variable is set, they just return an error response as well. Once phantom is up and returns a valid response (or the pingcheck comes back successfully), the variable is cleared in case phantom needs to be restarted in the future. So when hitting the screenshot service with thousands of back-to-back requests, a handful of them may fail while phantom is down, but the service recovers once phantom is back up rather than returning an error response for every subsequent request.
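Roughly, the guard described above behaves like the sketch below. This is not the actual patch: `RestartGuard`, `handleResult`, and `restartPhantom` are made-up names for illustration, and `restartPhantom` stands in for whatever the service does to kill and respawn phantom.

```javascript
// Hypothetical guard; names are illustrative, not taken from the service.
class RestartGuard {
  constructor() {
    this.restarting = false;
  }

  // Called when a request to phantom fails. Returns true only for the one
  // caller that should kill and restart phantom.
  onFailure() {
    if (this.restarting) return false; // a restart is already under way
    this.restarting = true;
    return true;
  }

  // Called on a valid response or a successful pingcheck, so a later
  // crash can trigger a fresh restart.
  onSuccess() {
    this.restarting = false;
  }
}

// Example wiring for a single request outcome.
function handleResult(guard, succeeded, restartPhantom) {
  if (succeeded) {
    guard.onSuccess();
    return 'ok';
  }
  if (guard.onFailure()) {
    restartPhantom();
  }
  return 'error'; // failed requests still get an error response
}
```

With this in place, repeated failures while phantom is starting trigger only one restart, and the next success clears the flag.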

Note that I don't see the issue mentioned in #32, as the phantom process is definitely killed and restarted. If I watch the output from ps, I see the process ID of the phantom process change every time ps is updated (and there is only ever one process, or none, hanging around).