Yep, this is a massive undertaking, and has been avoided for a long time now.
Per the Resque spec, errors/crashes from the worker should move the job to the resque:error queue. However, the spec has us "popping" the job out of the work queues, so that while the job is being worked, the worker's status (which contains a copy of the original job payload) is the only place that job resides.
The domain error catching mechanism of Node allows us to handle uncaught exceptions (and unhandled promise rejections) and elegantly place the job into the error queue when something goes wrong.
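For context, here is a minimal sketch of that domain-based pattern; `workJob`, `moveToErrorQueue`, and the sample payload are placeholder stand-ins for illustration, not node-resque internals.

```js
// Minimal sketch of the domain-based approach, not node-resque's actual code.
const domain = require('domain'); // deprecated, but still shipped with Node

function moveToErrorQueue(job, error) {
  // stand-in for pushing the payload onto resque:error in Redis
  console.error('failing job', job, error.message);
}

function workJob(job) {
  // stand-in for a job that fails asynchronously, where a plain try/catch
  // around the call would not see the error
  setImmediate(() => {
    throw new Error('boom');
  });
}

function perform(job) {
  const d = domain.create();

  // Anything thrown inside d.run(), even from async callbacks on later ticks,
  // lands here, so the payload is still in hand and can be parked in the error queue.
  d.on('error', (error) => moveToErrorQueue(job, error));

  d.run(() => workJob(job));
}

perform({ class: 'sendEmail', queue: 'default', args: ['hello'] });
```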
Without this, we'll need to take one of the following approaches:
1) Single Process onError. If we force only one resque worker to exist per process (which removes the possibility of multiworker), we can hook into the process's exit event to check whether there was a job we thought we were working on. This will create havoc with any higher-level process management.
2) Forking. Like the Ruby Resque implementation, we could have a "leader" process which gets the job from Redis, and a "follower" process which actually works the job. The "leader" can watch for "follower" failures and act accordingly (a rough sketch of this follows the list). This will be slower and consume more RAM, and it will also create havoc with any higher-level process management.
3) Timing. We could implement a "maxJobWorkingTime". If a worker crashes, the payload of the job is stored in Redis. After a certain amount of time, another worker can check Redis for "all workers which started working over 24 hours ago", note them as failed, and clean them up. This won't cause the process issues the other solutions would, but it creates a very long cycle time before noticing anything is wrong. It also breaks any retry/backoff plugins.
4) We could just not care. If your process crashes/creates an exception, the job is lost forever. I hope you have good logs!
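For illustration, a minimal sketch of the leader/follower idea from option 2, assuming a hypothetical `./follower.js` script that runs the job and exits non-zero on failure; none of these names are node-resque APIs.

```js
// Hypothetical sketch of the leader/follower model (option 2).
const { fork } = require('child_process');

function workViaFollower(job, done) {
  const follower = fork('./follower.js'); // assumed script that actually runs the job

  follower.on('exit', (code) => {
    if (code === 0) {
      done(); // job finished; the leader can safely ack it
    } else {
      // follower crashed; the leader still holds the payload and can push it to resque:error
      done(new Error(`follower exited with code ${code}`));
    }
  });

  // The payload is serialized into the child process, which is where the extra
  // RAM and the "pass the job by value" cost discussed below come from.
  follower.send(job);
}
```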
I'm leaning to .4 right now (just don't care). The higher-level process which uses node-resque can handle errors in its own way. We can suggest that folks call the worker.fail(worker.job) method on exceptions, but that would be the parent application's job to implement.
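As a rough sketch of what that might look like in the parent application, assuming `worker` is an already-started node-resque worker and that `worker.fail`/`worker.job` behave as described above (exact arguments may vary by version):

```js
// Hypothetical sketch only; register this in the parent application (option 4).
function registerCrashHandler(worker) {
  process.on('uncaughtException', (error) => {
    console.error('uncaught exception while working a job:', error);

    if (worker.job) {
      worker.fail(worker.job); // try to park the in-flight payload as failed, per the suggestion above
    }

    // the process state is unknown after an uncaught exception, so bail out
    process.exit(1);
  });
}
```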
I see - it is an interesting problem. It is particularly amusing that the domain module was deprecated and the Node people still haven't come up with a replacement. :)
.1 and .3 suck ;)
.2 would be the way to do it if you want to guarantee that the spec is followed. It will be slower (and consume more memory) since you can only pass the job by value to the child process. "Slower" might not be important since in real life the job callback may take much longer in comparison, but the extra memory may be more problematic if there are huge jobs. Then again, there's nothing to stop one running a second set of processes on a different box in that situation.
.4 is for grown-ups - the burden is on the user to take responsibility for their own crashes. It is the option I would choose if I were writing the module for myself, since I am very much of the "crash on unexpected error" school of thought, but you're in the best position to know your own user base. And yes, I have excellent logging, thank you for asking. ;)
I'm with you! I'm actually quite worried about .2 because you can't guarantee things are connected properly. Say a job has to connect to MySQL and do something... it will be a real developer pain to build all their connection logic into an "onFork" block.
I'm going ahead with .4 over here -> https://github.com/taskrabbit/node-resque/pull/167
Will be solved via https://github.com/taskrabbit/node-resque/pull/167
https://nodejs.org/api/domain.html