Closed andrewberls closed 9 years ago
Merge conflicts?
:+1: thanks for sticking through this. Have you had a chance to empirically run a few jobs through this? Looks good, seems like it should have the right behaviors, looking forward to stuff like the vector clock removing some of the distributed dependencies.
Two random thoughts:
Yes, I've run jobs through this and verified the heartbeat behavior, and I plan to do so again prior to release. You're correct re: the schema; we'll perhaps need to be careful, since we still have the old schema in intake-mixpanel. And excellent point on the CAS violation: I think a process restart is okay in that scenario, since that does seem like a basic invariant violation.
@elliot42 Final change: made reserve-job also include an initial heartbeat value, to avoid a dangling window where a job could die before its first heartbeat. Now every job starts with one.
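A minimal sketch of that change, assuming a compare-and-set primitive on the job store (all names here, including `compare_and_set` and the field names, are hypothetical illustrations, not the project's actual API):

```python
import time


def now_millis():
    """Current Unix timestamp in milliseconds, per this worker's clock."""
    return int(time.time() * 1000)


def reserve_job(db, job_id, worker_id):
    """Reserve a job and stamp an initial heartbeat in the same CAS write,
    so a job that dies before its heartbeat loop ever runs still carries a
    recent heartbeat instead of dangling with none at all."""
    return db.compare_and_set(
        job_id,
        expected={"status": "unstarted"},
        update={
            "status": "started",
            "worker": worker_id,
            "heartbeat": now_millis(),  # initial heartbeat, written at reserve time
        },
    )
```

Because the reservation and the first heartbeat land in one atomic write, the monitor never observes a started job with no heartbeat to measure against.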
This is an experimental pass at adding "heartbeats", which are an attempt to prevent very long jobs from monopolizing the cluster.
The heartbeat process is a concurrently-running process, like the ready job detector or the (now-removed) job completion supervisor. Every 10 seconds, the process saves a heartbeat attribute to the database (the current Unix timestamp, in milliseconds). Every worker is also configured to act as a monitor, which periodically finds jobs that have missed more than a certain number of heartbeats and resets their status to unstarted. Since heartbeats are timestamps relative to each worker's own notion of 'now', we are vulnerable to a certain amount of clock drift between nodes. For now, the hope is that the tolerance figure will absorb this.
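The two halves of that loop can be sketched as follows. This is only an illustration under stated assumptions: the interval and tolerance values, `save_heartbeat`, and the helper names are hypothetical, not the project's real API.

```python
import threading
import time

HEARTBEAT_INTERVAL_SECS = 10       # write period from the description above
MISSED_HEARTBEAT_TOLERANCE = 3     # hypothetical: allowed missed beats before reset


def now_millis():
    """Current Unix timestamp in milliseconds, per this worker's clock."""
    return int(time.time() * 1000)


def heartbeat_loop(db, job_id, stop_event: threading.Event):
    """Every 10 seconds, persist this worker's notion of 'now' for the job."""
    while not stop_event.wait(HEARTBEAT_INTERVAL_SECS):
        db.save_heartbeat(job_id, now_millis())  # hypothetical DB call


def job_is_stale(last_heartbeat_ms):
    """True if the job has missed more heartbeats than the tolerance allows.
    The comparison mixes two workers' clocks, so the tolerance also has to
    absorb clock drift between nodes."""
    age_ms = now_millis() - last_heartbeat_ms
    return age_ms > MISSED_HEARTBEAT_TOLERANCE * HEARTBEAT_INTERVAL_SECS * 1000
```

With these numbers, a job is considered stale once its last heartbeat is more than 30 seconds old; widening the tolerance trades slower recovery for more resilience to drift and pauses.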
Note: A significant change is that started jobs are no longer considered eligible to run! Instead, the heartbeat mechanism is now the primary means by which interrupted jobs or failed workers are recovered.
Consider the following log sample:
Here we can see that node X has started job
56370580-bbb8-4abb-866b-0421a74e6531
and performed a heartbeat before being forcibly shut down, leaving the job dangling. Some time later, the monitor running on node Y, while working on a separate job of its own, notices that X's job has missed more than the acceptable number of heartbeats and resets it:
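The monitor sweep that node Y performs here can be sketched as a single pass over started jobs. The tolerance value and the `started_jobs`/`reset_to_unstarted` calls are hypothetical stand-ins for whatever the job store actually exposes:

```python
import time

HEARTBEAT_TOLERANCE_MS = 30_000  # hypothetical: three missed 10-second beats


def monitor_pass(db):
    """One monitor sweep: reset any started job whose last heartbeat is older
    than the tolerance back to unstarted, so another worker can pick it up.
    Returns the ids of the jobs that were reset."""
    now_ms = int(time.time() * 1000)
    reset = []
    for job in db.started_jobs():
        if now_ms - job["heartbeat"] > HEARTBEAT_TOLERANCE_MS:
            db.reset_to_unstarted(job["id"])
            reset.append(job["id"])
    return reset
```

In the scenario above, X's job would show up in this sweep on node Y once its last heartbeat aged past the tolerance, and would be flipped back to unstarted for any worker to claim.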