framed-data / overseer

Overseer is a library for building and running data pipelines in Clojure.
Eclipse Public License 1.0
97 stars 10 forks source link

Ignore CAS state exceptions in heartbeat monitor #77

Closed elliot42 closed 8 years ago

elliot42 commented 8 years ago

Prior to this commit, the heartbeat monitor had some behavior design conflicts:

  1. The heartbeat monitor uses the compare-and-set transaction function to reset the status on heartbeat-dead jobs. This throws when an entity/attribute is not in the expected state, e.g. when someone else has already reset a job back to :unstarted
  2. All of the workers' heartbeat monitors are constantly racing each other, which means that in the case of a failed heartbeat, exactly one worker will succeed at resetting that job, and all of the other workers in the cluster will see the job in post-reset unexpected state, and throw.
  3. We currently have code that explicitly restarts the entire worker if there's ever any kind of exception in the heartbeat monitor.

This leads to a mess of errors and unnecessary restarts across the cluster, which pollutes the log and drags down cluster throughput since worst case all other N-1 workers are all restarting in the case of any failed heartbeat (only 1 workers wins at resetting the heartbeat, N-1 workers race, throw and restart).

Even worse, when a worker is force-restarted, the job that it was working on becomes and abandoned and eventually fails heartbeat itself, thus perpetuating the cycle.

This commit switches the heartbeat monitor to ignore CAS exceptions in the heartbeat monitor--if someone else has already reset the job, it's totally OK and the worker doesn't need to restart.

The code is fairly specifically scoped, so any other unexpected type of error will still restart the worker, but this specific case certainly should not restart the worker.

This should fix the spurious errors and restarts we're seeing across the cluster.

We still want the CAS in place, e.g. so that a heartbeat monitor cannot reset a job that has already :finished while the worker wasn't watching--however we just don't want the CAS to throw and reset the worker in this case, we just want to ignore and continue monitoring.

andrewberls commented 8 years ago

Comments inline. :+1: very insightful

elliot42 commented 8 years ago

Updated

andrewberls commented 8 years ago

Need Circle on your fork

elliot42 commented 8 years ago

Fiddling with that now

andrewberls commented 8 years ago

:+1:

elliot42 commented 8 years ago

I'll look around at if there are any log level standard definitions