Prior to this commit, the heartbeat monitor had some behavior
design conflicts:
The heartbeat monitor uses the compare-and-set transaction function
to reset the status on heartbeat-dead jobs. This throws when an
entity/attribute is not in the expected state, e.g. when someone
else has already reset a job back to :unstarted
All of the workers' heartbeat monitors are constantly racing
each other, which means that in the case of a failed heartbeat,
exactly one worker will succeed at resetting that job, and
all of the other workers in the cluster will see the job in
post-reset unexpected state, and throw.
We currently have code that explicitly restarts the entire worker
if there's ever any kind of exception in the heartbeat monitor.
This leads to a mess of errors and unnecessary restarts across the
cluster, which pollutes the log and drags down cluster throughput
since worst case all other N-1 workers are all restarting in the case of
any failed heartbeat (only 1 workers wins at resetting the heartbeat,
N-1 workers race, throw and restart).
Even worse, when a worker is force-restarted, the job that it was
working on becomes and abandoned and eventually fails heartbeat
itself, thus perpetuating the cycle.
This commit switches the heartbeat monitor to ignore CAS exceptions in
the heartbeat monitor--if someone else has already reset the job,
it's totally OK and the worker doesn't need to restart.
The code is fairly specifically scoped, so any other unexpected type
of error will still restart the worker, but this specific case
certainly should not restart the worker.
This should fix the spurious errors and restarts we're seeing across
the cluster.
We still want the CAS in place, e.g. so that a heartbeat monitor
cannot reset a job that has already :finished while the worker wasn't
watching--however we just don't want the CAS to throw and reset the
worker in this case, we just want to ignore and continue monitoring.
Prior to this commit, the heartbeat monitor had some behavior design conflicts:
This leads to a mess of errors and unnecessary restarts across the cluster, which pollutes the log and drags down cluster throughput since worst case all other N-1 workers are all restarting in the case of any failed heartbeat (only 1 workers wins at resetting the heartbeat, N-1 workers race, throw and restart).
Even worse, when a worker is force-restarted, the job that it was working on becomes and abandoned and eventually fails heartbeat itself, thus perpetuating the cycle.
This commit switches the heartbeat monitor to ignore CAS exceptions in the heartbeat monitor--if someone else has already reset the job, it's totally OK and the worker doesn't need to restart.
The code is fairly specifically scoped, so any other unexpected type of error will still restart the worker, but this specific case certainly should not restart the worker.
This should fix the spurious errors and restarts we're seeing across the cluster.
We still want the CAS in place, e.g. so that a heartbeat monitor cannot reset a job that has already :finished while the worker wasn't watching--however we just don't want the CAS to throw and reset the worker in this case, we just want to ignore and continue monitoring.