Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications

Client Should Not Retry 400 #698

Closed · kmg-stripe closed this 3 months ago

kmg-stripe commented 3 months ago

Context

We noticed that workers will retry a 400 response from the master. This should fail fast with an error instead.
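For illustration, here is a minimal sketch of the behavior being requested, written against plain `java.net.http.HttpClient` rather than Mantis's actual worker-to-master client (the class, method, and endpoint names are hypothetical): 4xx responses are treated as non-retryable, while 5xx and other transient failures still back off and retry.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch, not Mantis code: just the retry classification being proposed.
public final class AssignmentClientSketch {

    private static final int MAX_ATTEMPTS = 5;

    static HttpResponse<String> fetchAssignment(HttpClient client, URI assignmentUri)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder(assignmentUri)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            int status = response.statusCode();

            if (status >= 200 && status < 300) {
                return response;                      // success
            }
            if (status >= 400 && status < 500) {
                // Client errors (400, 404, ...) won't succeed on retry:
                // fail fast instead of hammering the master and spamming the logs.
                throw new IllegalStateException(
                        "Non-retryable " + status + " from master for " + assignmentUri);
            }
            Thread.sleep(1000L * attempt);            // 5xx/transient: back off and retry
        }
        throw new IllegalStateException("Master unavailable after " + MAX_ATTEMPTS + " attempts");
    }
}
```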


github-actions[bot] commented 3 months ago

Test Results

535 tests ±0: 529 passed ±0, 6 skipped ±0, 0 failed ±0
139 suites ±0, 139 files ±0, run time 7m 48s (+5s)

Results for commit 56b53655. ± Comparison against base commit a61cf1dd.

Andyz26 commented 3 months ago

@kmg-stripe, in which case do you see the control plane throwing bad request results for this call?

kmg-stripe commented 3 months ago

@Andyz26 - full disclosure, this arose while one of our clusters was in a really bad state due to a few DynamoDB provider bugs (they are now fixed).

The issue was that the leader would discover all jobs but could not discover most job workers on start-up, which led to excessive re-submissions. Since a new leader is elected every time we deploy to masters, this was happening pretty often. Our theory is that old agents would ask the leader about assignments, and the leader was returning 400 because of the state inconsistency. The 400 itself wasn't a huge issue, but it was spamming the logs, and we would expect the agents to fail fast instead of retrying.

Andyz26 commented 3 months ago

My main concern is that since this is an API used in many places, early termination on an unexpected control plane result might cause workers to exit/restart while the control plane is in a transient bad state, disrupting critical jobs that can currently survive such problems. Would it be cleaner to have the control plane emit not_found when it actually means to kill the worker in such a case?
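To make the trade-off concrete, here is a minimal sketch (hypothetical names, not Mantis's actual client code) of the worker-side policy this suggests: only an explicit 404 tells the worker to stop, while anything else, including a spurious 400 from a leader in a bad state, is treated as transient and retried.

```java
// Hypothetical sketch of the worker-side policy suggested above; not Mantis's actual client.
public final class WorkerRetryPolicySketch {

    enum ClientAction { PROCEED, TERMINATE_WORKER, RETRY_WITH_BACKOFF }

    static ClientAction classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return ClientAction.PROCEED;
        }
        if (httpStatus == 404) {
            // The control plane explicitly says there is no assignment for this worker: stop.
            return ClientAction.TERMINATE_WORKER;
        }
        // Anything else (5xx, or even a spurious 400 from a leader in a bad state)
        // is treated as transient so long-running jobs survive control-plane hiccups.
        return ClientAction.RETRY_WITH_BACKOFF;
    }
}
```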

kmg-stripe commented 3 months ago

Ah, understood! Let me find the offending 400 and we can change it to a 404, instead of handling the 400 in the client.
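Roughly, the server-side change being agreed on amounts to distinguishing "we don't know this worker" from "the request itself is malformed". A minimal sketch with hypothetical types (the real response mapping lives in BaseRoute and JobClustersManagerActor, linked below, and is Akka-based):

```java
// Hypothetical sketch of the status mapping; not the actual Mantis route code.
public final class AssignmentStatusSketch {

    enum LookupOutcome { FOUND, UNKNOWN_WORKER, MALFORMED_REQUEST, INTERNAL_ERROR }

    static int toHttpStatus(LookupOutcome outcome) {
        switch (outcome) {
            case FOUND:             return 200;
            case UNKNOWN_WORKER:    return 404; // not_found: the worker should stop, not retry
            case MALFORMED_REQUEST: return 400; // genuinely bad request from the caller
            case INTERNAL_ERROR:    return 500; // transient server-side trouble: safe to retry
            default: throw new IllegalArgumentException(outcome.name());
        }
    }
}
```

With a mapping like that in place, the existing client retry behavior can stay as-is: a 404 becomes the unambiguous signal that the worker has nothing to run.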

kmg-stripe commented 3 months ago

@Andyz26 Given the state our clusters were in and your comment about transient issues with master, I can close this.

For completeness, here is what I think happened:

https://github.com/Netflix/mantis/blob/a2934008e3976209673c5d1239fb2d5ea4924f91/mantis-control-plane/mantis-control-plane-server/src/main/java/io/mantisrx/master/JobClustersManagerActor.java#L736

https://github.com/Netflix/mantis/blob/a2934008e3976209673c5d1239fb2d5ea4924f91/mantis-control-plane/mantis-control-plane-server/src/main/java/io/mantisrx/master/api/akka/route/v0/BaseRoute.java#L58