Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications

Client Should Not Retry 400 #698

Closed · kmg-stripe closed this 3 months ago

kmg-stripe commented 3 months ago

Context

We noticed that workers will retry a 400 response from the master. This should fail fast with an error instead.
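For illustration, here is a minimal sketch of the behavior being requested, written against plain `java.net.http.HttpClient` rather than Mantis's actual worker-to-master client (the class, method, and endpoint names are hypothetical): 4xx responses are treated as non-retryable, while 5xx and other transient failures still back off and retry.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch, not Mantis code: just the retry classification being proposed.
public final class AssignmentClientSketch {

    private static final int MAX_ATTEMPTS = 5;

    static HttpResponse<String> fetchAssignment(HttpClient client, URI assignmentUri)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder(assignmentUri)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            int status = response.statusCode();

            if (status >= 200 && status < 300) {
                return response;                      // success
            }
            if (status >= 400 && status < 500) {
                // Client errors (400, 404, ...) won't succeed on retry:
                // fail fast instead of hammering the master and spamming the logs.
                throw new IllegalStateException(
                        "Non-retryable " + status + " from master for " + assignmentUri);
            }
            Thread.sleep(1000L * attempt);            // 5xx/transient: back off and retry
        }
        throw new IllegalStateException("Master unavailable after " + MAX_ATTEMPTS + " attempts");
    }
}
```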


github-actions[bot] commented 3 months ago

Test Results

535 tests ±0: 529 passed ±0, 6 skipped ±0, 0 failed ±0
139 suites ±0, 139 files ±0, run time 7m 48s (+5s)

Results for commit 56b53655. ± Comparison against base commit a61cf1dd.

Andyz26 commented 3 months ago

@kmg-stripe, in which case do you see the control plane throwing bad request results for this call?

kmg-stripe commented 3 months ago

@Andyz26 - full disclosure, this arose while one of our clusters was in a really bad state due to a few DynamoDB provider bugs (they are now fixed).

The issue was that the leader would discover all jobs but could not discover most job workers on start-up, which led to excessive re-submissions. Since a new leader is elected every time we deploy to masters, this was happening pretty often. Our theory is that old agents would ask the leader about assignments, and the leader was returning 400 because of the state inconsistency. The 400 itself wasn't a huge issue, but it was spamming the logs, and we would expect the agents to fail fast instead of retrying.

Andyz26 commented 3 months ago

My main concern is that since this is an API used in many places, early termination on an unexpected control plane result might cause workers to exit/restart while the control plane is in a transient bad state, disrupting critical jobs that can currently survive such problems. Would it be cleaner to have the control plane emit not_found when it actually means to kill the worker in such a case?
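To make the trade-off concrete, here is a minimal sketch (hypothetical names, not Mantis's actual client code) of the worker-side policy this suggests: only an explicit 404 tells the worker to stop, while anything else, including a spurious 400 from a leader in a bad state, is treated as transient and retried.

```java
// Hypothetical sketch of the worker-side policy suggested above; not Mantis's actual client.
public final class WorkerRetryPolicySketch {

    enum ClientAction { PROCEED, TERMINATE_WORKER, RETRY_WITH_BACKOFF }

    static ClientAction classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return ClientAction.PROCEED;
        }
        if (httpStatus == 404) {
            // The control plane explicitly says there is no assignment for this worker: stop.
            return ClientAction.TERMINATE_WORKER;
        }
        // Anything else (5xx, or even a spurious 400 from a leader in a bad state)
        // is treated as transient so long-running jobs survive control-plane hiccups.
        return ClientAction.RETRY_WITH_BACKOFF;
    }
}
```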

kmg-stripe commented 3 months ago

Ah, understood! Let me find the offending 400 and we can change it to a 404, instead of handling the 400 in the client.
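Roughly, the server-side change being agreed on amounts to distinguishing "we don't know this worker" from "the request itself is malformed". A minimal sketch with hypothetical types (the real response mapping lives in BaseRoute and JobClustersManagerActor, linked below, and is Akka-based):

```java
// Hypothetical sketch of the status mapping; not the actual Mantis route code.
public final class AssignmentStatusSketch {

    enum LookupOutcome { FOUND, UNKNOWN_WORKER, MALFORMED_REQUEST, INTERNAL_ERROR }

    static int toHttpStatus(LookupOutcome outcome) {
        switch (outcome) {
            case FOUND:             return 200;
            case UNKNOWN_WORKER:    return 404; // not_found: the worker should stop, not retry
            case MALFORMED_REQUEST: return 400; // genuinely bad request from the caller
            case INTERNAL_ERROR:    return 500; // transient server-side trouble: safe to retry
            default: throw new IllegalArgumentException(outcome.name());
        }
    }
}
```

With a mapping like that in place, the existing client retry behavior can stay as-is: a 404 becomes the unambiguous signal that the worker has nothing to run.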

kmg-stripe commented 3 months ago

@Andyz26 Given the state our clusters were in and your comment about transient issues with master, I can close this.

For completeness, here is what I think happened:

https://github.com/Netflix/mantis/blob/a2934008e3976209673c5d1239fb2d5ea4924f91/mantis-control-plane/mantis-control-plane-server/src/main/java/io/mantisrx/master/JobClustersManagerActor.java#L736

https://github.com/Netflix/mantis/blob/a2934008e3976209673c5d1239fb2d5ea4924f91/mantis-control-plane/mantis-control-plane-server/src/main/java/io/mantisrx/master/api/akka/route/v0/BaseRoute.java#L58