Closed: kmg-stripe closed this 3 months ago
535 tests ±0 529 :white_check_mark: ±0 7m 48s :stopwatch: +5s 139 suites ±0 6 :zzz: ±0 139 files ±0 0 :x: ±0
Results for commit 56b53655. ± Comparison against base commit a61cf1dd.
@kmg-stripe, in which case do you see the control plane throwing bad request results for this call?
@Andyz26 - full disclosure, this arose while one of our clusters was in a really bad state due to a few dynamodb provider bugs (they are now fixed).

the issue was that the leader would discover all jobs, but could not discover most job workers on start-up, which would lead to excessive re-submissions. since a new leader is elected every time we deploy to masters, this was happening pretty often. our theory was that old agents would ask the leader about assignments, and the leader was returning `400` because of the state inconsistency. the `400` itself wasn't a huge issue, but it was spamming the logs, and we would expect the agents to fail fast instead of retrying.
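A minimal sketch of the fail-fast handling we'd expect on the agent side, assuming a plain `java.net.http` client — the class and method names are illustrative, not the actual Mantis worker code:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative only: the real Mantis agent/worker client classes differ.
public class AssignmentClientSketch {
    private final HttpClient http = HttpClient.newHttpClient();

    /** Fetches assignments, failing fast on any 4xx instead of retrying. */
    public String fetchAssignments(URI leaderUri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(leaderUri).GET().build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        int status = response.statusCode();
        if (status >= 400 && status < 500) {
            // The leader rejected the request outright; retrying just spams the
            // logs (the situation described above), so surface a terminal error.
            throw new IllegalStateException("leader rejected assignment request: HTTP " + status);
        }
        if (status >= 500) {
            // Server-side trouble is plausibly transient; callers may retry this.
            throw new IOException("transient leader error: HTTP " + status);
        }
        return response.body();
    }
}
```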
my main concern is that since this is an API used in many places, early termination on an unexpected control plane result might cause workers to exit/restart while the control plane is in a transient bad state, disrupting critical jobs that currently survive such problems. in that case, would it be cleaner to have the control plane emit `not_found` when it actually means to kill the worker?
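A sketch of the split being suggested here, using hypothetical names rather than the real Mantis client classes: only an explicit `404` terminates the worker, and everything else stays on the existing retry path so running jobs survive a transiently unhealthy control plane.

```java
// Hypothetical names (DiscoveryResponsePolicy, decide), not the actual Mantis
// client code; the point is the split between a terminal 404 and everything else.
public final class DiscoveryResponsePolicy {
    public enum Action { TERMINATE_WORKER, RETRY_WITH_BACKOFF }

    public static Action decide(int httpStatus) {
        if (httpStatus == 404) {
            // The control plane affirmatively says this worker/job no longer exists.
            return Action.TERMINATE_WORKER;
        }
        // 400s from a confused leader, 5xx, timeouts, etc. are treated as
        // transient: keep retrying rather than disrupt a running job.
        return Action.RETRY_WITH_BACKOFF;
    }
}
```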
Ah, understood! Let me find the offending `400` and we can change it to a `404`, instead of handling the `400` in the client.
@Andyz26 Given the state our clusters were in and your comment about transient issues with master, I can close this.
For completeness, here is what I think happened:
Due to the weird state of our clusters, we were hitting "Failed to get Sched info stream info":
https://github.com/Netflix/mantis/blob/a2934008e3976209673c5d1239fb2d5ea4924f91/mantis-control-plane/mantis-control-plane-server/src/main/java/io/mantisrx/master/api/akka/route/handlers/JobDiscoveryRouteHandlerAkkaImpl.java#L134
It looks like this results in `CLIENT_ERROR_NOT_FOUND`, which is then transformed into `BAD_REQUEST`.
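If the fix lands on the control-plane side, it would amount to mapping the internal not-found code to HTTP 404 instead of folding it into 400. A hedged sketch, with an illustrative stand-in enum rather than the real response types:

```java
// Illustrative stand-ins for the control plane's internal response codes; the
// real translation lives in the route-handling code linked above.
public final class StatusMappingSketch {
    enum ResponseCode { SUCCESS, CLIENT_ERROR, CLIENT_ERROR_NOT_FOUND, SERVER_ERROR }

    static int toHttpStatus(ResponseCode code) {
        switch (code) {
            case SUCCESS:                return 200;
            case CLIENT_ERROR_NOT_FOUND: return 404; // lets workers fail fast
            case CLIENT_ERROR:           return 400; // genuinely malformed request
            default:                     return 500;
        }
    }
}
```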
Context

We noticed that workers will retry `400` from the master. This should fail fast with an error.

Checklist

- `./gradlew build` compiles code correctly
- `./gradlew test` passes all tests