Closed kmg-stripe closed 2 weeks ago
cc: @Andyz26
615 tests ±0, 605 passed ±0, 10 skipped ±0, 0 failed ±0; 142 suites ±0, 142 files ±0; 8m 4s ±0s
Results for commit b59503ce. ± Comparison against base commit a5874b20.
@kmg-stripe this LGTM. Just curious: what error or root cause did you see that left the workers stuck in accepted? For us, that usually happens when the new job artifact contains bugs or fails to init, so a retry doesn't help in those cases (and sometimes it pollutes our agent pool, e.g. the error causes frequent agent crashes or fills the disk).
@Andyz26 thanks! Yup, this was it. On our end, this was triggered by slowness in the underlying ASG when spinning up new instances. It is an internal limitation we hope to fix soon, but we need to tolerate it as "expected behavior" for now.
A recent change removed worker resubmits when workers are stuck in accepted: https://github.com/Netflix/mantis/pull/719
Our underlying scheduler will not retry the allocations, so we need a way to conditionally enable the ability to resubmit.
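As a rough illustration of what "conditionally enable" could look like, here is a minimal sketch. It assumes a boolean setting and a timeout for how long a worker may sit in accepted. The class, enum, and parameter names are all hypothetical and are not taken from the Mantis codebase or from this PR's diff.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch only: these names do not come from Mantis.
final class AcceptedWorkerPolicy {
    enum Action { WAIT, RESUBMIT }

    private final boolean resubmitEnabled;   // assumed opt-in flag, off by default upstream
    private final Duration acceptedTimeout;  // assumed max time a worker may stay in accepted

    AcceptedWorkerPolicy(boolean resubmitEnabled, Duration acceptedTimeout) {
        this.resubmitEnabled = resubmitEnabled;
        this.acceptedTimeout = acceptedTimeout;
    }

    // Resubmit only when the worker has exceeded the timeout AND the flag is on.
    // With the flag off, a stuck worker is left alone.
    Action decide(Instant acceptedAt, Instant now) {
        boolean stuck = Duration.between(acceptedAt, now).compareTo(acceptedTimeout) > 0;
        return (stuck && resubmitEnabled) ? Action.RESUBMIT : Action.WAIT;
    }
}
```

With this shape, deployments whose scheduler does not retry allocations could turn the flag on, while deployments where a stuck worker usually signals a bad artifact could leave it off and avoid retry loops.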
Context
I did not see unit tests for the functionality that was removed. I'd be happy to add them, but would like to get this merged first, since we had to pin master to a different version than the agents to avoid stuck workers.
Checklist
./gradlew build compiles code correctly
./gradlew test passes all tests