Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications
Apache License 2.0

Resubmit Worker Allocations #725

Closed kmg-stripe closed 2 weeks ago

kmg-stripe commented 2 weeks ago

A recent change removed worker resubmits when workers are stuck in accepted: https://github.com/Netflix/mantis/pull/719

Our underlying scheduler will not retry the allocations, so we need a way to conditionally enable the ability to resubmit.
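As a rough illustration of what "conditionally enable" could mean here, the sketch below gates the resubmit decision behind a boolean flag plus a timeout. It is not taken from this PR or from the Mantis codebase: `ResubmitPolicy`, `WorkerState`, `resubmitEnabled`, and `acceptedTimeout` are all hypothetical names chosen for the example.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical worker states; only ACCEPTED matters for this sketch.
enum WorkerState { ACCEPTED, LAUNCHED, STARTED }

// Hypothetical policy object (not a Mantis class): decides whether a
// worker that has not progressed should have its allocation resubmitted.
final class ResubmitPolicy {
    private final boolean resubmitEnabled;  // assumed config flag
    private final Duration acceptedTimeout; // assumed "stuck in accepted" threshold

    ResubmitPolicy(boolean resubmitEnabled, Duration acceptedTimeout) {
        this.resubmitEnabled = resubmitEnabled;
        this.acceptedTimeout = acceptedTimeout;
    }

    // Returns true only when the flag is on, the worker is still ACCEPTED,
    // and it has been in that state for longer than the timeout.
    boolean shouldResubmit(WorkerState state, Instant acceptedAt, Instant now) {
        if (!resubmitEnabled) {
            return false; // deployments whose scheduler retries on its own opt out
        }
        if (state != WorkerState.ACCEPTED) {
            return false; // worker has already moved past the accepted state
        }
        return Duration.between(acceptedAt, now).compareTo(acceptedTimeout) > 0;
    }
}
```

With the flag off, the behavior matches the post-#719 state (no resubmits); with it on, only workers that sit in accepted past the timeout are resubmitted.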

Context

I did not see unit tests for the functionality that was removed. I'd be happy to add them, but would like to get this merged first, since we had to pin master to a different version than the agents to avoid stuck workers.

kmg-stripe commented 2 weeks ago

cc: @Andyz26

github-actions[bot] commented 2 weeks ago

Test Results

615 tests ±0    605 ✅ ±0    8m 4s ⏱️ ±0s
142 suites ±0    10 💤 ±0
142 files ±0    0 ❌ ±0

Results for commit b59503ce. ± Comparison against base commit a5874b20.

Andyz26 commented 2 weeks ago

@kmg-stripe this lgtm. Just curious: what was the error/root cause you saw that left the workers stuck in accepted? For us, that usually happens when the new job artifact contains bugs or fails to init, so a retry doesn't help in those cases (and sometimes it pollutes our agent pool, e.g. the error causes frequent agent crashes or fills the disk).

kmg-stripe commented 2 weeks ago

@Andyz26 thanks! Yup, this was it. On our end, this was triggered by slowness in the underlying ASG when spinning up new instances. It is an internal limitation we hope to fix soon, but we need to tolerate it as "expected behavior" for now.