Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications
Apache License 2.0

Prevent per-node scheduling fallback during batch scheduling retry #719

Closed Andyz26 closed 1 month ago

Andyz26 commented 1 month ago

Context

Currently, when a new job is submitted, all of its workers are scheduled as a single batch so that allocation is all-or-nothing. However, the job actor's heartbeat check also tries to reschedule any worker that has been "stuck" in the allocation phase for too long (based on the worker-heartbeat-timeout config). As a result, the batch scheduling is invalidated once that timeout elapses, which can be problematic for larger jobs that need more time for the cluster autoscaler to allocate the requested resources.
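To make the interaction concrete, here is a minimal sketch of a heartbeat check that skips per-worker rescheduling while the job's batch request is still pending. It is only one reading of the intent stated in the PR title, not the actual Mantis code: `HeartbeatCheckSketch`, `Worker`, `WorkerState`, `inPendingBatch`, and `workersToResubmit` are all hypothetical names, and the `heartbeatTimeout` parameter merely stands in for the worker-heartbeat-timeout config.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types are Mantis classes.
final class HeartbeatCheckSketch {

    // Assumed lifecycle: ACCEPTED means the worker is still waiting for resources.
    enum WorkerState { ACCEPTED, LAUNCHED, STARTED }

    static final class Worker {
        final int index;
        final WorkerState state;
        final Instant lastTransition;
        // Assumption: true while the job's all-or-nothing batch request is still being retried.
        final boolean inPendingBatch;

        Worker(int index, WorkerState state, Instant lastTransition, boolean inPendingBatch) {
            this.index = index;
            this.state = state;
            this.lastTransition = lastTransition;
            this.inPendingBatch = inPendingBatch;
        }
    }

    /**
     * Returns the indices of workers that the heartbeat check would resubmit
     * individually. A worker that has exceeded the timeout in the allocation
     * phase is skipped when it still belongs to a pending batch request, so
     * the batch is not broken up into per-worker scheduling.
     */
    static List<Integer> workersToResubmit(List<Worker> workers, Instant now, Duration heartbeatTimeout) {
        List<Integer> result = new ArrayList<>();
        for (Worker w : workers) {
            boolean stuck = w.state == WorkerState.ACCEPTED
                    && Duration.between(w.lastTransition, now).compareTo(heartbeatTimeout) > 0;
            if (stuck && !w.inPendingBatch) {
                result.add(w.index);
            }
        }
        return result;
    }
}
```

Under these assumptions, a large job waiting on the autoscaler keeps its workers flagged as part of the pending batch, so exceeding the heartbeat timeout alone no longer triggers individual rescheduling.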

Behavior changes here:

Andyz26 commented 1 month ago

@fdc-ntflx ptal

github-actions[bot] commented 1 month ago

Test Results

614 tests +75, 604 :white_check_mark: +71, 8m 7s :stopwatch: +11s
142 suites +3, 10 :zzz: +4
142 files +3, 0 :x: ±0

Results for commit a527db33. ± Comparison against base commit 65a79496.

:recycle: This comment has been updated with latest results.