Closed Andyz26 closed 1 month ago
@fdc-ntflx ptal
Test results: 614 tests (+75): 604 passed (+71), 10 skipped (+4), 0 failed (±0). 142 suites (+3), 142 files (+3). Duration: 8m 7s (+11s).
Results for commit a527db33. ± Comparison against base commit 65a79496.
Context
Currently, when a new job is submitted, all of its workers are scheduled as a batch so that allocation is all-or-nothing. However, the job actor's heartbeat check also tries to re-schedule any worker that is "stuck" in the allocation phase for too long (based on the worker-heartbeat-timeout config). The batch scheduling is therefore invalidated once that timeout elapses, which can be problematic for larger jobs that need more time for the cluster autoscaler to allocate the requested resources.
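To make the problem concrete, here is a minimal sketch of the kind of timeout check described above. It is not the actual job actor code: the class `HeartbeatCheckSketch`, the `WorkerState` enum, the method `isStuckInAllocation`, and the 120-second timeout are all hypothetical names and values chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class HeartbeatCheckSketch {
    // Hypothetical, simplified worker states; the real job actor tracks more.
    enum WorkerState { ACCEPTED, LAUNCHED, STARTED }

    /**
     * Returns true when a worker is still waiting for resources (ACCEPTED)
     * and has waited longer than the heartbeat timeout. In the behavior
     * described above, such a worker would be re-scheduled individually.
     */
    static boolean isStuckInAllocation(WorkerState state,
                                       Instant acceptedAt,
                                       Instant now,
                                       Duration heartbeatTimeout) {
        return state == WorkerState.ACCEPTED
            && Duration.between(acceptedAt, now).compareTo(heartbeatTimeout) > 0;
    }

    public static void main(String[] args) {
        Instant submitted = Instant.parse("2024-01-01T00:00:00Z");
        Duration timeout = Duration.ofSeconds(120); // illustrative value only

        // Assume the cluster autoscaler needs five minutes to provide capacity
        // for a large job: every worker in the batch is still ACCEPTED.
        Instant now = submitted.plus(Duration.ofMinutes(5));

        // The check flags the worker even though the batch request is still valid.
        System.out.println(
            isStuckInAllocation(WorkerState.ACCEPTED, submitted, now, timeout)); // prints "true"
    }
}
```

Under these assumptions, every worker of a large job crosses the timeout while the batch request is still pending, so each one is re-scheduled on its own and the all-or-nothing guarantee is lost.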
Behavior changes here:
Checklist
./gradlew build compiles code correctly
./gradlew test passes all tests