Prevent per-node scheduling fallback during batch scheduling retry

Andyz26 commented 1 month ago

Context

Currently, when a new job is submitted, all of its workers are scheduled as a single batch so that allocation is all-or-nothing. However, the job actor's heartbeat check also tries to re-schedule any worker that has been "stuck" in the allocation phase for too long (based on the worker-heartbeat-timeout config). As a result, batch scheduling is invalidated once that timeout elapses and falls back to per-node scheduling, which can be problematic for larger jobs that need more time for the cluster autoscaler to allocate the requested resources.

Behavior changes here:

Batch scheduling failures are retried with no attempt limit. (We rely on the cancel request message from the job actor to interrupt the retries.)
The JobActor heartbeat routine no longer acts on unscheduled workers.
Fixed the akka-tests and added them back to the CI build.
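
For reviewers, the first two changes can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `BatchSchedulingSketch`, `Worker`, `WorkerState`, `scheduleBatchUntilCancelled`, and `workersToResubmit` are hypothetical names, the cancel request is modeled as a simple flag, and the real implementation is actor-based and would wait between retries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; names and types are assumptions, not Mantis classes.
class BatchSchedulingSketch {

    // Simplified worker lifecycle: ACCEPTED means "not yet scheduled".
    enum WorkerState { ACCEPTED, SCHEDULED, STARTED }

    static final class Worker {
        final int id;
        final WorkerState state;
        final long lastHeartbeatMs;

        Worker(int id, WorkerState state, long lastHeartbeatMs) {
            this.id = id;
            this.state = state;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Retries the all-or-nothing batch request with no attempt limit.
     * Only the cancel flag (standing in for the job actor's cancel
     * request message) interrupts the loop.
     *
     * @return the number of attempts on success, or -1 if cancelled
     */
    static int scheduleBatchUntilCancelled(BooleanSupplier tryAllocateAll,
                                           AtomicBoolean cancelled) {
        int attempts = 0;
        while (!cancelled.get()) {
            attempts++;
            if (tryAllocateAll.getAsBoolean()) {
                return attempts;
            }
            // A real scheduler would delay here before the next attempt.
        }
        return -1;
    }

    /**
     * Heartbeat check: returns ids of workers whose heartbeat is stale.
     * Unscheduled (ACCEPTED) workers are skipped so the heartbeat routine
     * never triggers a per-worker re-schedule while the batch is pending.
     */
    static List<Integer> workersToResubmit(List<Worker> workers,
                                           long nowMs,
                                           long heartbeatTimeoutMs) {
        List<Integer> stale = new ArrayList<>();
        for (Worker w : workers) {
            if (w.state == WorkerState.ACCEPTED) {
                continue; // still owned by batch scheduling
            }
            if (nowMs - w.lastHeartbeatMs > heartbeatTimeoutMs) {
                stale.add(w.id);
            }
        }
        return stale;
    }
}
```

In this sketch, a worker that is still waiting for allocation is never returned by `workersToResubmit`, no matter how old its timestamp is, so the only way a pending batch stops retrying is through the cancel flag.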

Checklist

[ ] ./gradlew build compiles code correctly
[ ] Added new tests where applicable
[ ] ./gradlew test passes all tests
[ ] Extended README or added javadocs where applicable

Andyz26 commented 1 month ago

@fdc-ntflx ptal

github-actions[bot] commented 1 month ago

Test Results

614 tests +75  604 :white_check_mark: +71  8m 7s :stopwatch: +11s
142 suites +3  10 :zzz: +4
142 files +3  0 :x: ±0

Results for commit a527db33. ± Comparison against base commit 65a79496.

:recycle: This comment has been updated with latest results.

Netflix / mantis