wdbaruni closed this 4 weeks ago
Something I think may be helpful to clarify in the documentation is the types of jobs that can be queued vs. jobs that do not queue.

My current read of this leaves me with the following understanding: batch jobs may be queued when requirements are not met, while all other job types (service, daemon, and ops) will not queue and instead fail immediately.

It may also be helpful to reject un-queueable jobs on the client side in the event a client sets QueueTimeout in the job spec for a job type that is not batch.
This PR introduces job queueing for when no matching node is available in the network. This can happen when all nodes are busy processing other jobs, or when no node matches the job constraints, such as label selectors, engines, or publishers.
QueueTimeout
By default, queueing is disabled and jobs fail immediately if no matching node is found. Users can enable queueing, and control how long a job can wait in the queue, by setting QueueTimeout to a value greater than zero. There are two ways to set this value:

Job Spec
Users can set this value in the job spec when calling bacalhau job run spec.yaml, such as in the example below.
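For illustration, a minimal spec might look like the following sketch. This is an assumption-laden sketch, not a verified schema: in particular, the placement of QueueTimeout under the task's Timeouts block, and its unit (seconds), are assumptions.

```yaml
# Sketch only: QueueTimeout placement and units are assumptions.
Name: queue-demo
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
    Timeouts:
      QueueTimeout: 1800   # wait in the queue for up to 30 minutes
```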
Requester Node Configuration
Operators can set a default QueueTimeout in the Requester node's configuration so that all submitted jobs with no QueueTimeout are assigned the configured default value. The configuration looks like the sketch below.
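A sketch of what those requester-side settings could look like. The key names here are hypothetical, not the verified configuration schema; consult the shipped configuration reference for the real keys.

```yaml
# Hypothetical key names; check the actual requester configuration schema.
Node:
  Requester:
    Scheduler:
      QueueBackoff: 1m    # retry window between scheduling attempts
    JobDefaults:
      QueueTimeout: 30m   # default applied to jobs that set no QueueTimeout
```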
QueueBackoff
The way the requester node works is that it keeps retrying to schedule queued jobs every QueueBackoff window, which is also configured as shown above and defaults to 1 minute. A future improvement is to remove QueueBackoff and let the scheduler listen to node and cluster changes, re-queueing a job only when it believes the job can be rescheduled, instead of blindly retrying every QueueBackoff window.
Testing
A pre-release has been cut with this change along with https://github.com/bacalhau-project/bacalhau/pull/4051, and has been deployed to development. You can also use the examples below to test against development; just make sure you are using the same client version from the pre-release.
Caveat
The compute nodes heartbeat their available resources every 30 seconds. If there is a spike in jobs submitted in a short period of time, the requester might oversubscribe a compute node, as it takes time before it knows the node is full. This won't fail the jobs, but the compute nodes will queue the jobs locally instead of the requester. If new compute nodes join, the requester won't move jobs away from the first compute node. This is related to moving away from rejecting jobs because the local queue is full, discussed here. There are many ways to improve this, and I'll open a follow-up issue for it, but for now wait some time between job submissions to have more predictable tests.
Sample Job
This is a sample job that takes 5 minutes to finish, configured with queueing enabled up to 1 hour, and requires 3 CPU units. There are two compute nodes in development with 3.2 CPU units each.
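Such a job might be expressed as the sketch below. The field layout is assumed; the image, sleep command, and QueueTimeout placement are illustrative, not the exact spec used in testing.

```yaml
# Illustrative sketch of the sample job described above.
Name: queue-sample
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/sh
        Parameters:
          - -c
          - sleep 300        # run for 5 minutes
    Resources:
      CPU: "3"               # requires 3 CPU units
    Timeouts:
      QueueTimeout: 3600     # queue for up to 1 hour
```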
Scenario 1: Busy resources
Scenario 2: No available node
1. Run a job that asks only for a node with name=walid, or any other unique label.
2. Describe the job. It should be in pending state and not failed.
3. Join your machine as a compute node in a separate terminal, and give it the unique label, like name=walid.
4. Describe the job again, and it should now be in running or completed state.
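The label selector in this scenario could be expressed in the job spec roughly as follows; the exact constraint syntax shown here is an assumption, not verified against the schema.

```yaml
# Assumed constraint syntax for selecting a node labeled name=walid.
Constraints:
  - Key: name
    Operator: "="
    Values:
      - walid
```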
Scenario 3: No queueing
Test the previous scenarios with no queue timeout defined, and the jobs should fail immediately.
Future Improvements
- Add a --queue-timeout flag to docker run to allow queueing with imperative job submissions (P1)
- Change QueueBackoff to listening to cluster state changes (Not a priority)