wdbaruni closed this 4 weeks ago
Something I think may be helpful to clarify in the documentation is the types of jobs that can be queued vs. jobs that do not queue.

My current read of this leaves me with the following understanding: batch jobs may be queued when requirements are not met, while all other job types (service, daemon, and ops) will not queue and instead fail immediately.

It may also be helpful to reject un-queueable jobs on the client side in the event a client sets QueueTimeout in the job spec for a job type that is not batch.
This PR introduces job queueing for when no matching node is available in the network. This can happen when all nodes are busy processing other jobs, or when no node matches the job constraints, such as label selectors, engines, or publishers.
QueueTimeout
By default, queueing is disabled and jobs fail immediately if no matching node is found. Users can enable queueing, and control how long a job can wait in the queue, by setting QueueTimeout to a value greater than zero. There are two ways to set this value:

Job Spec
Users can set this value in the job spec when calling bacalhau job run spec.yaml, such as in the example below.
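For illustration, a minimal spec might look like the following sketch. This is an assumption-laden sketch, not a verified schema: in particular, the placement of QueueTimeout under the task's Timeouts block, and its unit (seconds), are assumptions.

```yaml
# Sketch only: QueueTimeout placement and units are assumptions.
Name: queue-demo
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
    Timeouts:
      QueueTimeout: 1800   # wait in the queue for up to 30 minutes
```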
Requester Node Configuration
Operators can set a default QueueTimeout in the Requester node's configuration so that all submitted jobs with no QueueTimeout are assigned the configured default value. The configuration looks like the sketch below.
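A sketch of what those requester-side settings could look like. The key names here are hypothetical, not the verified configuration schema; consult the shipped configuration reference for the real keys.

```yaml
# Hypothetical key names; check the actual requester configuration schema.
Node:
  Requester:
    Scheduler:
      QueueBackoff: 1m    # retry window between scheduling attempts
    JobDefaults:
      QueueTimeout: 30m   # default applied to jobs that set no QueueTimeout
```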
QueueBackoff
The way the requester node works is that it keeps retrying to schedule queued jobs every QueueBackoff window, which is also configured as shown above and defaults to 1 minute. A future improvement is to remove QueueBackoff and let the scheduler listen to node and cluster changes, re-queueing a job only when it believes the job can be rescheduled, instead of blindly retrying every QueueBackoff window.
Testing
A pre-release has been cut with this change along with https://github.com/bacalhau-project/bacalhau/pull/4051, and has been deployed to development. You can also use the examples below to test against development; just make sure you are using the same client version from the pre-release.
Caveat
The compute nodes heartbeat their available resources every 30 seconds. If there is a spike in jobs submitted in a short period of time, the requester might oversubscribe a compute node, as it takes time before it knows the node is full. This won't fail the jobs, but the compute nodes will queue the jobs locally instead of the requester. If new compute nodes join, the requester won't move jobs away from the first compute node. This is related to moving away from rejecting jobs because the local queue is full, discussed here. There are many ways to improve this, and I'll open a follow-up issue for it, but for now wait some time between job submissions to have more predictable tests.
Sample Job
This is a sample job that takes 5 minutes to finish, configured with queueing enabled up to 1 hour, and requires 3 CPU units. There are two compute nodes in development with 3.2 CPU units each.
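Such a job might be expressed as the sketch below. The field layout is assumed; the image, sleep command, and QueueTimeout placement are illustrative, not the exact spec used in testing.

```yaml
# Illustrative sketch of the sample job described above.
Name: queue-sample
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/sh
        Parameters:
          - -c
          - sleep 300        # run for 5 minutes
    Resources:
      CPU: "3"               # requires 3 CPU units
    Timeouts:
      QueueTimeout: 3600     # queue for up to 1 hour
```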
Scenario 1: Busy resources
Scenario 2: No available node
1. Run a job that asks only for a node with name=walid, or any other unique label.
2. Describe the job. It should be in pending state and not failed.
3. Join your machine as a compute node in a separate terminal, and give it the unique label, like name=walid.
4. Describe the job again, and it should now be in running or completed state.
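The label selector in this scenario could be expressed in the job spec roughly as follows; the exact constraint syntax shown here is an assumption, not verified against the schema.

```yaml
# Assumed constraint syntax for selecting a node labeled name=walid.
Constraints:
  - Key: name
    Operator: "="
    Values:
      - walid
```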
Scenario 3: No queueing
Test the previous scenarios with no queue timeout defined, and the jobs should fail immediately.
Future Improvements
- Add a --queue-timeout flag to docker run to allow queueing with imperative job submissions (P1)
- Change QueueBackoff to listening to cluster state changes (Not a priority)