The Problem
Today we don't have a dedicated component in the Requester node that manages the compute nodes in the network. We only have a transient NodeInfoStore that discovers compute nodes through gossiping and evicts nodes that stop heartbeating for more than 10 minutes. This state is transient: it is lost when the Requester node is restarted, and operators cannot interact with it, for example to drain a node or manually mark it as unhealthy.
In addition, we have no mechanism today to schedule daemon jobs on newly joined compute nodes that match the job selection criteria of existing daemon jobs, or to react when a compute node changes its labels to match, or no longer match, existing daemon jobs.
Requirements
More requirements can be found here:
As an operator, I want to have the power to manually approve compute nodes before they join the network, ensuring better control and security.
As an operator, I need a feature that allows for the auto-joining of pre-authorized compute nodes, eliminating the need for manual intervention.
As an operator, I want to mark certain nodes as ineligible for new job placements, while existing jobs on these nodes can continue to run uninterrupted.
As an operator, I want to drain nodes when necessary and have Bacalhau re-schedule the jobs to other nodes.
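The approval, pausing, and draining stories above imply a small node lifecycle state machine. Below is a minimal Go sketch of one possible shape; the state names (Pending, Approved, Paused, Draining, Unhealthy) and allowed transitions are illustrative assumptions, not Bacalhau's actual API:

```go
package main

import "fmt"

// NodeState sketches the lifecycle states an operator-managed compute
// node could move through. Names are illustrative assumptions.
type NodeState int

const (
	Pending   NodeState = iota // discovered, awaiting manual approval
	Approved                   // eligible for new job placements
	Paused                     // no new placements; existing jobs keep running
	Draining                   // running jobs are re-scheduled elsewhere
	Unhealthy                  // missed heartbeats past the threshold
)

// validTransitions encodes which operator/system actions are allowed
// from each state in this sketch.
var validTransitions = map[NodeState][]NodeState{
	Pending:   {Approved},
	Approved:  {Paused, Draining, Unhealthy},
	Paused:    {Approved, Draining, Unhealthy},
	Draining:  {Approved, Unhealthy},
	Unhealthy: {Approved},
}

// CanTransition reports whether moving between the two states is allowed.
func CanTransition(from, to NodeState) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Approved)) // true: manual approval
	fmt.Println(CanTransition(Pending, Draining)) // false: nothing to drain yet
}
```

Encoding transitions explicitly keeps operator actions (approve, pause, drain) auditable and rejects nonsensical moves, such as draining a node that was never approved.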
As an operator, I want the ability to add labels to nodes from the control plane and override any labels that the compute nodes have set themselves.
As an operator, I need to set the criteria for timeouts and missed heartbeats that will determine when a node is considered to be in an 'unknown' state.
As an operator, I need to set the criteria for timeouts and missed heartbeats that will determine when a node is considered 'unhealthy'.
As an operator, I want the flexibility to set these timeout and heartbeat thresholds at individual node levels, for groups of nodes, or as default settings for any nodes not specified.
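One way to support per-node, per-group, and default thresholds is a layered lookup resolved most-specific first. The sketch below assumes hypothetical field and type names (HealthThresholds, ThresholdConfig, and the missed-heartbeat counters are not Bacalhau's real configuration schema):

```go
package main

import (
	"fmt"
	"time"
)

// HealthThresholds is an illustrative config shape; the field names
// are assumptions, not Bacalhau's actual configuration.
type HealthThresholds struct {
	HeartbeatInterval     time.Duration // expected heartbeat cadence
	MissedBeforeUnknown   int           // missed beats before 'unknown'
	MissedBeforeUnhealthy int           // missed beats before 'unhealthy'
}

// ThresholdConfig layers settings: node-level overrides group-level,
// which overrides the defaults.
type ThresholdConfig struct {
	Defaults HealthThresholds
	Groups   map[string]HealthThresholds // keyed by group name
	Nodes    map[string]HealthThresholds // keyed by node ID
}

// Resolve returns the most specific thresholds that apply to a node.
func (c ThresholdConfig) Resolve(nodeID, group string) HealthThresholds {
	if t, ok := c.Nodes[nodeID]; ok {
		return t
	}
	if t, ok := c.Groups[group]; ok {
		return t
	}
	return c.Defaults
}

func main() {
	cfg := ThresholdConfig{
		Defaults: HealthThresholds{30 * time.Second, 3, 10},
		Groups:   map[string]HealthThresholds{"gpu": {10 * time.Second, 2, 5}},
		Nodes:    map[string]HealthThresholds{"node-1": {5 * time.Second, 1, 3}},
	}
	fmt.Println(cfg.Resolve("node-1", "gpu").MissedBeforeUnhealthy) // 3: node override
	fmt.Println(cfg.Resolve("node-2", "gpu").MissedBeforeUnhealthy) // 5: group fallback
	fmt.Println(cfg.Resolve("node-3", "cpu").MissedBeforeUnhealthy) // 10: default
}
```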
As a user, I expect Bacalhau to automatically reschedule 'batch' and 'service' jobs if their nodes are marked as unhealthy.
As a user, I want Bacalhau to mark the execution of 'ops' jobs as failed if their nodes are flagged as unhealthy.
As a user, I want Bacalhau to schedule daemon jobs on new compute nodes that join the network and match the selection criteria of existing jobs.
As a user, I want Bacalhau to evict daemon jobs from existing compute nodes if their labels are updated to no longer match existing jobs.
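The two daemon-job stories above hinge on re-checking a job's label selector whenever a node joins or its labels change. A simplified equality-only matcher is sketched below; this is an assumption for illustration, as real selectors may support richer operators (in, exists, etc.):

```go
package main

import "fmt"

// matchesSelector reports whether a node's labels satisfy a daemon
// job's equality-based selection criteria. Simplified sketch: every
// selector key must be present on the node with an equal value.
func matchesSelector(nodeLabels, selector map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"zone": "eu-west", "gpu": "true"}

	// A newly joined node whose labels match should receive the daemon job.
	joined := map[string]string{"zone": "eu-west", "gpu": "true", "arch": "amd64"}
	fmt.Println(matchesSelector(joined, selector)) // true

	// After a label update the node no longer matches, so its daemon
	// execution should be evicted.
	updated := map[string]string{"zone": "eu-west", "gpu": "false"}
	fmt.Println(matchesSelector(updated, selector)) // false
}
```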
The Proposal
The proposal is to have a dedicated NodeManager component inside the Requester node that persists the state of compute nodes, manages their lifecycle, and exposes APIs that allow operators to query and mutate the nodes.
The NodeManager will also be responsible for queueing new evaluations for daemon jobs if new nodes join, or their labels are updated.
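One possible shape for that responsibility is a hook that fires on node joins and label updates and enqueues an evaluation per daemon job, leaving the placement decision to the scheduler. A sketch under assumed names (the Evaluation shape and trigger strings are illustrative, not Bacalhau's internals):

```go
package main

import "fmt"

// Evaluation is a minimal stand-in for the scheduler's evaluation
// record; the shape is illustrative.
type Evaluation struct {
	JobID       string
	TriggeredBy string // e.g. "node-join", "node-label-update"
}

// onNodeChange sketches the NodeManager hook: whenever a node joins or
// its labels change, enqueue one evaluation per daemon job so the
// scheduler can re-check placements against the new node set.
func onNodeChange(trigger string, daemonJobIDs []string, enqueue func(Evaluation)) {
	for _, id := range daemonJobIDs {
		enqueue(Evaluation{JobID: id, TriggeredBy: trigger})
	}
}

func main() {
	var queue []Evaluation
	onNodeChange("node-join", []string{"daemon-1", "daemon-2"}, func(e Evaluation) {
		queue = append(queue, e)
	})
	fmt.Println(len(queue), queue[0].JobID) // 2 daemon-1
}
```

Queueing evaluations instead of scheduling inline keeps the NodeManager decoupled from placement logic: it only reports "something about the node set changed".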
The NodeManager shall persist stable info about compute nodes, such as their state, labels, and supported engines, and keep frequently changing metadata, such as available resources, in memory.
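That stable/volatile split could look like the following sketch, where a map stands in for a persistent backend and runtime metadata deliberately lives only in memory. All type and field names here are illustrative assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// NodeRecord holds stable, persisted node info that should survive a
// Requester restart. Field names are illustrative.
type NodeRecord struct {
	ID      string
	State   string // e.g. "approved", "paused", "draining"
	Labels  map[string]string
	Engines []string // supported execution engines
}

// NodeRuntime holds fast-changing metadata kept only in memory.
type NodeRuntime struct {
	AvailableCPU    float64
	AvailableMemory uint64
}

// NodeManager sketches the proposed split: a durable store for stable
// info plus an in-memory map for volatile metadata. The store map here
// stands in for a real persistent backend.
type NodeManager struct {
	mu      sync.RWMutex
	store   map[string]NodeRecord  // stand-in for persistent storage
	runtime map[string]NodeRuntime // lost on restart, by design
}

func NewNodeManager() *NodeManager {
	return &NodeManager{
		store:   make(map[string]NodeRecord),
		runtime: make(map[string]NodeRuntime),
	}
}

// Upsert writes stable info to the store and volatile info to memory.
func (m *NodeManager) Upsert(rec NodeRecord, rt NodeRuntime) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.store[rec.ID] = rec  // would be written through to disk
	m.runtime[rec.ID] = rt // memory only
}

// Get returns both halves of a node's info, if the node is known.
func (m *NodeManager) Get(id string) (NodeRecord, NodeRuntime, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	rec, ok := m.store[id]
	return rec, m.runtime[id], ok
}

func main() {
	nm := NewNodeManager()
	nm.Upsert(NodeRecord{ID: "node-1", State: "approved"}, NodeRuntime{AvailableCPU: 4})
	rec, rt, _ := nm.Get("node-1")
	fmt.Println(rec.State, rt.AvailableCPU) // approved 4
}
```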
Open Questions
Evaluate the need for an explicit RegisterNode API that compute nodes shall call when they start up and want to register with a requester node, instead of just always publishing their info.
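If an explicit registration API were adopted, it could tie into the approval stories above: pre-authorized nodes auto-join, while everything else waits in a pending state for operator approval. A hypothetical sketch; RegisterNodeRequest, the response shape, and the state strings are all assumptions, not a committed API:

```go
package main

import "fmt"

// RegisterNodeRequest/Response are hypothetical shapes for the
// explicit registration call discussed above.
type RegisterNodeRequest struct {
	NodeID string
	Labels map[string]string
}

type RegisterNodeResponse struct {
	Accepted bool
	State    string // "approved" immediately, or "pending" manual approval
}

// registry tracks pre-authorized node IDs that may auto-join.
type registry struct {
	preAuthorized map[string]bool
}

// Register sketches the join flow: pre-authorized nodes are approved
// without intervention; unknown nodes are parked as pending.
func (r *registry) Register(req RegisterNodeRequest) RegisterNodeResponse {
	if r.preAuthorized[req.NodeID] {
		return RegisterNodeResponse{Accepted: true, State: "approved"}
	}
	return RegisterNodeResponse{Accepted: true, State: "pending"}
}

func main() {
	r := &registry{preAuthorized: map[string]bool{"node-1": true}}
	fmt.Println(r.Register(RegisterNodeRequest{NodeID: "node-1"}).State) // approved
	fmt.Println(r.Register(RegisterNodeRequest{NodeID: "node-2"}).State) // pending
}
```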