bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Nodes Manager #3945

Open wdbaruni opened 5 months ago

wdbaruni commented 5 months ago

The Problem

Today the Requester node has no dedicated component for managing the compute nodes in the network. We only have a transient NodeInfoStore that discovers compute nodes through gossiping and evicts nodes that have stopped heartbeating for over 10 minutes. This state is transient and is lost when the Requester node restarts. Operators also cannot interact with it, for example to drain a node or to manually mark a node as unhealthy.

In addition, we have no mechanism today to schedule daemon jobs on newly joined compute nodes that match the selection criteria of existing daemon jobs, or to react when a compute node changes its labels so that it starts or stops matching existing daemon jobs.
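
The missing daemon-job behaviour could be sketched as a small reconcile step that runs whenever a node joins or its labels change. This is an illustration only: it assumes selectors are plain key/value equality matches, which is simpler than real label selectors, and `matches`, `reconcile`, and the `action` values are hypothetical names.

```go
package main

// action is what a hypothetical reconciler would do for one
// (daemon job, compute node) pair.
type action string

const (
	actionNone     action = "none"
	actionSchedule action = "schedule"
	actionEvict    action = "evict"
)

// matches reports whether every key/value pair in selector is present
// in labels. Assumption: equality-only selectors.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// reconcile decides what to do for a daemon job on a node, given the
// node's current labels and whether the job is already running there.
func reconcile(selector, labels map[string]string, running bool) action {
	m := matches(selector, labels)
	switch {
	case m && !running:
		return actionSchedule // new or relabelled node now matches
	case !m && running:
		return actionEvict // labels changed and no longer match
	default:
		return actionNone
	}
}
```

For example, a daemon job selecting `region=eu` would be scheduled on a newly joined node labelled `region=eu`, and evicted from a node whose label is later changed to `region=us`.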

Requirements

More requirements can be found here:

  1. As an operator, I want to have the power to manually approve compute nodes before they join the network, ensuring better control and security.
  2. As an operator, I need a feature that allows for the auto-joining of pre-authorized compute nodes, eliminating the need for manual intervention.
  3. As an operator, I want to mark certain nodes as ineligible for new job placements, while existing jobs on these nodes can continue to run uninterrupted.
  4. As an operator, I want to drain nodes when necessary and have Bacalhau re-schedule the jobs to other nodes.
  5. As an operator, I want the ability to add labels to nodes from the control plane and override any labels that the compute nodes have set themselves.
  6. As an operator, I need to set the criteria for timeouts and missed heartbeats that will determine when a node is considered to be in an 'unknown' state.
  7. As an operator, I need to set the criteria for timeouts and missed heartbeats that will determine when a node is considered 'unhealthy'.
  8. As an operator, I want the flexibility to set these timeout and heartbeat thresholds at individual node levels, for groups of nodes, or as default settings for any nodes not specified.
  9. As a user, I expect Bacalhau to automatically reschedule 'batch' and 'service' jobs if their nodes are marked as unhealthy.
  10. As a user, I want Bacalhau to mark the execution of 'ops' jobs as failed if their nodes are flagged as unhealthy.
  11. As a user, I want Bacalhau to schedule daemon jobs on new compute nodes that join the network and match the selection criteria of existing jobs.
  12. As a user, I want Bacalhau to evict daemon jobs from existing compute nodes if their labels are updated and no longer match existing jobs.

The Proposal

Open Questions

  1. Evaluate the need for an explicit RegisterNode API that compute nodes shall call when they start up and want to register with a requester node, instead of just always publishing their info.
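
To help evaluate this question, here is a rough sketch of what an explicit registration step could look like on the requester side, combined with manual approval and pre-authorization from requirements 1 and 2. Only the name `RegisterNode` comes from the question above; its signature, the `registry` type, and the `pending`/`approved` states are assumptions made for illustration. The sketch is not safe for concurrent use.

```go
package main

import "errors"

// membership is a hypothetical approval state for a compute node.
type membership string

const (
	pending  membership = "pending"
	approved membership = "approved"
)

var errUnknownNode = errors.New("node has not registered")

// registry is a hypothetical requester-side component that tracks
// which compute nodes have registered and whether they are approved.
type registry struct {
	preAuthorized map[string]bool
	members       map[string]membership
}

func newRegistry(preAuthorized ...string) *registry {
	r := &registry{
		preAuthorized: make(map[string]bool),
		members:       make(map[string]membership),
	}
	for _, id := range preAuthorized {
		r.preAuthorized[id] = true
	}
	return r
}

// RegisterNode would be called by a compute node at startup.
// Pre-authorized nodes join immediately; others wait for an operator.
func (r *registry) RegisterNode(nodeID string) membership {
	if m, ok := r.members[nodeID]; ok {
		return m // re-registration keeps the existing decision
	}
	m := pending
	if r.preAuthorized[nodeID] {
		m = approved
	}
	r.members[nodeID] = m
	return m
}

// Approve is the manual operator action for a pending node.
func (r *registry) Approve(nodeID string) error {
	if _, ok := r.members[nodeID]; !ok {
		return errUnknownNode
	}
	r.members[nodeID] = approved
	return nil
}
```

An explicit call like this gives the requester a natural place to hold the approval decision, whereas nodes that only publish their info leave the requester to infer membership from whatever it happens to receive.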

Tasks

- [ ] https://github.com/bacalhau-project/expanso-planning/issues/566
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/567
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/568
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/569
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/573
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/570
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/571
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/572
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/574
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/575
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/576
- [ ] https://github.com/bacalhau-project/bacalhau/issues/3827
- [ ] https://github.com/bacalhau-project/bacalhau/issues/3826
- [ ] #3825
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/580
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/581
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/587
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/613
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/620
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/671
- [ ] https://github.com/bacalhau-project/expanso-planning/issues/708

wdbaruni commented 5 months ago

This might be a prerequisite for bacalhau-project/expanso-cloud#2, and may be needed to support bacalhau-project/expanso-cloud#7.