bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

The requester node attempts to schedule work on disconnected nodes resulting in the job never running #3784

Closed frrist closed 2 months ago

frrist commented 3 months ago

Due to changes here:

Current proposal is to:

cc @rossjones & @wdbaruni to weigh in on how the new event system introduced in https://github.com/bacalhau-project/bacalhau/pull/3772 can be used to force scheduling of executions when offline compute nodes come online again.

wdbaruni commented 3 months ago

What is the proposal here? I believe the default option now is to auto-approve nodes, and only schedule on approved and connected nodes. Is any of that still missing?

frrist commented 3 months ago

Yeah, scheduling only on connected and approved nodes is what's missing. Currently we schedule on disconnected nodes for some job types and ignore approval state for other job types. Frankly, it's a bit of a mess:

Or rather than modify, allow these aspects of scheduling to be configured.
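To make the "configured" option concrete, here is a minimal sketch of a scheduling filter driven by two switches. `NodeInfo`, `SchedulingPolicy`, and `Eligible` are hypothetical names invented for illustration; they are not Bacalhau's actual types or configuration keys.

```go
package main

import "fmt"

// NodeInfo is a hypothetical, simplified view of a compute node.
type NodeInfo struct {
	ID        string
	Approved  bool
	Connected bool
}

// SchedulingPolicy is a hypothetical config struct (an assumption,
// not Bacalhau's real configuration). Each flag turns one check on.
type SchedulingPolicy struct {
	RequireApproved  bool
	RequireConnected bool
}

// Eligible returns the nodes that pass every check the policy enables.
func (p SchedulingPolicy) Eligible(nodes []NodeInfo) []NodeInfo {
	var out []NodeInfo
	for _, n := range nodes {
		if p.RequireApproved && !n.Approved {
			continue
		}
		if p.RequireConnected && !n.Connected {
			continue
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []NodeInfo{
		{ID: "n1", Approved: true, Connected: true},
		{ID: "n2", Approved: true, Connected: false},
		{ID: "n3", Approved: false, Connected: true},
	}
	strict := SchedulingPolicy{RequireApproved: true, RequireConnected: true}
	for _, n := range strict.Eligible(nodes) {
		fmt.Println(n.ID)
	}
}
```

With both flags set only `n1` is selected; clearing `RequireConnected` would also admit `n2`, which is roughly the behaviour a job type that tolerates offline nodes might want.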

Further, we need to ensure that work scheduled on an offline node runs when the node comes back online, for which we will need https://github.com/bacalhau-project/bacalhau/pull/3772. E.g. the orchestrator could listen for connected events and create an evaluation to execute the work.
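A rough sketch of that idea, assuming the orchestrator tracks which jobs have work waiting on each node. The event and evaluation types here are made up for illustration and do not reflect the API introduced in the PR above.

```go
package main

import "fmt"

// NodeEvent is a hypothetical connection-state event for a compute node.
type NodeEvent struct {
	NodeID    string
	Connected bool
}

// Evaluation is a hypothetical request to re-run scheduling for a job.
type Evaluation struct {
	JobID  string
	Reason string
}

// Orchestrator keeps, per node ID, the jobs with work waiting on that node.
type Orchestrator struct {
	pending map[string][]string
}

// HandleNodeEvent returns one evaluation per job that was waiting on a
// node that has just connected. Disconnect events produce nothing.
func (o *Orchestrator) HandleNodeEvent(ev NodeEvent) []Evaluation {
	if !ev.Connected {
		return nil
	}
	jobs := o.pending[ev.NodeID]
	delete(o.pending, ev.NodeID) // each waiting job is evaluated once
	evals := make([]Evaluation, 0, len(jobs))
	for _, jobID := range jobs {
		evals = append(evals, Evaluation{
			JobID:  jobID,
			Reason: "node " + ev.NodeID + " reconnected",
		})
	}
	return evals
}

func main() {
	o := &Orchestrator{pending: map[string][]string{"n2": {"job-a", "job-b"}}}
	for _, e := range o.HandleNodeEvent(NodeEvent{NodeID: "n2", Connected: true}) {
		fmt.Println(e.JobID, "-", e.Reason)
	}
}
```

Clearing the pending entry on reconnect keeps a flapping node from generating duplicate evaluations for the same work.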

frrist commented 2 months ago

Another point to consider: how can we allow users to define different scheduling heuristics for compute nodes? E.g. nodes in a data center ought to have a stricter connectedness requirement than nodes that are expected to go offline for longer periods of time (e.g. submarine compute nodes).

rossjones commented 2 months ago

@wdbaruni previously suggested adding another (third) timeout in the future, which would allow a node to be offline for that long before being considered dead.
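One way to picture that extra timeout, as a sketch only: the existing timeouts are not described in this thread, so the example models just the gap since the last heartbeat against two hypothetical thresholds. All names and durations below are assumptions, chosen to show how a data center node and a long-offline node could be given different tolerances.

```go
package main

import (
	"fmt"
	"time"
)

// Liveness is a hypothetical three-state view of a compute node.
type Liveness string

const (
	Connected    Liveness = "connected"
	Disconnected Liveness = "disconnected"
	Dead         Liveness = "dead"
)

// LivenessPolicy holds two hypothetical thresholds on heartbeat silence.
type LivenessPolicy struct {
	DisconnectAfter time.Duration // silence before a node counts as disconnected
	DeadAfter       time.Duration // silence before a node counts as dead
}

// Classify maps the time since the last heartbeat to a liveness state.
func (p LivenessPolicy) Classify(sinceLastHeartbeat time.Duration) Liveness {
	switch {
	case sinceLastHeartbeat >= p.DeadAfter:
		return Dead
	case sinceLastHeartbeat >= p.DisconnectAfter:
		return Disconnected
	default:
		return Connected
	}
}

// Illustrative values only; real deployments would choose their own.
var (
	DataCenter = LivenessPolicy{DisconnectAfter: 30 * time.Second, DeadAfter: 5 * time.Minute}
	Submarine  = LivenessPolicy{DisconnectAfter: 30 * time.Second, DeadAfter: 30 * 24 * time.Hour}
)

func main() {
	silence := time.Hour
	fmt.Println("data center node:", DataCenter.Classify(silence))
	fmt.Println("submarine node:", Submarine.Classify(silence))
}
```

After an hour of silence the data center node is treated as dead, while the submarine node is only disconnected and could still receive its scheduled work on reconnecting.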