bacalhau-project / bacalhau

Compute over Data framework for public, transparent, and optionally verifiable computation
https://docs.bacalhau.org
Apache License 2.0

Job Progress Visibility #3868

Open wdbaruni opened 7 months ago

wdbaruni commented 7 months ago

The Problem

Today bacalhau docker run and bacalhau job run have very limited visibility into the job's progress, node selection, possible retries, and where things might be failing. A big part of the reason is that the commands rely on job and execution state to print progress, instead of the job history events, which were recently enriched with more details and hints in https://github.com/bacalhau-project/bacalhau/pull/3771

The Proposal

The proposal is to use job level and execution level history events to display the progress. There are some missing features needed today to achieve this, including:

  1. Implement pagination for querying job history
  2. Figure out a nice way to display these events on the console. Maybe group events by execution and only display the latest event for each execution.
  3. In the future we can implement this using websockets when we have better datastore event watcher logic that is not only limited to live events.
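As a sketch of item 2 above, grouping history events by execution and displaying only the latest event per execution could look like this. The event fields (execution_id, seq, message) are illustrative stand-ins, not the actual history schema:

```python
# Group job history events by execution and keep only the latest event per
# execution for console display. Field names are illustrative assumptions.

def latest_event_per_execution(events):
    """events: iterable of dicts with 'execution_id', 'seq', 'message'."""
    latest = {}
    for ev in sorted(events, key=lambda e: e["seq"]):
        latest[ev["execution_id"]] = ev  # later seq overwrites earlier
    return latest

events = [
    {"execution_id": "e-1", "seq": 1, "message": "AskForBid"},
    {"execution_id": "e-2", "seq": 2, "message": "AskForBid"},
    {"execution_id": "e-1", "seq": 3, "message": "BidAccepted"},
]
for exec_id, ev in latest_event_per_execution(events).items():
    print(f"{exec_id}: {ev['message']}")
# prints:
# e-1: BidAccepted
# e-2: AskForBid
```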

Priority

frrist commented 6 months ago

Chain evaluations so that we can always reach child evaluations from a parent one.

I am not sure this is feasible given our current approach to creating evaluations. Since an evaluation for a job can be submitted at nearly any point in a job's life-cycle, and the current evaluation for the job is not known at that point, constructing the lineage is challenging without repeated scans of the job store to find the current evaluation.

One potential solution involves scanning the job store for all evaluations each time a new one is introduced, but this approach's performance is likely to degrade with each additional evaluation, which doesn't seem sustainable.

wdbaruni commented 6 months ago

Yeah, we should avoid solutions that require scanning the data store. I understand the complexities here, especially since we trigger multiple executions per evaluation, forking the lineage path.

Though if we take a step back, we are mainly interested in the evaluation that was created when the job was submitted, along with its executions and any evaluations created while trying to bring those executions to a running or completed state. That means evaluations coming from totally different sources, such as bacalhau job stop or the upcoming bacalhau-project/bacalhau#3945, are outside the lineage we want to track, and we should handle collisions gracefully, such as stopping a job while still tracking its submission.

Submit Job

When a user submits a job, we create an evaluation and return it to the caller. From there we can track the progress and lineage as follows:

  1. If the state of the evaluation is still pending, it means it hasn't been processed yet, and we sleep a little.
  2. If the state is completed, there should either be a follow-up evaluation because scheduling this one failed, or we should see new executions created whose state is AskForBid.
  3. After this point, we have two options:

Option 1: Track the execution state

  1. We should now track the executions instead of the evaluations, and exit when the executions are in a terminal state for batch jobs or a running state for long-running jobs.
  2. The tricky part is that these executions might fail and be replaced by other executions. In this case we can encode a FollowupEvalID in the execution, where the FollowupEvalID is the evaluation responsible for handling the execution failure and should hold the decision on whether more executions were created, a delayed evaluation was enqueued, or the job was failed.
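The Option 1 steps above can be sketched as a polling loop against an in-memory stand-in for the job store. Here get_evaluation, get_executions, the state names, and the followup_eval_id field are all assumptions for illustration, not the actual job store API:

```python
import time

class Store:
    """Minimal in-memory stand-in for the job store (illustrative only)."""
    def __init__(self, evals, execs):
        self._evals, self._execs = evals, execs
    def get_evaluation(self, eval_id):
        return self._evals[eval_id]
    def get_executions(self, job_id):
        return [e for e in self._execs if e["job_id"] == job_id]

def track_submission(store, eval_id, poll=0.01):
    """Follow the lineage from the submission evaluation to its executions."""
    while True:
        ev = store.get_evaluation(eval_id)
        if ev["state"] == "pending":
            time.sleep(poll)  # not processed yet, back off and retry
            continue
        execs = [e for e in store.get_executions(ev["job_id"])
                 if e["eval_id"] == eval_id]
        failed = [e for e in execs
                  if e["state"] == "failed" and e.get("followup_eval_id")]
        if failed:
            # a failed execution points at the evaluation that decides on
            # retries, delays, or job failure; jump to that lineage path
            eval_id = failed[0]["followup_eval_id"]
            continue
        if execs and all(e["state"] in ("running", "completed") for e in execs):
            return execs  # terminal for batch jobs, running for long-running
        time.sleep(poll)  # executions still in AskForBid etc.

store = Store(
    {"ev-1": {"state": "completed", "job_id": "j-1"},
     "ev-2": {"state": "completed", "job_id": "j-1"}},
    [{"job_id": "j-1", "eval_id": "ev-1", "state": "failed",
      "followup_eval_id": "ev-2"},
     {"job_id": "j-1", "eval_id": "ev-2", "state": "completed"}],
)
print([e["state"] for e in track_submission(store, "ev-1")])  # → ['completed']
```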

Option 2: Maintain evaluation lineage

  1. When a compute node triggers OnBidComplete, the requester will create a new evaluation with the previous evalID as its source. The requester will fetch the evalID from the execution object itself.
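A sketch of Option 2: on OnBidComplete the requester creates a follow-up evaluation whose source is the evalID carried on the execution, and walking the source links reconstructs the lineage. Field names here are illustrative assumptions:

```python
# Hypothetical sketch of maintained evaluation lineage (Option 2).

def on_bid_complete(store, execution):
    """Create a follow-up evaluation, parented on the execution's evalID."""
    new_eval = {
        "id": f"ev-{len(store) + 1}",
        "source_eval_id": execution["eval_id"],  # parent from the execution
    }
    store[new_eval["id"]] = new_eval
    return new_eval

def lineage(store, eval_id):
    """Walk source links back to the submission evaluation."""
    chain = []
    while eval_id is not None:
        chain.append(eval_id)
        eval_id = store.get(eval_id, {}).get("source_eval_id")
    return list(reversed(chain))

evals = {"ev-1": {"id": "ev-1", "source_eval_id": None}}  # submission eval
ev2 = on_bid_complete(evals, {"eval_id": "ev-1"})
print(lineage(evals, ev2["id"]))  # → ['ev-1', 'ev-2']
```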

Note:

In both options, the tricky part is that the evaluation broker deduplicates evaluations for the same job. We need to update the state of the deduped evaluations to cancelled. Maybe we need to encode the evalID that was processed instead of the deduped ones, but maybe that's not necessary: if we asked three nodes to execute a job, three evaluations will be created with the same parent, and following just one path will be enough. If an evaluation comes from another source, such as stopping a job, then all execution evaluations will be cancelled. That will help the tracker reach a terminal state, and we can decide what to do with that information.
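The dedup behavior described in the note could look roughly like this; the broker shape and state names are stand-ins for whatever the broker actually does:

```python
# Sketch: the broker dedupes evaluations per job, and marks duplicates as
# cancelled so a tracker following any one path still reaches a terminal state.

def enqueue(broker, evaluation):
    job_id = evaluation["job_id"]
    if job_id in broker["pending"]:
        # deduped: record a terminal state instead of silently dropping it
        evaluation["state"] = "cancelled"
    else:
        broker["pending"][job_id] = evaluation["id"]
        evaluation["state"] = "enqueued"
    return evaluation

broker = {"pending": {}}
first = enqueue(broker, {"id": "ev-1", "job_id": "j-1"})
dup = enqueue(broker, {"id": "ev-2", "job_id": "j-1"})
print(first["state"], dup["state"])  # → enqueued cancelled
```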

Stop Job

Similar process, but the tracker will stop when there are no more follow-up evaluations and the number of active executions reaches 0.
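That stop condition can be sketched as a small predicate; the followup_eval_id field and the state names are assumed for illustration:

```python
# Sketch of the stop-job tracker's termination check: no follow-up
# evaluations remain and no execution is still active.

def stop_tracking_done(evaluations, executions):
    no_followups = all(not ev.get("followup_eval_id") for ev in evaluations)
    active = [ex for ex in executions
              if ex["state"] not in ("completed", "failed", "cancelled")]
    return no_followups and len(active) == 0

print(stop_tracking_done(
    [{"id": "ev-1"}],
    [{"state": "cancelled"}, {"state": "completed"}],
))  # → True
```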

Database Design

In all options, we don't need a relational database; tracking the lineage can be done with the same existing index design, and possibly without additional indexes.

frrist commented 6 months ago

A couple questions to clarify my understanding of the proposal:

For option 1: Do I understand correctly that this proposes a polling mechanism on the job store? That is, from within the requester node:

  1. Submit a job
  2. Poll the job store by EvalID for the evaluation state (GetEvaluation)
     2.a If eval.state == pending, goto 2.
     2.b If eval.state == complete, set EvalID to FollowupEvalID, goto 2.
     2.c Else use the jobID from the eval to fetch the list of executions for the job, filter to the one with EvalID, continue.
  3. Poll the job store for the execution state (requires implementation of GetExecution + Index)
     3.a If exe.state != terminal, goto 3.
     3.b If exe.state == failure, look up the eval ID from exe.FollowupEvalID, goto 2 (not sure on this part).

For option 2: This is basically steps 1 and 2 from above, then waiting for an OnBidComplete event and linking up the previous evalID. Which isn't guaranteed to be the "true" lineage of all evaluation states between the time a job was submitted and the time it completed, since there could be N failed evaluations in between - I think?

I can't help but question what the performance impact will be on the job store given all of this polling - especially when the overhead/strain of multiple distributed requester nodes is added. But we can put a pin in that for later.

if we asked three nodes to execute a job, three evaluations will be created with the same parent, and just following one path will be enough

Is that enough? What if two of the compute nodes fail and the one we follow succeeds, or vice versa? Couldn't that lead to an incomplete view of the actual job state?

Similar process, but the tracker will stop when there are no more follow-up evaluations and the number of active executions reaches 0.

I see, so the implicit proposal here is to create a "tracker-thingy" that polls the job-store and attempts to construct some lineage of the job+eval+execution state?

If I understand the above correctly, and since we are not allowed to use a relational database to solve this problem, I'd like to propose a third option: using the JobStore's event API to subscribe to events related to a given job. This should be relatively straightforward given our current implementation supports watching for different events, as does the expected superseding implementation.

The challenging part here is subscribing to events for a job before it's created, so that all events may be captured and exposed to the client as they are observed. My initial sketch was implementing an API like orchestrator/jobs/{job_id}/watch (and job watch {job_id}) that we call on the client just before submitting the job. However, since the client is unable to know the jobID ahead of time, this won't work. We could of course subscribe after creating the job, but that's susceptible to missing events, or requires conditional buffering on the Requester - which is complicated. Do you have any suggestions for this API? What could we use instead of a JobID?
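To make the race concrete, here is a toy append-only log illustrating why subscribing only after job creation can drop early events, and how replaying from an offset captured before submission avoids it. The offset-based subscribe is one possible mitigation assumed for this sketch, not an existing API:

```python
class EventLog:
    """Toy append-only event log with offset-based reads (illustrative only)."""
    def __init__(self):
        self.events = []
    def append(self, ev):
        self.events.append(ev)
    def subscribe(self, from_offset):
        # replaying from an offset recorded before submission sidesteps
        # not knowing the jobID ahead of time
        return self.events[from_offset:]

log = EventLog()
offset_before_submit = len(log.events)          # recorded before submitting
log.append({"job_id": "j-1", "event": "Created"})   # emitted during submit
log.append({"job_id": "j-1", "event": "AskForBid"})

late = log.subscribe(from_offset=len(log.events))   # subscribed after creation
replay = log.subscribe(from_offset=offset_before_submit)
print(len(late), len(replay))  # → 0 2
```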

wdbaruni commented 2 months ago

@udsamani I've updated the issue description and changed the proposal to use job history instead of evaluations