Job store garbage collection

Today we retain all submitted job information in the datastore indefinitely, which slows down bacalhau list operations.

We need to provide a garbage collector that deletes all information for completed jobs, including their executions, evaluations and history events, once a configurable window has elapsed since their completion time (JobGCThreshold in the proposed configuration).

The garbage collector should be agnostic of the backend implementation of the job store so that we can reuse it with different implementations, such as NATS KV.

We should also explore different GC configurations for different models. For example:

Evaluations are just triggers to evaluate jobs and can be deleted within a few hours of being processed, even if the job is not completed yet.
History events may grow out of hand, especially once we add job updates or see more use of long-running jobs.
Rejected or failed executions can be deleted even before the job is marked as completed if the job is a long-running job.

Proposal:

We can limit the focus of this issue to implementing GC for jobs and evaluations, and then open a follow-up issue to decide how best to manage the state of history events and executions. Compaction of those events may be a better approach than simply deleting them based on their age.

The configuration should look like:

  # Job store configuration
  StateStore:
    JobGCInterval: 10m
    JobGCThreshold: 30m
    EvalGCThreshold: 1h
    Backend:
      Type: BoltDB
      Config: {} # config related to the backend type

bacalhau-project / bacalhau

Job store garbage collection #4174
