brigadecore / brigade

Event-driven scripting for Kubernetes
https://brigade.sh/
Apache License 2.0

Use-case: Durable Workflow #369

Closed: mumoshu closed this issue 4 years ago

mumoshu commented 6 years ago

Extracted from https://github.com/Azure/brigade/issues/125#issuecomment-370974394

First of all, I'm not saying that we'd need to bake a workflow engine into Brigade.

But I wanted to discuss how we could achieve this use-case in both the shorter term and the longer term.

The interim solution can be implementing something on top of Brigade, and/or collaborating with other OSS projects.

Problem

If a brigade.js corresponds to a "workflow" composed of one or more jobs, that workflow can be said to be "durable" when it survives pod/node/Brigade failures. This characteristic - "durability" - is useful when the total time required to run the workflow from start to finish is considerably long. Therefore, even when the Brigade worker fails in the middle of the workflow, the restarted worker should continue the workflow from where it failed.

This use-case is typically achieved via a so-called (durable) workflow engine.

Think of a workflow engine as a stateful service that completes your DAG of jobs by supporting:

From a Brigade user's perspective, if Brigade somehow achieved durability, no GitHub PR status would remain pending forever when something failed, and no time-consuming job (like running an integration test suite) would be rerun when restarting a workflow.

Possible solutions

I have two possible solutions in my mind today.

Although configuring the integration could be messy, I slightly prefer option 2 below, which keeps Brigade doing one thing very well: scripting, not running durable workflows!

  1. We can of course implement a lightweight workflow-engine-like thing inside each Brigade gateway. But I feel like that would just be reinventing the wheel.

  2. We could also decide that Brigade's scope does not include a durable workflow engine. Then we could investigate possible integrations with another workflow engine to provide durability to Brigade scripts.

I guess the possible integration may end up including:

mumoshu commented 6 years ago

FWIW, I've shared a possible solution with Argo here.

blakestoddard commented 6 years ago

One thing in particular that I've found missing from Brigade is the ability to rerun failed builds. With CI tools like Buildkite, if a build fails because of an intermittent issue (say a DNS resolution fails and docker push doesn't work, causing the build to fail), I can hit "retry" and it does just that. With Brigade, that functionality doesn't exist -- the only way to retry a build is to commit something again.

I'm a big +1 on having the ability to retry builds. Whether that means full durability support is beyond me, but being able to retry things that have failed without having to find something to commit in the repo would be huge for me.

mumoshu commented 6 years ago

@blakestoddard Thanks for chiming in! Yeah, that's a good point.

From my perspective, implementing durability in Brigade at the build level like that would make sense to me, assuming:

We currently mark each build as "processed" (hence not needing to be retried later) as long as the Brigade controller is able to "create" the pod. But it doesn't wait until the pod exits without an error.

Relevant code: https://github.com/Azure/brigade/blob/ae0685f9a57e5d0ceef2bf62d0773e5f68250d17/brigade-controller/cmd/brigade-controller/controller/controller.go#L80-L96

https://github.com/Azure/brigade/blob/9f314937250463d823d69a595f158aa89000ceca/brigade-controller/cmd/brigade-controller/controller/handler.go#L24-L65

I believe we can make syncSecret async and wrap it in a control loop so that Brigade periodically tries to run any unfinished build that has no ongoing worker pod, until the build finally finishes with a successful status.

@technosophos @adamreese Would this change make sense to you, within the scope of Brigade?

Even if this feature existed, I'd prefer using another workflow engine for a complex workflow composed of multiple Brigade scripts. My suggestion here covers the "at-least-once build run" use-case only.

acham commented 5 years ago

Since a job run returns a promise, this works well for retries (tested): https://www.npmjs.com/package/promise-retry. The only catch is that the job name needs to be different for each attempt.

const { Job } = require('brigadier');           // Brigade's Job API
const promiseRetry = require('promise-retry');

const MAX_ATTEMPTS = 5;

// promise-retry re-invokes the callback until it resolves or `retries` is
// exhausted; note that `retries` counts retries after the first attempt, so
// this allows up to MAX_ATTEMPTS + 1 runs in total.
let verizonPromise = promiseRetry((retry, number) => {
  console.log(`Verizon job attempt ${number}`);
  // Job names must be unique, hence the attempt-number suffix.
  var verizonJob = new Job(`vz-attempt${number}`, 'my-verizon-image');

  return verizonJob.run().catch(retry);
}, {
  retries: MAX_ATTEMPTS,
  factor: 1,
  minTimeout: 500
});

verizonPromise.catch((err) => {
  console.error(`Verizon job unsuccessful after ${MAX_ATTEMPTS} attempts, aborting workflow`);
  console.error(err);
  process.exit(1);
});

technosophos commented 5 years ago

Oh! That is an interesting strategy I had never considered! I wonder if it makes sense to include promise-retry as a core library for Brigade.

acham commented 5 years ago

Probably. From the user's perspective, I think that having a property Job.attempts (1 by default) when defining the job would be a convenient interface. The other retry options seem to be designed mainly for HTTP calls. I can attempt a pull request.
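
For illustration, here is a rough sketch of how such an attempts option could be layered on top of the promise-retry approach above. The runWithAttempts helper, image, and tasks are hypothetical, not an existing brigadier API; a real implementation would presumably live in brigade-utils or on Job itself:

const { Job } = require("brigadier");
const promiseRetry = require("promise-retry");

// Hypothetical helper; `attempts` plays the role of the proposed Job.attempts
// property (1 by default).
function runWithAttempts(name, image, tasks, attempts = 1) {
  return promiseRetry((retry, number) => {
    // Job names must be unique, so suffix the attempt number.
    const job = new Job(`${name}-attempt-${number}`, image, tasks);
    return job.run().catch(retry);
  }, {
    retries: attempts - 1,   // promise-retry counts retries after the first attempt
    factor: 1,
    minTimeout: 500
  });
}

// Usage inside a brigade.js event handler (image and tasks are illustrative):
// runWithAttempts("integration-test", "node:12", ["npm test"], 3)
//   .then((result) => console.log(result.toString()));

Wrapping the retry loop this way also takes care of the unique-job-name requirement mentioned earlier.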

technosophos commented 5 years ago

That would be good. I don't think jobs should do this by default, but I would love to see it as an easy add-on in pipelines for those cases where this is the desired behavior.

/cc @vdice

krancour commented 5 years ago

Retries aside, it seems there is also interest here in resuming a pipeline where it left off if a worker dies mid-pipeline. Is that right?

And therefore, even when the brigade worker failed in the middle of the workflow, the restarted worker should continue the workflow from where it had failed.

I've implemented this in other systems. Realistically, this cannot really be accomplished without major architectural changes and the introduction of a dependency on some kind of message-oriented middleware.

I am curious, however, how common an occurrence it is for workers to fail mid-pipeline. Is it frequent for you, and if so, do you know why? I'm in no way arguing against building for failure, but optimizations that introduce major architectural changes and significant new dependencies aren't things to enter into lightly, so I'm curious to see if we can get more bang for the buck by treating the root cause of worker failures.

acham commented 5 years ago

Generally speaking, much of the design of Kubernetes considers pods to be somewhat fleeting entities which can easily fail. Brigade, being designed for Kubernetes, would do well to also consider them as such.

radu-matei commented 5 years ago

Possibly related: #977

mumoshu commented 5 years ago

Dumping my memory here, as this issue was featured in today's Brigade meeting.

I think we have several things that can be done within the scope of this issue:

  1. To help make your build idempotent as a whole, add a "checkpointing" helper to a place like brigade-utils.
    • This can be a JS function/class that wraps Job and first checks for the existence of a specific K8s object (which could be a custom resource like Checkpoint) keyed by a unique key for the job. If it exists, consider the job already run, skip creating the Job pod, and instead pull the previous result from that K8s object.
      // DurableJob? CheckpointedJob?
      var test = new DurableJob(underlyingJob, {key: `${project}-${prNumber}-test`})
      var res = test.run()
      // `test.run` either creates a pod as usual, or, if a `Checkpoint` resource whose metadata.name is `${project}-${prNumber}-test` exists, consults it to obtain `res`

      With this, you can rerun your build without worrying too much about duplicated results or wasted cluster resources. (A rough sketch of such a wrapper appears after this list.)

  2. @acham's retry helper does help make your Brigade script tolerate expected/voluntary transient failures. Compared to that, item 1 above is a more high-level helper. You could probably wrap a retried job in a checkpointer for completeness.
  3. Manually detecting and rerunning builds that failed due to transient errors is hard.

    To automate it, we can enhance builds (stored as K8s secrets) to include an additional field like expiration_date. We can write another K8s controller that watches and reruns expired builds.

    A build is considered expired when current_date > expiration_date and its completion date is not set.
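
Here is a rough sketch of what the checkpointing wrapper from item 1 could look like, using a ConfigMap as the checkpoint store rather than a new Checkpoint custom resource (Brigade does not use CRDs today). The DurableJob name mirrors the inline comment above; the ConfigMap naming, the @kubernetes/client-node calls (0.x-style signatures), and the assumption that the worker's service account may read and create ConfigMaps are all illustrative assumptions, not an existing implementation:

const { Job } = require("brigadier");
const k8s = require("@kubernetes/client-node");

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

// Hypothetical wrapper; `key` must be a valid Kubernetes object name
// (lowercase alphanumerics and dashes).
class DurableJob {
  constructor(job, { key, namespace = "default" }) {
    this.job = job;
    this.key = key;
    this.namespace = namespace;
  }

  async run() {
    const name = `checkpoint-${this.key}`;

    // If a checkpoint already exists for this key, return the stored result
    // instead of creating a new job pod.
    try {
      const existing = await core.readNamespacedConfigMap(name, this.namespace);
      return existing.body.data.result;
    } catch (e) {
      // Assume "not found" and fall through to actually run the job.
    }

    const result = await this.job.run();
    const output = result.toString();

    // Record the output so a rerun of the same build can skip this job.
    await core.createNamespacedConfigMap(this.namespace, {
      metadata: { name },
      data: { result: output },
    });
    return output;
  }
}

// Usage, mirroring the inline sketch above (project/prNumber are illustrative):
// const test = new DurableJob(new Job("test", "node:12", ["npm test"]),
//                             { key: `${project}-${prNumber}-test` });
// const res = await test.run();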

krancour commented 5 years ago

This can be a js func/class to wrap Job that first checks existence of specific K8s object(can be a custom resource like Checkpoint) for the unique key for the job.

I haven't thought through this all the way yet, but instead of introducing a new kind of resource type (at the moment, Brigade doesn't use any CRDs) or even a new resource of an existing type (e.g. "checkpoint" encoded in a secret) we should think about what kind of job status can already be inferred from existing resources. Job pods, for instance, stick around after completion. So, could it possibly be enough that when a worker goes to execute a given job for a given build, it checks first to see if such a pod already exists? If it exists and has completed, some status can be inferred. If it exists and is still running, it could wait for it to complete as if it had launched the job itself. If it doesn't exist, then go ahead and launch it.

Again, I haven't thought through all the details here. My suggestion is just to see what kind of mileage we can get out of all the existing resources in play before adding any new ones to the mix.
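
To make that idea concrete, here is a minimal sketch of the check a worker could perform before launching a job, assuming a `<jobName>-<buildID>` pod naming convention; that convention, the runOrSkip helper, and the 0.x-style @kubernetes/client-node calls are assumptions for illustration, not Brigade's actual implementation:

const k8s = require("@kubernetes/client-node");

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

// `runOrSkip` and the pod naming below are assumptions for illustration only.
async function runOrSkip(job, buildID, namespace = "default") {
  const podName = `${job.name}-${buildID}`;

  let pod = null;
  try {
    pod = (await core.readNamespacedPod(podName, namespace)).body;
  } catch (e) {
    // Treat lookup errors (e.g. 404) as "no prior pod for this job".
  }

  if (!pod) {
    return job.run();              // never run before: launch as usual
  }
  if (pod.status && pod.status.phase === "Succeeded") {
    return null;                   // already completed: skip re-running it
  }
  // The pod exists but is still running or has failed; a real implementation
  // would wait on it, or decide whether it is safe to re-launch.
  return job.run();
}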

krancour commented 4 years ago

Closing this. Please see rationale in https://github.com/brigadecore/brigade/issues/995#issuecomment-642196417.

krancour commented 4 years ago

I am re-opening this issue because, after ruling this out of scope for the forthcoming Brigade 2.0 proposal due to technical constraints, I've discovered a realistic avenue to achieving it if we accept a minor compromise.

There have been two big technical limitations at work here: one is that Brigade itself doesn't understand your workflow definitions (only the worker image does, and those are customizable); the other is that restoring the shared state of the overall workflow to a correct/consistent state before resuming where it left off was not a realistic possibility without first relying on some kind of layered file system (a very big undertaking).

These can be addressed by imposing two requirements on projects that wish to take advantage of some kind of "resume" functionality: 1. it works for "stateless" workflows only (i.e. those that do not involve a workspace shared among jobs; externalizing state is OK), and 2. projects have to opt in to the "resume" functionality. Under these conditions, we could safely retry handling of a failed event and, while doing so, bypass any job whose status is already recorded as succeeded.

shmargum commented 4 years ago

Not sure if this really belongs here, but in addition to being able to restart failed pipelines it would be nice to be able to re-run successful pipelines. Having this functionality exposed via the Kashti UI might also be nice :)

I am more interested in being able to easily restart entire pipelines, since I expect I can just chain pipelines to achieve some intermediate checkpoint if I want.

krancour commented 4 years ago

This is well covered by the 2.0 proposal, which has been ratified and is now guiding the 2.0 development effort. It probably doesn't make sense to track this as a discrete issue anymore.