One thing in particular that I've found missing from Brigade is the ability to rerun failed builds. With CI tools like Buildkite, if a build fails because of an intermittent issue (say a DNS resolution fails and docker push doesn't work, causing the build to fail), I can hit "retry" and it does just that. With Brigade, that functionality doesn't exist -- the only way to retry a build is to commit something again.
I'm a big +1 on having the ability to retry builds. Whether that means full durability support is beyond me, but being able to retry things that have failed without having to find something to commit in the repo would be huge for me.
@blakestoddard Thanks for chiming in! Yeah, that's a good point.
From my perspective, implementing durability into Brigade at the build level like that would make sense to me, assuming:
We currently mark each build as "processed" (hence not needing to be retried later) as long as the brigade controller is able to create the pod, but it doesn't wait until the pod exits without an error.
I believe we can make syncSecret async and wrap it in a control loop, so that brigade periodically tries to run any unfinished build that has no on-going worker pod, until the build finally finishes with a successful status.
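To illustrate the shape of that control loop (sketch only; listUnfinishedBuilds, hasRunningWorkerPod, and resubmitBuild are hypothetical stand-ins for the controller's real K8s interactions):

const INTERVAL_MS = 30 * 1000;

async function reconcileOnce() {
  // Builds not yet finished with a successful status (hypothetical helper).
  const builds = await listUnfinishedBuilds();
  for (const build of builds) {
    if (await hasRunningWorkerPod(build)) {
      continue; // a worker pod is already handling this build
    }
    await resubmitBuild(build); // relaunch a worker pod for the unfinished build
  }
}

// Periodically reconcile until every build finishes successfully.
setInterval(() => reconcileOnce().catch(console.error), INTERVAL_MS);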
@technosophos @adamreese Would this change make sense to you, within the scope of Brigade?
Even if this feature existed, I'd prefer using another workflow engine for a complex workflow composed of multiple brigade scripts. My suggestion here covers "at-least-once build run" use-case only.
Since a job run returns a promise, this works well for retries (tested): https://www.npmjs.com/package/promise-retry. The only catch is that the job name needs to be different for each attempt.
const { Job } = require('brigadier');
const promiseRetry = require('promise-retry');

const MAX_ATTEMPTS = 5;

let verizonPromise = promiseRetry((retry, number) => {
  console.log(`Verizon job attempt ${number}`);
  // Job names must be unique, so suffix the attempt number.
  var verizonJob = new Job(`vz-attempt${number}`, 'my-verizon-image');
  return verizonJob.run().catch(retry);
}, {
  retries: MAX_ATTEMPTS,
  factor: 1,
  minTimeout: 500
});

verizonPromise.catch((err) => {
  console.error(`Verizon job unsuccessful after ${MAX_ATTEMPTS} attempts, aborting workflow`);
  console.error(err);
  process.exit(1);
});
Oh! That is an interesting strategy I had never considered! I wonder if it makes sense to include promise-retry as a core library for Brigade.
Probably. From the user's perspective, I think that having a property Job.attempts (1 by default) when defining the job would be a convenient interface. The other retry options seem to be designed mainly for HTTP calls.
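Something along these lines, as a rough sketch of that interface (hypothetical; attempts is not part of brigadier today, and the helper simply wraps Job.run() with promise-retry):

const { Job } = require('brigadier');
const promiseRetry = require('promise-retry');

// Hypothetical helper: run a job up to `attempts` times in total.
function runWithAttempts(name, image, tasks, attempts = 1) {
  return promiseRetry((retry, number) => {
    // Job names must be unique, so suffix the attempt number.
    const job = new Job(`${name}-attempt${number}`, image, tasks);
    return job.run().catch(retry);
  }, { retries: attempts - 1, factor: 1, minTimeout: 500 });
}

// Usage: roughly what a Job with attempts = 3 would behave like.
runWithAttempts('tests', 'my-test-image', ['make test'], 3);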
I can attempt a pull request.
That would be good. I don't think jobs should do this by default, but I would love to see it as an easy add-on in pipelines for those cases where this is the desired behavior.
/cc @vdice
Retries aside, it seems there is also interest here in resuming a pipeline where it left off if a worker dies mid-pipeline. Is that right?
And therefore, even when the brigade worker fails in the middle of the workflow, the restarted worker should continue the workflow from where it left off.
I've implemented this in other systems. Realistically, this cannot really be accomplished without major architectural changes and the introduction of a dependency on some kind of message-oriented middleware.
I am curious, however, how common an occurrence it is for workers to fail mid-pipeline. Is it frequent for you, and if so, do you know why? I'm in no way arguing against building for failure, but optimizations that introduce major architectural changes and significant new dependencies aren't things to enter into lightly, so I'm curious to see if we can get more bang for the buck by treating the root cause of worker failures.
Generally speaking, much of the design of Kubernetes considers pods to be somewhat fleeting entities which can easily fail. Brigade, being designed for Kubernetes, would do well to also consider them as such.
Possibly related: #977
Dumping my memory here, as this issue was featured in today's Brigade meeting.
I think we have several things that can be done within the scope of this issue:
A js func/class to wrap Job that first checks for the existence of a specific K8s object (which can be a custom resource like Checkpoint) under a unique key for the job. If it exists, consider the job as already run, skip creating the Job pod, and instead pull the previous result from the K8s object.
// DurableJob? CheckpointedJob?
var test = new DurableJob(underlyingJob, {key: `${project}-${prNumber}-test`})
var res = test.run()
// `test.run` either creates a pod as usual, or, if a `Checkpoint` resource whose metadata.name is `${project}-${prNumber}-test` exists, consults it to obtain the `res`
With this, you can rerun your build without worrying too much about duplicated results or wasted cluster resources.
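A rough sketch of how such a wrapper could behave (readCheckpoint and writeCheckpoint are hypothetical helpers for reading/writing the Checkpoint-like K8s object):

const { Job } = require('brigadier');

class DurableJob {
  constructor(job, opts) {
    this.job = job;
    this.key = opts.key; // e.g. `${project}-${prNumber}-test`
  }

  run() {
    // Hypothetical helper: fetch a previously stored result, if any.
    return readCheckpoint(this.key).then((prev) => {
      if (prev) {
        // A previous build already ran this job: skip the pod, reuse the result.
        return prev.result;
      }
      // Otherwise run as usual and record the result for future reruns.
      return this.job.run().then((res) =>
        writeCheckpoint(this.key, { result: res }).then(() => res)
      );
    });
  }
}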
Manually detecting and rerunning builds that failed due to transient errors is hard. To automate it, we can enhance builds (stored as K8s secrets) to include an additional field like expiration_date, and write another K8s controller that watches for and reruns expired builds. A build is considered expired when the current date is past its expiration_date and its completion date is not set.
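For illustration, the controller's expiry check could be as simple as this (field names are hypothetical):

// Hypothetical build shape: { expiration_date: ISO string, completion_date?: ISO string }
function isExpired(build, now = new Date()) {
  const expiresAt = new Date(build.expiration_date);
  // Expired = the expiration date has passed and the build never completed.
  return now > expiresAt && !build.completion_date;
}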
This can be a js func/class to wrap Job that first checks the existence of a specific K8s object (can be a custom resource like Checkpoint) for the unique key for the job.
I haven't thought through this all the way yet, but instead of introducing a new kind of resource type (at the moment, Brigade doesn't use any CRDs) or even a new resource of an existing type (e.g. "checkpoint" encoded in a secret) we should think about what kind of job status can already be inferred from existing resources. Job pods, for instance, stick around after completion. So, could it possibly be enough that when a worker goes to execute a given job for a given build, it checks first to see if such a pod already exists? If it exists and has completed, some status can be inferred. If it exists and is still running, it could wait for it to complete as if it had launched the job itself. If it doesn't exist, then go ahead and launch it.
Again, I haven't thought through all the details here. My suggestion is just to see what kind of mileage we can get out of all the existing resources in play before adding any new ones to the mix.
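To make that concrete, the worker-side check might look roughly like this (getJobPod, podResult, and waitForPod are hypothetical helpers over the existing K8s resources):

async function runJobIdempotently(job) {
  const existing = await getJobPod(job); // look up a pod for this job + build (hypothetical)
  if (!existing) {
    return job.run(); // no prior attempt: launch the job pod as usual
  }
  if (existing.status.phase === 'Succeeded') {
    return podResult(existing); // already completed: infer/reuse its outcome
  }
  return waitForPod(existing); // still running: adopt it and wait for completion
}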
Closing this. Please see rationale in https://github.com/brigadecore/brigade/issues/995#issuecomment-642196417.
I am re-opening this issue because, after ruling this out of scope for the forthcoming Brigade 2.0 proposal due to technical constraints, I've discovered a realistic avenue to achieving it if we accept a minor compromise.
There have been two big technical limitations at work here-- one being that Brigade itself doesn't understand your workflow definitions (only the worker image does-- and those are customizable) and the other being that restoring shared state of the overall workflow to a correct / consistent state prior to resuming where a workflow left off was also not a realistic possibility without first relying on some kind of layered file system (a very big undertaking).
These can be addressed by imposing two requirements on projects that wish to take advantage of some kind of "resume" functionality-- 1. it works for "stateless" workflows only (e.g. those that do not involve a workspace shared among jobs; externalizing state is ok) and 2. projects have to opt-in to the "resume" functionality. Under these conditions, we could safely retry handling of a failed event and whilst doing so bypass any job whose status is already recorded as succeeded.
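In code terms, the retry path would roughly amount to a guard like this when re-handling the event (sketch only; jobStatuses stands in for whatever record of job outcomes Brigade keeps):

// Opt-in "resume": when retrying a failed event, skip jobs already recorded as succeeded.
async function runOrSkip(job, jobStatuses) {
  if (jobStatuses[job.name] === 'SUCCEEDED') {
    return; // previously succeeded: bypass on the retried event
  }
  return job.run(); // otherwise run (or re-run) the job
}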
Not sure if this really belongs here, but in addition to being able to restart failed pipelines it would be nice to be able to re-run successful pipelines. Having this functionality exposed via the Kashti UI might also be nice :)
I am more interested in being able to easily restart entire pipelines, since I expect that I can just chain pipelines to achieve some intermediate checkpoint if I want.
This is well covered by the 2.0 proposal, which has been ratified and is now guiding the 2.0 development effort. It probably doesn't make sense to track this as a discrete issue anymore.
Extracted from https://github.com/Azure/brigade/issues/125#issuecomment-370974394
First of all, I'm not saying that we'd need to bake a workflow engine into Brigade.
But I just wanted to discuss how we could achieve this use-case in both shorter term and longer term.
The interim solution can be implementing something on top of Brigade, and/or collaborating with other OSS projects.
Problem
Suppose a brigade.js corresponds to a "workflow" composed of one or more jobs. It can be said to be "durable" when it survives pod/node/brigade failures. This characteristic - "durability" - is useful when the total time required to run the workflow from start to finish is considerably long. Therefore, even when the brigade worker fails in the middle of the workflow, the restarted worker should continue the workflow from where it left off. This use-case is typically achieved via a so-called (durable) workflow engine.
Think of a workflow engine as a stateful service that drives your DAG of jobs to completion.
From a Brigade user's perspective, if Brigade somehow achieved durability, no GitHub PR status would remain pending forever when something fails, and no time-consuming job (like running an integration test suite) would be rerun when a workflow is restarted.
Possible solutions
I have two possible solutions in my mind today.
1. We can implement a light-weight workflow-engine-like thing inside each brigade gateway, but I feel like that is just reinventing the wheel.
2. We can decide that Brigade's scope does not include a durable workflow engine, and instead investigate possible integrations with another workflow engine to provide durability to Brigade scripts.
Although configuring the integration would be a mess, I slightly prefer 2, which keeps Brigade doing one thing very well - scripting, not running durable workflows!
I guess the possible integration may end up including:
- brig run $project -e $event_as_you_like as the way for the workflow engine to trigger Brigade.
- Making brig run idempotent, so that the workflow can retry it whenever necessary.
- Splitting the script into events.on('step1', ...) and events.on('step2', ...) so that the workflow can retry step1 and step2 independently (see the sketch below).
- Running brig run -e build_failed on a notification step, then brigade runs events.on('build_failed') to mark the PR status failed.
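For example, splitting the script per step so the external engine can retry steps independently might look like this (image names and the notification script are hypothetical):

const { events, Job } = require('brigadier');

// Each step is its own event handler, so the workflow engine can retry
// `brig run -e step1` or `brig run -e step2` independently.
events.on('step1', (e, project) => {
  return new Job('step1', 'my-build-image', ['make build']).run();
});

events.on('step2', (e, project) => {
  return new Job('step2', 'my-test-image', ['make test']).run();
});

// Run by the workflow engine on failure, e.g. `brig run -e build_failed`,
// to mark the PR status as failed.
events.on('build_failed', (e, project) => {
  return new Job('notify', 'my-notify-image', ['./mark-pr-failed.sh']).run();
});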