Ability to re-trigger failed build with the same input versions

davewalter commented 8 years ago

When using the new version: every configuration for a get task, it is possible to arrive at a state where you have multiple builds of the same job running at the same time. If an earlier build fails, there is no way to re-trigger it with the same set of inputs. We haven't been able to determine a useful workaround; setting serial: true doesn't really help in this scenario, because the next build will start as soon as the first one fails.

It would be helpful if there were a way to re-trigger the job with the same inputs as a particular build (failed or otherwise).

Let us know if you need more details on this scenario or our desired fix. Thanks!

@davewalter and @rmasand

vito commented 8 years ago

We're thinking about splitting today's trigger build button (+). It is primarily used for three things today:

1. Impatience: I just pushed something or know that someone just published something, and I want the build to run now.
2. Retrying a build (this issue): I want to re-run the current build, either to see if it's flaky or to retry because something outside the build failed (e.g. github, a deployment, etc.).
3. Triggering a job that only ever manually runs, e.g. shipping a product after you've written release notes.

The flaw with case 1 is that there's a race condition. In the time between you loading the page and clicking the +, Concourse may have already found your stuff and queued a build. Now you have two, which is annoying.

The flaw with case 2 is you can only do it with the latest build, and also if you triggered a bunch, new versions may come in, potentially invalidating your flakiness trial. You could set version to a particular version in your pipeline, but that's annoying.

Case 3 pretty much works, but you don't know what versions it'll use until you run it. See https://github.com/concourse/concourse/issues/269

So, I think we should split + into two buttons. One that lives on the job, "sync", which will make sure everything's up-to-date and then queue up a build if it should (i.e. one's not queued already; same semantics as auto-triggering). The other button would be associated with a particular build of the job, and would re-trigger it with the same inputs. This covers cases 1 and 2.

The third case needs some more thinking since a "sync" button alone doesn't intuitively seem like enough given that the build only manually triggers.
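To make the proposed split concrete, here is a minimal Python sketch of how the two buttons might behave. The `Job` and `Build` classes, their fields, and the rule that "sync" skips versions that were already built are assumptions for illustration only; none of this is Concourse's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Build:
    number: int
    inputs: Dict[str, str]      # resource name -> version used
    status: str = "pending"     # pending | succeeded | failed


@dataclass
class Job:
    builds: List[Build] = field(default_factory=list)

    def _next_number(self) -> int:
        return len(self.builds) + 1

    def sync(self, latest: Dict[str, str]) -> Optional[Build]:
        """Job-level button: queue a build only if it should run."""
        # Never queue a second build while one is already pending.
        if any(b.status == "pending" for b in self.builds):
            return None
        # Assumption: nothing to do if the latest versions were already built.
        if self.builds and self.builds[-1].inputs == latest:
            return None
        build = Build(self._next_number(), dict(latest))
        self.builds.append(build)
        return build

    def retrigger(self, number: int) -> Build:
        """Build-level button: new build with the exact inputs of an old one."""
        original = next(b for b in self.builds if b.number == number)
        build = Build(self._next_number(), dict(original.inputs))
        self.builds.append(build)
        return build
```

In this model, clicking "sync" twice in a row cannot produce two builds, which addresses the race in case 1, and `retrigger` ignores any newer versions, which addresses case 2.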

endzyme commented 8 years ago

+1 this would be a great, and much needed, feature for Concourse CI

charlieoleary commented 7 years ago

+1 As well, this is a pretty critical feature. We've sort of circumvented it with empty commits (since we're using the PR resource), but it would be ideal to simply retry a failed job with the same inputs.

ahelal commented 7 years ago

Would love to see that. We are doing crazy stuff to try to trigger old commits. Is this open for external people to help? And if so, how?

primalmotion commented 7 years ago

same thing here, I feel that concourse has everything to be able to do this fairly easily. It would work perfectly with the pull request resource.

tracker-common commented 7 years ago

+1000000 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍

gabro commented 7 years ago

Same here, if you guys are busy can we help somehow?

miromode commented 7 years ago

+1

kei-yamazaki commented 7 years ago

👍

timrchavez commented 7 years ago

:+1: This would go a long way to making Concourse a more viable choice for us.

ls-yann-david commented 7 years ago

Pleaseeeeee 👍

VanAxe commented 7 years ago

Retriggering jobs would be a lot more elegant than empty commits! :stuck_out_tongue_closed_eyes: :+1:

olhtbr commented 7 years ago

👍

tsantero commented 7 years ago

is this issue currently being actively worked on internally? I need this asap, but also don't want to duplicate work. similarly, are there any contributor guidelines I should read?

jtarchie commented 7 years ago

@tsantero, there are contributing guidelines.

clarafu commented 7 years ago

We noticed that a lot of people that want this feature are using the PR resource. We created a separate issue for some remodelling for multi-branch workflows in #1172 and it would be helpful if people that are interested could leave some comments on it, so we can better understand the need for both, and also figure out who needs this current issue for other reasons. Your feedback will determine which area we opt to focus on first.

davewalter commented 7 years ago

The original reason for this request was that we manage multiple "environments" (pool resource) that we use to deploy CF, and we have an automated pipeline that cleanses and prepares each environment for use in other pipelines. At any given time, we could have multiple environments going through the same pipeline and, depending on timing and/or build failures, environments can get "lost" in the pipeline as they have been overtaken by later builds. At that point, our only option is to manually un-claim the environment in the pool and let it start from the beginning again. Ideally, in the case of a failure, we would be able to re-trigger the failed job as it was originally run.

The bigger problem as I see it is how to solve the issue of builds of the same job running and the one that started second finishing before the one that started first. In this case, even with version: every turned on, the first build never triggers the next job in the pipeline. We would also get into this situation if the first job failed and we were able to re-trigger it to get it to go green.

jtarchie commented 7 years ago

@clarafu, I'm with @davewalter. I think these are two separate issues.

The workflow of the feature branches does not necessarily solve the problem of retriggering a build. It would be a nice-to-have along with it.

@davewalter has a great use case where the idempotent explicitness of concourse doesn't work for automation.

That being said, as the creator of the PR resource, this issue still has my vote. #1172 seems very use case specific.

joaogbcravo commented 7 years ago

@clarafu I'm not using concourse yet, because of this particular issue.

I want to use concourse for continuous delivery. I would like to have a "button" to deploy to production. If later I need to do a rollback I would like to re-trigger an old successful "build", using all the versions used on it, to redeploy those same versions on production again.

This "feature" exists in many other CI/CD tools, and for someone (like me) that wants to migrate from them to concourse, it's kind of a deal breaker.

thewoolleyman commented 7 years ago

@clarafu Point number 2 mentioned in the comment above (retrying flaky or failed build steps) is a primary reason this functionality is needed independent of any other feature to address multiple branch handling.

Sometimes build steps just fail intermittently and you need to re-run them. This isn't a defense of flaky builds; you may be actively working on addressing the root cause of the intermittency or flakiness, but can't afford to have your pipeline grind to a halt in the meantime. This is especially true of very large and mature browser integration test suites, which can suck up limitless pair weeks trying to squash all flakiness, and in some cases it's not even something you can fix (e.g. newly-introduced browser or webdriver bugs).

If you have pipelines with many large, long-running, highly-parallelized test suites, having to re-run the entire pipeline if one single spec flakes out is unacceptable - especially if it's currently failing somewhat frequently. In this scenario, Concourse seems unusable when compared to other CI tools which have this support - where you can just retry the single step, and the pipeline continues on when it passes.

clarafu commented 7 years ago

For those who haven't seen @vito's comment in #1172, I'll sort of reiterate what vito said. We created #1172 instead of just implementing the ability to re-trigger a single build because looking beyond that one retriggered job, things get pretty confusing. For example, re-triggering a build in the middle of the pipeline won't result in the rest of the pipeline running if it's already run with a more recent PR. Even if you have applied version: every, the current pipeline semantics will not let you go back in time and rerun older versions than what's already been tested. @vito has a more detailed explanation here if you still have questions: https://github.com/concourse/concourse/issues/1172#issuecomment-307224634.

To that point, we are trying to draw up a better solution with #1172 that involves sort of "instances" of a pipeline forked off by every build of a job. This will happen, for example, if you specify forked: true on a job, which will create a new instance of a pipeline for every build that runs from that job and pin down the versions of resources that the build started with. That way, if the job fails in that pipeline instance, you can re-trigger the job and it will run with the pinned down set of inputs and run the rest of the jobs in the forked instance after it goes green. Only the version of the inputs to the "forked" job will be pinned down, and the rest of the jobs will determine their inputs as normal (i.e. latest available candidates). Therefore you would use passed constraints to pin anything beyond the first job, just like you would do in a regular pipeline. In addition, trigger would only apply to resources that came out of the "forked" job, to prevent old instances from constantly running.

This is currently only our initial draft idea, it still needs a lot of adjusting but we'd like to know if this would satisfy the use cases for this issue?

joaogbcravo commented 7 years ago

@clarafu Thanks for the heads up.

In my understanding, the way you are thinking on this will not solve the use case that I expose in https://github.com/concourse/concourse/issues/413#issuecomment-306932392

Correct me if I'm wrong.

clarafu commented 7 years ago

@joaogbcravo I think it would because if you specify forked: true on the job that does the deploy to production, it will create an instance for every build of that job and you can retrigger a previous successful build from that forked instance (and it will use the same versions of inputs from that previous successful build). Does that make sense or is that just as confusing haha?

joaogbcravo commented 7 years ago

@clarafu It makes sense. I thought you were limiting this to failed builds with what you said: "That way, if the job fails".

It would be nice if you could specify another property for all the "downstream" jobs of the "forked" job so that, if a forked job runs, the following downstream jobs will also be retriggered, with their input resources updated from the forked job run's outputs. Does that make sense?

clarafu commented 7 years ago

Oh, if those downstream jobs have already run, then if you retrigger a successful build, the downstream jobs will not retrigger automatically again. It seems like your use case is more closely related to #92? I think even if we supported #92, it wouldn't be related to or dependent on our proposal here. If you like you can present your use case in another issue, but judging by #92 we can't promise much. :(

joaogbcravo commented 7 years ago

Well I would say it's a mix.

leshik commented 6 years ago

What can the workaround be for now, except for pushing empty commits?

ships commented 6 years ago

@leshik i have had some success in the past leveraging the "disable version" feature in concourse: if you thread your source code along with any assets generated in the pipeline, and the deploy job does a "get" of it with a passed constraint even if the deploy only needs the assets themselves, the deploy will only use versions of the asset built along with the non-disabled versions of source code. this is not super convenient in the UI but it is representative of the thing you're trying to state: "i have discovered a problem with this deploy and i need to reject this version as legitimate". Then when your deploy kicks off again, it will choose the previous version.

there are downsides to this of course as pipeline complexity increases, so it's not really a scalable solution.

da1nerd commented 6 years ago

@leshik for now I'm disabling versions of the resource in concourse then triggering the build again. e.g.

In this particular case I wanted to disable all current PRs except for one. However, you only need to disable the versions that are more recent than the one you want to keep (working from the top down).

Now I can trigger my pipeline again with the + button and I get the expected input.
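The "top down" rule above can be sketched in a few lines of Python. The helper name `versions_to_disable` and the version labels are hypothetical, and the list is assumed to be ordered newest first, as the resource page shows it.

```python
from typing import List


def versions_to_disable(newest_first: List[str], keep: str) -> List[str]:
    """Return the versions listed above `keep`, i.e. everything newer than it."""
    if keep not in newest_first:
        raise ValueError(f"version {keep!r} not found")
    return newest_first[: newest_first.index(keep)]
```

For example, with versions `["pr-9", "pr-8", "pr-7", "pr-6"]` and `keep="pr-7"`, only `pr-9` and `pr-8` need to be disabled; older versions can stay enabled because the job picks the newest enabled one.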

arwineap commented 6 years ago

I wrote a gem to help query against the api; then some scripts to automate the process of pausing jobs, resource versions, then triggering and unpausing

Hopefully spaces will ease our pains on this front

simonjohansson commented 6 years ago

@arwineap I would be very interested to have a look at your scripts. We desperately need this feature

anneschuth commented 6 years ago

We'd really really like this. Is there any progress on this?

arwineap commented 6 years ago

@simonjohansson realized that my existing job did this a different way, but given that half the work was done I made another script https://gist.github.com/arwineap/3ce8a4c4084b33cc5fd527c871d42c1a

I run on 3.14, but I think it should work on updated versions too. It depends on having basic auth enabled, and the following upstream api endpoints:

GetBuildPlan
GetJobBuild
PauseResource
ListResourceVersions
DisableResourceVersion
CreateJobBuild
GetBuild
EnableResourceVersion
UnpauseResource
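As a rough illustration of how those endpoints could fit together, here is a Python sketch of the pause / disable / trigger / restore sequence. It is not the linked script: the `RecordingAPI` class is a hypothetical in-memory stand-in whose method names merely mirror the endpoint names above, `wait_for_build` stands in for polling GetBuild, the ordering of the steps is an assumption, and the desired inputs (which the real script would presumably read via GetJobBuild / GetBuildPlan) are passed in directly.

```python
from typing import Dict, List, Tuple


class RecordingAPI:
    """Hypothetical in-memory client; it only records the calls it receives."""

    def __init__(self, versions: Dict[str, List[str]]) -> None:
        self.versions = versions          # resource -> versions, newest first
        self.calls: List[Tuple[str, ...]] = []

    def pause_resource(self, resource: str) -> None:
        self.calls.append(("pause", resource))

    def list_resource_versions(self, resource: str) -> List[str]:
        self.calls.append(("list", resource))
        return list(self.versions[resource])

    def disable_resource_version(self, resource: str, version: str) -> None:
        self.calls.append(("disable", resource, version))

    def create_job_build(self, job: str) -> int:
        self.calls.append(("create", job))
        return 1

    def wait_for_build(self, build_id: int) -> None:
        self.calls.append(("wait", str(build_id)))

    def enable_resource_version(self, resource: str, version: str) -> None:
        self.calls.append(("enable", resource, version))

    def unpause_resource(self, resource: str) -> None:
        self.calls.append(("unpause", resource))


def retrigger_with_inputs(api: RecordingAPI, job: str, inputs: Dict[str, str]) -> int:
    """Run `job` so that it picks up `inputs` (resource name -> wanted version)."""
    paused: List[str] = []
    disabled: List[Tuple[str, str]] = []
    try:
        for resource, wanted in inputs.items():
            api.pause_resource(resource)      # stop new versions from arriving
            paused.append(resource)
            for version in api.list_resource_versions(resource):
                if version == wanted:
                    break                     # everything below is older
                api.disable_resource_version(resource, version)
                disabled.append((resource, version))
        build_id = api.create_job_build(job)
        api.wait_for_build(build_id)          # keep versions disabled until done
        return build_id
    finally:
        # Always restore the original state, even if a step above failed.
        for resource, version in disabled:
            api.enable_resource_version(resource, version)
        for resource in paused:
            api.unpause_resource(resource)
```

The `finally` block is the important design choice here: forgetting to re-enable versions or unpause resources leaves the pipeline in a surprising state, which is exactly the pain a built-in re-trigger would remove.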

anneschuth commented 6 years ago

Thanks @arwineap! Where would you run this script? From the cli on your dev machine?

arwineap commented 6 years ago

Yes, anywhere with http access to the atc endpoint; there are also some environment variables you need to set that I apparently didn't mention

concourse_url
concourse_team
concourse_user
concourse_pass

pipeline_name
job_name
job_number
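If you wrap the script yourself, it may help to fail early when one of these variables is missing. The following Python sketch is only an illustration: the `load_settings` helper is hypothetical and not part of the gist, and it assumes the variable names are used exactly as listed above.

```python
import os
from typing import Dict, Mapping

# Variable names copied from the list above.
REQUIRED = (
    "concourse_url", "concourse_team", "concourse_user", "concourse_pass",
    "pipeline_name", "job_name", "job_number",
)


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the required variables, naming any that are unset or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```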

leshik commented 6 years ago

In v4 there is no more basic auth, so the last working version is 3.14.

vito commented 6 years ago

@leshik you can still log in with user/password non-interactively:

fly login -u foo -p bar

But the script probably needs to be updated.

arwineap commented 6 years ago

Updated the gem to support local user auth in concourse 4, the script should be working

linusguan commented 5 years ago

"The other button would be associated with a particular build of the job, and would re-trigger it with the same inputs."

This is incredibly useful. We will be able to re-deploy or roll back with this functionality.

vito commented 5 years ago

This is top-of-the-backlog now as we'll need it for https://github.com/concourse/concourse/issues/3602. The days of pinning and re-triggering and forgetting to un-pin are numbered!

Note: we might go for a quick-and-dirty version of this which directly replaces the build being re-triggered. In the future we'll want to keep track of each run of the build, but for the sake of unblocking #3602 quickly I think we should just start with this minimum viable solution as it has little to no implications on the UI/build ordering/etc.

vito commented 5 years ago

Notes from IPM:

- Preserve inputs on re-triggered build (duh) - make sure the scheduler/build starter doesn't re-compute them or use next_build_inputs
- We'll need a button in the UI for re-triggering (@Lindsayauchin), in addition to the existing trigger-build button
- We'll need to clear out the build events for the old build and "re-set" the build back to pending state (and reset whatever other build data is appropriate, i.e. start/end time)
- Also: clear out any outputs for the build, otherwise a re-trigger from green to red could result in "phantom outputs" satisfying passed constraints
- Re-compute build plan based on current pipeline config
  - If anyone is worried about this let us know - it's way easier to implement this way, and we figure there may be cases where a pipeline config change was made to fix the errant build anyway, in which case we'd want to pick up the new config.
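The in-place reset described in these notes can be pictured with a small Python sketch. The `Build` fields and the `reset_for_retrigger` helper are hypothetical and only illustrate which data is kept and which is cleared; they do not reflect Concourse's actual schema or code.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Build:
    id: int
    inputs: Dict[str, str]                  # resource name -> version
    plan: str                               # stand-in for the build plan
    status: str = "pending"
    outputs: Dict[str, str] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)
    start_time: Optional[float] = None
    end_time: Optional[float] = None


def reset_for_retrigger(build: Build, current_plan: str) -> Build:
    """Return the same build, reset to pending, with its inputs untouched."""
    return replace(
        build,
        plan=current_plan,   # re-computed from the current pipeline config
        status="pending",
        outputs={},          # avoid "phantom outputs" satisfying passed constraints
        events=[],           # old build events are cleared
        start_time=None,
        end_time=None,
    )
```

Note that `id` and `inputs` are the only fields carried over unchanged, which is what makes this a replacement of the old build rather than a new one.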

We'll also spike on creating a new build instead of replacing it. They may actually be roughly the same difficulty. If we do this instead, we don't have to reset anything or clear out any outputs/etc.

StevenArmstrong commented 5 years ago

Could the trigger buttons be moved out from under the job and put on the main page somehow? It could have a trigger icon with a prompt or something? I say this because one of the main pieces of feedback our product lead gets about our concourse pipelines is that users find the interface awkward and complain that they have to click through the green/red square of a job to trigger it. To mitigate this we have had to create a job called trigger for production deployments so users know how to trigger a production deployment. As a result of the clicking around we have even had requests to build a UI on top of concourse to make it easier and more dev friendly for roll forward and rollback triggers :(

vito commented 5 years ago

@StevenArmstrong So having a specific 'retrigger' button on failed builds in the pipeline? Sounds reasonable, though I would argue people should probably be clicking into the build first and understanding the failure rather than blindly re-triggering. :thinking: In any case we may want to discuss that as a separate issue that we can address after we take our first crack at this. :)

hstenzel commented 5 years ago

One question I have is how we can retrigger a build if we no longer have the log from the original?

vito commented 5 years ago

@hstenzel Build logs are purely cosmetic, the actual information regarding which versions/etc. are used is kept in the database and so re-triggering will still work.

hstenzel commented 5 years ago

Perhaps I'm missing something then. How would I retrigger a build for which I no longer have logs if the UI element is on the log screen?

vito commented 5 years ago

The only thing removed when logs are reaped are the build logs below the build header. The rest of the UI is still there and shows the build status and such.

StevenArmstrong commented 5 years ago

@vito it wasn't really just for retrigger. It was more if you are doing a redesign on what the + buttons or other buttons do it would be really good if they weren't nested under the job as users have said they find it confusing having to click through to manually initiate a new build when deploying to production. Instead I was suggesting having them a layer up as an icon on the pipeline page beside each job. You could then have a confirmation pop up if someone clicks it by mistake to confirm to build with latest or retrigger or cancel. This way you wouldn't need multiple buttons simply 1 button to build with multiple sub options without having to click through to the log to trigger anything. It's something that is frequently fed back about the UI from our users.

Lindsayauchin commented 5 years ago

@StevenArmstrong interesting idea. I think that, from the pipelines we have observed (like the RabbitMQ team's at Pivotal, below), an action button to trigger a job on the pipeline page is just not scalable.

[Screenshot: Screen Shot 2019-05-13 at 5 34 04 PM]

We are thinking about the user pains around triggering a build with the work being done on the resource version. You can follow a related issue https://github.com/concourse/concourse/issues/3403 to see our progress on the UX changes.

hstenzel commented 5 years ago

If I understand correctly, to retrigger a specific job I'd scroll left/right on the build page to find the correct job?

Also, I'd potentially want to retrigger a successful job too, thinking about the case of rebuilding an artifact that was accidentally lost.

Perhaps these questions are really more about the UI related to the feature.
