Can not retry job in non interactive mode #1335

Closed BuildStream-Migration-Bot closed 1 year ago

See original issue on GitLab In GitLab by [Gitlab user @willsalmon] on Jun 2, 2020, 13:05

Summary

CI jobs do not attempt to rebuild after failure, ether from a local artifact or after pulling a failed artifact.

if this was run locally you could ask for a rebuild interactively but if bst is running in non interactive mode this is not possible.

Suggestion

Add

bst --no-interactive build --retry-cached-failures ELEMENT.bst

also/or as a userconfig option

From IRC

<WSalmon> bst will cache failure, this can be good as if this happens in CI i can then pull down the artifact and look at the log. but some times things fail for intermittent reasons, if i have a runner with a "semi" persistent cache and i try to rerun the job bst will not try to rebuild, i have to go and manually clear the runner's cache. is there a way to keep the creation and pushing of artifacts but have bst rerun when it finds cached failure? and if it 
<WSalmon> dose rerun push the successful artifact over the failed on?
<WSalmon> i basiclly want to do `bst --no-interactive build --always-retry ELEMENT.bst`
<WSalmon> juergbi, benschubert ^
<juergbi> WSalmon: yes, we should add such an option. always-retry might not be the best name as it might also mean retry failures in the new session
<juergbi> and we should make sure push replaces the artifact proto on the server
<juergbi> I'm wondering whether it would make sense to implicitly expire failed artifacts
<juergbi> at least in terms of considering an element cached
<juergbi> i.e., always retry builds if the only cached artifact is an old failed build. the question would be how do we define 'old'
<juergbi> maybe even have a userconfig for this. and if you specify 0s as expiry, it will never use cached failures
<WSalmon> sounds good to me, just wanted to check we didnt have this already, ill make a issue
<WSalmon> i presume this is addional cli so dosent need to block bst2?
<juergbi> yes, I don't think this needs to block bst2
<juergbi> (but would still make sense to get done before)
<WSalmon> +1

Extension

As noted in the IRC logs dealing with the cache failure after a success full build has commenced needs to be handled gracefully.

In GitLab by [Gitlab user @willsalmon] on Jun 2, 2020, 13:05

changed the description

In GitLab by [Gitlab user @willsalmon] on Jun 2, 2020, 13:07

changed the description

In GitLab by [Gitlab user @willsalmon] on Jun 2, 2020, 13:07

changed the description

In GitLab by [Gitlab user @tpollard] on Jun 2, 2020, 13:27

changed the description

In GitLab by [Gitlab user @tpollard] on Jun 2, 2020, 13:27

This would be nice, I'd also agree with it just being user-config data

In GitLab by [Gitlab user @willsalmon] on Jun 2, 2020, 13:31

changed the description

In GitLab by [Gitlab user @cs-shadow] on Jun 2, 2020, 15:35

if this was run locally you could ask for a rebuild interactively but if bst is running in non interactive mode this is not possible.

I don't think this is true. If you have a cached failed build today, BuildStream will NOT give you a prompt. This is definitely a bug, but it's not tied to interactive mode vs non-interactive mode.

That aside, I'm not sure I'd be in favor of a specific flag for bst build when we already have a top-level bst --on-error [continue|quit|terminate] option. It might sense to support a new value for that, rather than inventing a new flag.

In GitLab by [Gitlab user @willsalmon] on Jun 2, 2020, 15:46

So can --on-error be used in this use case? would we be better with adding --on-error [continue|quit|terminate|retry]

In GitLab by [Gitlab user @juergbi] on Jun 2, 2020, 15:46

That aside, I'm not sure I'd be in favor of a specific flag for bst build when we already have a top-level bst --on-error [continue|quit|terminate] option. It might sense to support a new value for that, rather than inventing a new flag.

We may want to distinguish between retrying builds that fail in the current bst build session and retrying builds that failed in earlier sessions (cached failure). And especially for user config supporting on-error: retry is not enough and rather odd, in my opinion (e.g., how many times shall BuildStream retry).

In GitLab by [Gitlab user @cs-shadow] on Jun 22, 2020, 22:11

Sorry, this slipped through my inbox.

[Gitlab user @juergbi] I think you are right. Today's choices for --on-error generally refer to issues in the current session, so it's probably best to not mix them with the choices for cached failures. And retrying builds that failed in the current session as a rule would indeed be odd.

Initially I was suggesting --on-error retry to refer to retrying cached failures. In hindsight, this was confusing for the same reason you mentioned in the irc snippet above :)

However, --retry-cached-failures is still tricky since we might not have the build dependencies and sources for the element if we just pulled a failed artifact. This is also why we don't have a nice story for retrying cached failures even in the interactive mode. Once we solve it though, I don't think interactivs vs. non-interactive mode would be that different.

In GitLab by [Gitlab user @juergbi] on Aug 3, 2020, 07:16

mentioned in merge request !1978

However, --retry-cached-failures is still tricky since we might not have the build dependencies and sources for the element if we just pulled a failed artifact. This is also why we don't have a nice story for retrying cached failures even in the interactive mode. Once we solve it though, I don't think interactivs vs. non-interactive mode would be that different.

Ideally, we wouldn't pull a cached failure in this mode. But I guess we need to pull at least the proto to know whether it is a failure. Either way, I just tried retrying a failed build (interactively) and it just works fine, dependencies and sources were already there event hough I started from scratch, so no worry about this.

For the option name, I think it's better to call it --ignore-cached-failures rather than retry. It would also ignore them when pulling.

Perhaps it might make sense as part of generic cache pull policy which has options all/successes/none. Default could be all.

This is critical bug. We have been hitting this a lot. When runner runs out of disk space, BuildStream caches the error permanently and propagates it to developer machines and child projects.

I'm strongly in favor of the original suggestion of cached build expiry as suggested by @juergbi . We need this fixed, we're hitting this very frequently. The user config option would be simple to use and integrate into complicated CI flows. Default when value is missing should be no expiration for backwards-compat.

The utility of cached failed builds is for the purpose of debugging them as far as I can see, so bst shell and bst artifact checkout are potentially interesting for failed builds.

I think that when someone types bst build, they are expressing the intention to attempt to build, and we can dispense with the idea of adding an additional switch/knob/gwizzle/option for this purpose entirely, until such a time that there is an explicit utility to having the failed build not be retried at bst build time, if that ever occurs.

Does this sound sensible ?

That sounds fine to me. It would definitely be better than current behaviour where a single build in parent project can contaminate artifact caches with failed builds including child projects forever.

You gues are hitting CI problems were the build failed due to a "in consistent runner" or some other "in consistent behavour"

But as a use of bst that does not rely on CI as much as you gues for maintenance i have a different perspective.

When i am debugging something i will try out several different options and experiment, to try to get things working.

Often the cycle time is in 10s of minutes, if i try something that i tried before and failed then bst tells me about this and saves me 10s of minutes. I think this should be the default behaviour.

I can see that when using a bot to update things then the cycle time is less important.

I can see that we could always rebuild if we are in non interactive mode maybe but i think that would be fairly un intuitive.

I am still of the opinion i was when i first made this issue 3yrs ago that a new cli option would be the best option, then you can add this in when you want. But for the majority of cases were your works work you get the worth while feature it sounds like we are in danger of turning off.

You gues are hitting CI problems were the build failed due to a "in consistent runner" or some other "in consistent behavour"

But as a use of bst that does not rely on CI as much as you gues for maintenance i have a different perspective.

My take home from this is that when working interactively, it is more helpful to encounter a failed build early and have that feedback from the CLI, and having the option to retry the build, rather than unconditionally retrying.

I agree with this.

@wsalmonct it is very improbable to have 100% guarantee of consistent running in CI even with fully reproducible build. You can hit various random errors eg running out of disk space or any host issues. Those will be permanently cached in artifact cache and block project from working until cache is manually purged.

IMHO a significant problem here is cached failure is pulled automatically. When you have failure in local cache, it's fine that the failure persists.

In CI as well, you can typically wipe caches in runner. But if central cache is contaminated and failures are automatically pulled, it breaks every runner.

@nanonyme i agree that this is a issue. I created this issue!

If we add a flag to try to rebuild if we meet a failed build (which you may default to, maybe set in .bst2/, and i may use some times) then we fix the issue without effecting other good features.

If we don't pull failed builds by default then we lose easy local debugging when the failure was in CI.

If we default to building when there is a cached build then we lose the fail early for local or CI when we normally want it (projects with occasionally issues can add a manual job with the option and projects with unreliable CI can just add it to there scripts/config were there is no cost to having it).

We could also disable pulling failures by default and have user config option to turn it on

This "pull failures by default" is an unexpected behavioural change from BuildStream 1 which has a lot of downsides.

I would argue that its best to have the good for "local dev"/"new user" flow be default and the CI be the optional behaviour.

But I would happily have all these as optional flags with user configurable defaults

I disagree, again. It's very unexpected for a new user that when you do a build, it will fail because of reasons completely unrelated to what you are trying to build. Local development is continuation of CI, it's not a separate thing.

When you think you are doing local development, you are, in most non-trivial BuildStream usage, using CI-built artifacts and CI-created source caches.

The work flow that I follow is:

Try something new locally.
Once it works then push and have CI run.

If it turns out that I'm trying todo something some one else tried in CI then its nice if bst by default told me and went and pulled the logs so i can see the failure. (if you have buildtrees turned on you can even drop straight in to a shell to help debug)

Or if I used a bot like FD and the bot did something that really failed i just check out that branch and run bst build and i can have the logs or shell from the UI. (and i only wait for the pull not a hole build)

Then the ci just always has the --retry-cached-failures in the CI scripts so if it ever does fail due to flaky CI you just re run the job.

Yes, that is not how gnome and freedesktop-sdk and derivatives use BuildStream except very special people with beefy build machines.

I again disagree with build switch. It needs to be user config. Otherwise we can't propagate it properly.

It sounds like your saying you don't want a flag and user config . and you just want user config ? because I'm saying that I want both.

Well as some one who does a lot of building on top of FD_SDK at work i have reliable CI that i want to be able to pull the CI failures down as they are 99.9% real issues that i want the info for and almost never the CI being flaky.

In this case i want to pull the failed build and i want bst to give me options to look at the logs and drop into a shell if i have buildtrees enabled.

If there is a CI flaky issue then I am going to just restart the CI job and its wasy for the fag or user setting to be set in the ci config as i already have a load of things in the CI set from default.

And at home i usually only push to CI at the last minute as my home laptop or desktop are free once i buy them and the CI is pay per minute.

Even when it's 0.1%, evicting the bad artifact in high-security environment can mean days worth of work.

I am not against a flag. I am strongly for user config, otherwise integrating this is basically impossible.

Sorry for multiple posts, I had extremely bad connection overseas.

And yes, pulling failure artifact absolutely must be possible through bst artifact and possibly also best shell command (though I would personally prefer error that element is not built to cached failures with bst shell as well). The not nice behaviour is bst build pulling the failure. As said, this is actively dangerous default behaviour since it can result in automatic propagation of random failures to child project caches. (since bst build does cache push) This is very unpleasant to fix.

Also iirc artifact checkout for failed builds currently is not supported making cached failures mostly useless if your source root has no functional shell.

Lots of talk on this issue, any thoughts on the approach in #1849 ?

@gtristan can it be set in user config?

Also we don't want to retry failed local builds. We want to build ignoring cached failures.

If CI hits failure, it must follow normal failure behaviour

@gtristan can it be set in user config?

Yes.

Also we don't want to retry failed local builds. We want to build ignoring cached failures.

The default hasn't changed. You can explicitly retry a failed build with bst build --retry-failed or set that in your user configuration if you want to retry, this applies regardless of the machine on which bst is running (and bst shouldn't really know what machine it's running on).

If CI hits failure, it must follow normal failure behaviour

With this patch, your project defines normal failure behavior, and can decide whether to retry or not (I think it's a good idea to retry unconditionally in CI).

Okay. Then yes, it fixes the symptom. I still don't like the idea of downloading gigabytes of unnecessary failure buildtree into CI from CAS but seem I am having hard time convicing everyone the problem is not just the lack of retry but the downloading of failure artifacts.

Okay. Then yes, it fixes the symptom. I still don't like the idea of downloading gigabytes of unnecessary failure buildtree into CI from CAS but seem I am having hard time convicing everyone the problem is not just the lack of retry but the downloading of failure artifacts.

This seems like a nice optimization to do but I would consider that as orthogonal to this issue, currently artifacts are downloaded as a unit, before buildstream knows that it is a failure or not.

That said, it may be possible to have buildstream start by downloading the artifact metadata and conditionally download the artifact content in the case that the client doesn't want the data (which might be determined by whether we have --retry-failed specified).

It sounds like a good enhancement, it would be good to file/handle that in a separate issue.

Fair enough.

This is now completely blocking freedesktop-sdk and gnome development. Please handle asap and release fix.

This is a very serious bug in BuildStream. It's not just an enhancement.

To be fair, cached failures have very limited use for us without https://github.com/apache/buildstream/issues/1822.

Closing this after having merged #1849

apache / buildstream