build-canaries / nevergreen

:baby_chick: A build monitor with attitude
https://nevergreen.io
Eclipse Public License 1.0

Cancelled Jenkins jobs appear as failed builds #145

Closed ajlanghorn closed 8 years ago

ajlanghorn commented 8 years ago

I've noticed that if I track a Jenkins job in Nevergreen, run the job in Jenkins, and then cancel it before it completes (so it never gets a success or failure result), Nevergreen treats the cancelled build as a failed one and the job's blob in Nevergreen appears red.

jimmythompson commented 8 years ago

What do you think it should do instead?

Personally, I'd argue that having cancelled builds show red is the ideal. Hiding it sounds awful. Changing the colour to, say, grey only seems to inspire inaction, which sounds weird for something like nevergreen:

"Oh, it's been cancelled, someone else is already looking at it then."

GentlemanHal commented 8 years ago

Hi @ajlanghorn, thanks for raising this issue.

Pretty sure this is something Nevergreen can't distinguish, as I think the project status and activity can only be set to failure and sleeping in the cctray XML for cancelled builds, which are exactly the same values you'll get for legitimately broken builds.

Some CI servers may have unique values for such a situation but they'd be non-standard and we'd have to figure out how to get a comprehensive list.

Can you give us the cctray xml entry for the cancelled project so we can see what value it has (feel free to delete/obfuscate the name and url as they won't matter)?
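For readers following along, here is a minimal sketch of why the two cases look the same to a monitor. It assumes the usual cctray attributes (`lastBuildStatus`, `activity`) and the behaviour described in this thread, where a cancelled Jenkins job is reported as `Failure` and `Sleeping`. The project names, URLs, and the colour mapping in `monitor_state` are made up for illustration and are not Nevergreen's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: both entries are invented for illustration.
# "app-build" stands in for a genuinely broken build, "nightly-pg-dump"
# for a job that was cancelled by hand. Per the thread, both are reported
# with the same lastBuildStatus and activity.
CCTRAY_XML = """
<Projects>
  <Project name="app-build" activity="Sleeping" lastBuildStatus="Failure"
           lastBuildLabel="41" lastBuildTime="2017-01-10T10:00:00Z"
           webUrl="http://ci.example.com/job/app-build/"/>
  <Project name="nightly-pg-dump" activity="Sleeping" lastBuildStatus="Failure"
           lastBuildLabel="7" lastBuildTime="2017-01-10T11:00:00Z"
           webUrl="http://ci.example.com/job/nightly-pg-dump/"/>
</Projects>
"""


def monitor_state(project):
    """Map one <Project> element to a display colour (illustrative mapping)."""
    status = project.get("lastBuildStatus", "Unknown")
    if status in ("Failure", "Exception"):
        return "red"
    if status == "Success":
        return "green"
    # Anything else (for example "Unknown") is shown as grey.
    return "grey"


def states(xml_text):
    """Return {project name: colour} for every project in a cctray feed."""
    root = ET.fromstring(xml_text.strip())
    return {p.get("name"): monitor_state(p) for p in root.iter("Project")}


if __name__ == "__main__":
    print(states(CCTRAY_XML))
```

With this sample feed both projects map to `red`, because nothing in the attributes says which one was cancelled.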

I also agree with @jimmythompson: I think builds being forcibly cancelled should be treated like broken builds, because something has happened that requires action. The whole team should be aware of this.

As an aside, we already have grey builds; we use them for builds in an unknown state (which rarely happens; if I remember correctly, you'll get this in GoCD for the first run of a new pipeline).

I would be interested in hearing your thoughts though.

ajlanghorn commented 8 years ago

Typically speaking, @jimmythompson, I'd agree with you, but I also think it's reasonably important to note the distinction between a build failing because, say, the application couldn't be built before hitting the first environment in a deployment pipeline, and a job being cancelled because it was hogging executors, being too noisy, or otherwise run outside agreement.

The distinction here is between application build jobs and more administrative-focused jobs run in a CI server because it provides a convenient, centralised place for them (so, things like a pg_dump, for instance, or something to parse statistics from an Nginx access log, or to run ZAProxy regularly).

Unfortunately, having looked, it seems that Jenkins doesn't differentiate between a failed and cancelled job in cc.xml. That's possibly because CruiseControl never did, back when the spec was originally written, but the URL on the Jenkins wiki redirects back to TW.com, so I can't see the spec (and haven't dug much further into finding it just yet).

My current project doesn't make use of GoCD; we're a Jenkins shop, pretty much exclusively, so I've not seen the grey status -- thanks for the pointer, @GentlemanHal :)

joejag commented 8 years ago

Good chat, thanks for bringing this up @ajlanghorn

Let's get philosophical for a moment.

There's something about using a CI server as a cron replacement that makes me feel uneasy.

CI is about making sure changes are fully integrated and pass muster. With this lens a cancelled job is an important signal. It's letting you know that something is not integrated. It's something you immediately want to resolve before the problem gets worse.

Nevergreen has stayed true to this world view.

I've used rundeck in the past to separate "tasks" from "integration builds". It felt a more natural fit for that sort of work. But I know this isn't the industry norm.

ajlanghorn commented 8 years ago

> CI is about making sure changes are fully integrated and pass muster. With this lens a cancelled job is an important signal. It's letting you know that something is not integrated. It's something you immediately want to resolve before the problem gets worse.

Granted, my initial idea here was a little niche, and in 99.9% of cases, I agree wholeheartedly with you on this point. For certain things, though, I find using a CI server to achieve my aims of running a job in a repeatable fashion at set intervals quite useful. Typically, there are three ways of running these jobs:

  1. Using event-driven infrastructure, such as Lambda or OpenWhisk.
  2. Spinning up new instances solely to perform these functions (say, EC2 spot instances).
  3. Using existing infrastructure which performs scheduled actions well.

In a situation where a CI server is already available (see below comment, too), it's performing one of its core tasks well: running a menial task repeatedly, triggered into life by some event (be that a mouse click, a timer or a Git commit). The CI server also chucks in history for me, so I don't have to do anything extra to store console output, job status etc. external to where it ran (which I would have to with options 1 or 2, given that the infrastructure running the job would not exist for much longer than the job itself).

Granted, I partly use CI servers for things like this out of ease: the thing's there, people are using it regularly, and there's a central place for those jobs. The CI server manages secrets, times, events, history and all manner of other stuff for me, so rather than reinvent the wheel (when I usually have bigger fish to fry!), I end up using the CI server.

> I've used rundeck in the past to separate "tasks" from "integration builds". It felt a more natural fit for that sort of work. But I know this isn't the industry norm.

Personally, I shy away from using Rundeck (or equivalents) when a CI server is available. I wouldn't necessarily do so if a CD server was being used, because there's no logical home for these jobs on a CD server if you stick to the idea that the CD server is tied inextricably to pipelines.

GentlemanHal commented 8 years ago

> Unfortunately, having looked, it seems that Jenkins doesn't differentiate between a failed and cancelled job in cc.xml. That's possibly because CruiseControl never did, back when the spec was originally written, but the URL on the Jenkins wiki redirects back to TW.com, so I can't see the spec (and haven't dug much further into finding it just yet).

Thanks for checking!

I don't think it's possible to find more details about the spec any more; most of the links I found just dump you on a generic TW.com page as well. The link in my first post is one of the only documents I've managed to find.

Given it's looking unlikely we'd be able to distinguish cancelled builds, I'm thinking of closing this issue and marking it as won't fix.

If in the future we get the time to better analyse the cctray xml output by different CI servers and find additional status codes or fields that we might be able to use, we can always reopen.
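As a rough idea of what that analysis could look like, the sketch below tallies the `lastBuildStatus` and `activity` pairs seen in feeds from several servers and flags anything outside the commonly documented cctray values. The value sets are my understanding of the format rather than something confirmed in this thread, and the sample feeds, server labels, and the `Aborted` status are entirely hypothetical; no CI server is known here to emit such a value.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Values usually listed for the cctray format (an assumption, since the
# original spec could not be located in this thread).
STANDARD_STATUSES = {"Success", "Failure", "Exception", "Unknown"}
STANDARD_ACTIVITIES = {"Sleeping", "Building", "CheckingModifications"}

# Hypothetical feeds keyed by a made-up server label. "Aborted" is invented
# purely to show what a non-standard value would look like.
SAMPLE_FEEDS = {
    "server-a": """
        <Projects>
          <Project name="one" activity="Sleeping" lastBuildStatus="Success"/>
          <Project name="two" activity="Sleeping" lastBuildStatus="Failure"/>
        </Projects>""",
    "server-b": """
        <Projects>
          <Project name="three" activity="Sleeping" lastBuildStatus="Aborted"/>
        </Projects>""",
}


def survey(feeds):
    """Count (server, status, activity) triples and collect non-standard ones."""
    seen = Counter()
    nonstandard = set()
    for server, xml_text in feeds.items():
        root = ET.fromstring(xml_text.strip())
        for project in root.iter("Project"):
            status = project.get("lastBuildStatus")
            activity = project.get("activity")
            seen[(server, status, activity)] += 1
            if status not in STANDARD_STATUSES or activity not in STANDARD_ACTIVITIES:
                nonstandard.add((server, status, activity))
    return seen, nonstandard


if __name__ == "__main__":
    counts, extras = survey(SAMPLE_FEEDS)
    for key, count in sorted(counts.items()):
        print(key, count)
    print("non-standard:", sorted(extras))
```

Running something like this against real feeds from each CI server would show whether any of them report a value that could be used to tell cancelled builds apart.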

Thoughts?

ajlanghorn commented 8 years ago

@GentlemanHal Totally makes sense; happy to close this off. Thanks for the chat, all! :)