brigadecore / brigade

Event-driven scripting for Kubernetes
https://brigade.sh/
Apache License 2.0

automation: nightly cleanup job can fail #1831

Closed krancour closed 2 years ago

krancour commented 2 years ago

The nightly cleanup job that cleans out our unstable image registry fails if any of the repositories it deletes do not currently exist. This could happen after a slow day.

This isn't a huge deal, but it can be improved.

mariuskimmina commented 2 years ago

I would like to give this a shot as my first issue here. I found the cleanup in .brigade/brigade.ts, and I guess there are two ways to go about it:

  1. Request a list of all repos from azure first and use this list instead of the hard coded one (this requires an extra request to azure tho)
  2. Just wrapping the await job.run() in a try/catch statement (can there be other cases where it might be important for this to fail?)
krancour commented 2 years ago

@mariuskimmina thanks for volunteering!

Just wrapping the await job.run() in a try/catch statement (can there be other cases where it might be important for this to fail?)

There's actually a built-in way of specifying that a job is allowed to fail without making the whole workflow fail. You can mark the job as fallible, i.e. job.fallible = true. However, and this is my fault for wording the issue poorly, that's not what we're after here.
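To make the semantics of a fallible job concrete, here is a minimal sketch. It is a stand-in model and not Brigade's actual implementation: `JobOutcome` and `workflowSucceeded` are hypothetical names, and only the `fallible` flag itself comes from the comment above.

```typescript
// Hypothetical stand-in for the result of one job in a workflow.
interface JobOutcome {
  name: string;
  fallible: boolean;  // mirrors job.fallible
  succeeded: boolean;
}

// The workflow as a whole succeeds when every job either succeeded
// or was explicitly allowed to fail.
function workflowSucceeded(outcomes: JobOutcome[]): boolean {
  return outcomes.every((o) => o.succeeded || o.fallible);
}
```

Under this model, a failed cleanup job marked fallible would not fail the workflow, while the same failure without the flag would.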

What I meant to say in this issue is that the job sometimes fails, but I'd like to address the issues that make it prone to failure in the first place. That is, the hardcoded list of repositories to delete needs to go -- especially since it's already outdated.

So your first option is the better one:

Request a list of all repos from azure first and use this list instead of the hard coded one (this requires an extra request to azure tho)

I don't care about the extra call. The script will, however, end up being more complex than anything you or I would probably want to write in-line in the brigade.ts file, so I would suggest writing a new shell script for nightly cleanup, stashing it in the scripts/ directory and having the job just invoke that.
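The list-then-delete logic could be sketched roughly as follows. The suggestion above is a shell script; TypeScript is used here only to illustrate the flow. The registry name and all function names are hypothetical, and the sketch assumes `az acr repository list --output tsv` prints one repository name per line.

```typescript
// Hypothetical registry name, used only for illustration.
const REGISTRY = "exampleregistry";

// Argument vector that asks Azure for the repositories that currently exist.
function listCommand(registry: string): string[] {
  return ["az", "acr", "repository", "list", "--name", registry, "--output", "tsv"];
}

// Turn the listing output (assumed: one name per line) into repository names,
// dropping blank lines and stray whitespace.
function parseRepositoryList(output: string): string[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// One delete command per repository that actually exists, so no delete is
// ever issued for a repository that is missing.
function deleteCommands(registry: string, repositories: string[]): string[][] {
  return repositories.map((repo) => [
    "az", "acr", "repository", "delete",
    "--name", registry,
    "--repository", repo,
    "--yes",
  ]);
}
```

Because the delete commands are derived from the listing rather than from a hardcoded list, an empty registry simply yields no commands.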

Sound ok?

Edit: Don't forget to chmod 755 on the new script.

mariuskimmina commented 2 years ago

Hey, so I created a bash script that gets the list of repos from Azure and then deletes all of them; I tested it on my own Azure account. I'm not sure how to call the script from brigade.ts tho, or, more precisely, how I could test that.

What I would do is:

let scriptname = "scripts/nightly-cleanup.sh"
job.primaryContainer.arguments = ["-c", scriptname]

would that work? Is there a way to test this locally?

krancour commented 2 years ago

Is there a way to test this locally?

Short of installing Brigade in a local k8s cluster (which is totally doable), no.

If you have Node installed, you can run a Brigade script, standalone, locally with a dummy event, but the jobs will all be no-ops. It will just tell you "this is where job x would have run." Really, it's only useful when you have conditionals in your workflow and want to test out what the whole workflow looks like end-to-end under different conditions. It's not really for executing individual jobs.

From time to time I've thought about whether it would be valuable to create some kind of local "mini-Brigade," or how we might approach it, but it's never been a priority.

To your immediate problem, the command and arguments attributes of the jobs work the exact same way they do in Docker, Docker Compose, and Kubernetes. So in this case, provided the script is executable, all you need is this:

job.primaryContainer.command = ["/path/to/the/script.sh"]
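To illustrate why setting command differs from setting only arguments, here is a small model of how Kubernetes combines a container's command and arguments with the image's defaults. It is a sketch of the documented rules, not Brigade or Kubernetes code; `ImageDefaults`, `ContainerSpec`, `effectiveArgv`, and the example image values are all hypothetical.

```typescript
// Hypothetical stand-ins: what the image declares and what the job overrides.
interface ImageDefaults {
  entrypoint: string[]; // Dockerfile ENTRYPOINT
  cmd: string[];        // Dockerfile CMD
}

interface ContainerSpec {
  command?: string[];   // replaces ENTRYPOINT when set
  arguments?: string[]; // replaces CMD when set
}

// Returns the argument vector the container would actually execute.
function effectiveArgv(image: ImageDefaults, spec: ContainerSpec): string[] {
  const command = spec.command || [];
  const args = spec.arguments || [];
  if (command.length > 0) {
    // An explicit command replaces ENTRYPOINT, and the image's CMD is ignored.
    return command.concat(args);
  }
  if (args.length > 0) {
    // Only arguments set: they are passed to the image's own ENTRYPOINT.
    return image.entrypoint.concat(args);
  }
  return image.entrypoint.concat(image.cmd);
}

// Example image whose entrypoint is not a shell (values are made up).
const image: ImageDefaults = { entrypoint: ["/entrypoint.sh"], cmd: ["serve"] };

const argsOnly = effectiveArgv(image, { arguments: ["-c", "scripts/nightly-cleanup.sh"] });
const commandOnly = effectiveArgv(image, { command: ["/path/to/the/script.sh"] });
```

In this model, `argsOnly` hands `-c scripts/nightly-cleanup.sh` to whatever the image's entrypoint happens to be, so it only behaves as intended when that entrypoint is a shell. `commandOnly` runs the script directly, which is why the script needs to be executable.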