buildkite-plugins / docker-compose-buildkite-plugin

🐳⚡️ Run build scripts, and build + push images, w/ Docker Compose
MIT License
171 stars 138 forks source link

Failures stopping containers are too quiet #445

Open ran-arigur opened 2 months ago

ran-arigur commented 2 months ago

Hi,

My company recently had an internal incident where our builds started failing in mysterious ways. The problem turned out to be that the Docker Compose Buildkite Plugin's pre-exit hook script was failing to stop containers (for reasons outside the scope of this issue), so they were sticking around, causing a buildup of running Docker containers that ultimately consumed too many resources.

The issue is, the script in question swallows these failures with || true [code link], which meant that we didn't discover the issue when we were testing the change that ended up triggering the incident, and that even once the incident was underway, it took a long time to locate the issue.

(The command output does appear in the logs, but it's doubly collapsed: firstly because the build step succeeds, and secondly because it's inside a collapsed section. So the errors are completely invisible until you've specifically discovered that containers aren't being stopped and specifically go looking for why.)

I do think it makes sense to avoid failing the build step at this point, but I think the script should emit a warning annotation, and should de-collapse the relevant section of the build log.

To do that, I think the relevant function (compose_cleanup) should have a local had_failure variable; the || true should be replaced with || had_failure=yes; and the function should end with something like this:

if [[ "$had_failure" ]] ; then
  local job_desc="$BUILDKITE_LABEL"
  if (( ${BUILDKITE_PARALLEL_JOB_COUNT-1} > 1 )) ; then
    job_desc+=" $((BUILDKITE_PARALLEL_JOB+1))/$BUILDKITE_PARALLEL_JOB_COUNT"
  fi

  local job_link="[$job_desc](#$BUILDKITE_JOB_ID)"

  buildkite-agent annotate \
    "Error in Docker Compose Buildkite Plugin pre-exit hook in $job_link." \
    --style warning \
    --context docker-compose-buildkite-plugin-pre-exit-failure

  echo '^^^ +++'
fi

I think the same should also be done in ensure_stopped [code link].

Disclaimers:

  1. Not tested.
  2. I'm not sure if this should be done for all occurrences of || true, or if some errors should really remain silent. (docker logs, for example, doesn't seem worth warning about.) I'll defer to your judgment. 🙂

Thanks in advance!

tomowatt commented 2 months ago

Hey @ran-arigur

Thank you for raising this and for the example code, we will raise this internally to improve the usage of the Plugin itself as definitely debugging of errors is harder when messages useful output is hidden and I think annotations and re-openening build log group are a great way to highlight for those types of scenarios