basecamp / kamal

Deploy web apps anywhere.
https://kamal-deploy.org
MIT License
9.41k stars 362 forks source link

Remove the healthcheck step #740

Closed djmb closed 1 month ago

djmb commented 3 months ago

To speed up deployments, we'll remove the healthcheck step.

This adds some risk to deployments for non-web roles - if they don't have a Docker healthcheck configured then the only check we do is if the container is running.

If there is a bad image we might see the container running before it exits and deploy it. Previously the healthcheck step would have avoided this by ensuring a web container could boot and serve traffic first.

To mitigate this, we'll add a web barrier. Non web containers will wait before shutting down the old containers until at least one web container has passed its healthcheck.

It the web container fails its healthcheck, we'll close the barrier and shut down the new containers on the non-web roles.

We also have a new integration test to check we correctly handle a a broken image. This highlighted that SSHKit's default runner will stop at the first error it encounters. We'll now have a custom runner that waits for all threads to finish allowing them to clean up.

That means that if we have a deployment that completes on some hosts but not others we can run kamal app version --quiet to see which version is running on each host.

djmb commented 3 months ago

I'm hoping to get https://github.com/capistrano/sshkit/pull/535 added to SSHKit, so we don't need SSHKit::Runner::ParallelCompleteAll. It works as the default runner, but if you are doing a rolling deploy, that uses the group runner which internally uses SSHKit::Runner::Parallel.

argia-andreas commented 3 months ago

@djmb Hi! If I understand correctly the first healthcheck step will be removed by this. (But the container/docker healthchecks will remain, /up endpoint or a cmd for non web-roles.)

At the moment, we are using the fact that the healthcheck step runs before everything to run our migrations, which works nicely as we assure they run successfully before we start serving traffic from new web traffic serving containers.

Any recommendations for achieving a similar outcome, that is to make sure commands are run on one container successfully before traffic is served to new containers or new containers are booted, when the healthcheck step is removed?

Thanks.

argia-andreas commented 3 months ago

Also, would this PR resolve the issue with not being able to deploy to a specific role on a host that doesn't have the primary role?

As mentioned in this issue: #709

For additional context: This seems to be due to initial container healthcheck step which is performed with the primary role.

kamal deploy -r secondary-role

Ensure app can pass healthcheck...
  INFO [703ee48c] Running docker run --detach --name healthcheck-primary-role --publish 3999:80 --label service=healthcheck-primary-role -e KAMAL_CONTAINER_NAME="healthcheck-primary-role" --env-file .kamal/env/roles/primary-role.env --health-cmd "curl -f http://localhost/up || exit 1"
  INFO [a8ccc7f3] Finished in 0.075 seconds with exit status 123 (failed).
Releasing the deploy lock...
  Finished all in 22.6 seconds
  ERROR (SSHKit::Command::Failed): Exception while executing on host 192.168.0.11: docker exit status: 125
docker stdout: Nothing written
docker stderr: docker: open .kamal/env/roles/primary-role.env: no such file or directory.
djmb commented 2 months ago

@argia-andreas - you should be able to run migrations before anything is booted with a pre-deploy hook.

You'd want to run something like:

kamal app exec -p -q -d $KAMAL_DESTINATION --version $KAMAL_VERSION "rails db:migrate"

Regarding #709 - this is intended to fix that though I'll need to double check that it does!