basecamp / kamal

Deploy web apps anywhere.
https://kamal-deploy.org
MIT License

Multiple containers on a single machine. #1139

Open kczpl opened 1 month ago

kczpl commented 1 month ago

Hi! I switched a couple of my projects to Kamal. In some of them, especially in staging environments, I use multiple containers on one machine. Usually, we work with separate repositories for the frontend (FE) and backend (BE). On both, I use Kamal to deploy containers. Most of these applications are behind a load balancer, which also handles SSL termination.

               -> App server 1 (BE+FE containers)
Load Balancer  -> App server 2 (BE+FE containers)
               -> App server 3 (BE+FE containers)

The issue is that I experienced weird behavior during deployments. I postponed creating this issue because I couldn't find a common reason for it. Generally, sometimes deployments stop working, showing that the health check doesn't pass. When I stop the container and rerun the deployment in my CI/CD pipeline, it works again.

The error looks like this:

 ERROR (SSHKit::Command::Failed): Exception while executing on host <domain of random App Server>: docker exit status: 1
docker stdout: Nothing written
docker stderr: Error: target failed to become healthy

Additionally, I ran the deployment in verbose mode, and none of the containers returned a status of "unhealthy."

In my humble opinion, deploying multiple containers on one machine is a common use case. As I've investigated this issue for a while, I can say that the load balancer layer works just fine, and the containers are healthy. I assume the issue lies somewhere in Kamal's proxy and the way Kamal handles health checks.

I would love some hints or advice, or maybe there's something I’m doing wrong when defining health checks. Perhaps someone has successfully run such an architecture and can share the solution.


These are my configs:

Backend app:

service: app_backend_app
image: app_backend

builder:
  arch: amd64
  dockerfile: Dockerfile

servers:
  web:
    hosts:
      - api1.internal.domain.com
      - api2.internal.domain.com
  sidekiq:
    cmd: bundle exec sidekiq
    hosts:
      - api1.internal.domain.com
      - api2.internal.domain.com

registry:
  server: my-selfhosted.registry.com
  username:
    - CI_REGISTRY_USER
  password:
    - CI_REGISTRY_PASSWORD

env:
  secret:
    - RAILS_MASTER_KEY
    - RAILS_ENV
    - SOME_ENVS

ssh:
  user: myuser

proxy:
  healthcheck:
    path: /health # this endpoint checks the Redis and Postgres connections and returns 200 ("OK" message) if everything is ok
    interval: 2 
    timeout: 30
  host: api.domain.com,api1.internal.domain.com,api2.internal.domain.com
  app_port: 3000
  ssl: false
  forward_headers: true
  response_timeout: 30
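
(A side note on that endpoint: this is roughly how I sanity-check it by hand from the kamal Docker network, which is more or less the path kamal-proxy takes. A rough sketch only; curlimages/curl and the container-name placeholder are just for illustration.)

# find the backend container's name on this host (the service label is the one Kamal puts on the container)
docker ps --filter label=service=app_backend_app --format '{{.Names}}'
# hit /health over the kamal network, the same way the proxy reaches the app
docker run --rm --network kamal curlimages/curl -fsS http://<container-name>:3000/health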

Frontend app:

service: ap_frontend_app
image: app_frontend_backend

builder:
  arch: amd64
  dockerfile: Dockerfile

servers:
  web:
    hosts:
      - api1.internal.domain.com
      - api2.internal.domain.com

registry:
  server: my-selfhosted.registry.com
  username:
    - CI_REGISTRY_USER
  password:
    - CI_REGISTRY_PASSWORD

ssh:
  user: myuser

proxy:
  healthcheck:
    path: /health # this endpoint returns 200 OK from the NGINX container
    interval: 2 
    timeout: 30
  host: app.domain.com,app1.internal.domain.com,app2.internal.domain.com
  app_port: 3000
  ssl: false
  forward_headers: true
  response_timeout: 30
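
Both apps end up behind the same shared kamal-proxy on each host and are only told apart by their --host rules. If your kamal-proxy version has a list command, you can confirm that both services are registered (run on one of the app servers):

docker exec kamal-proxy kamal-proxy list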

I deploy containers from CI/CD pipeline using:

    - kamal deploy --skip-push --version=$CI_COMMIT_REF_SLUG -d staging -v

That's how I build containers in my GitLab CI; maybe it's relevant in this case:

build:
  stage: build
  image: docker:25-dind
  script:
    - |
      if [ "$CI_COMMIT_BRANCH" == "master" ]; then
          DOCKERFILE=Dockerfile.production
          LABEL=$PRODUCTION_LABEL # for kamal
      elif [ "$CI_COMMIT_BRANCH" == "staging" ]; then
          DOCKERFILE=Dockerfile.staging
          LABEL=$STAGING_LABEL # for kamal 
      else
          echo "Unknown branch"
          exit 1
      fi
    - echo "$CI_JOB_TOKEN" | docker login -u $CI_REGISTRY_USER --password-stdin $CI_REGISTRY
    - docker pull "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG" || true
    - |
      # Kamal label
      docker build --cache-from "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG" \
        -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG" \
        --label "service=$LABEL" \
        -f $DOCKERFILE .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG"
tuladhar commented 1 month ago

Do you experience the "target unhealthy" issue for both frontend and backend deploys? How frequently does it occur: on every deployment, or only occasionally?

kczpl commented 1 month ago

Do you experience the "target unhealthy" issue for both frontend and backend deploys? How frequently does it occur: on every deployment, or only occasionally?

It occurs for both frontend and backend deployments, but not necessarily at the same time.

What is common is that it always happens like this:

I have checked the pipeline history, and it happens every workday, sometimes twice a day, for the backend app. A similar situation has occurred with the frontend app, even though nothing has been merged there in a week. :))

djmb commented 1 month ago

Could you provide more context from the logs for your error? What command is it running that produces it?

kczpl commented 1 month ago

Could you provide more context from the logs for your error? What command is it running that produces it?

Sure. I’ve attached some logs below.

For more context, in this example, the BE and FE containers are deployed on two machines with internal domains that point to each machine: staging-app-1.internal.domain.com and staging-app-2.internal.domain.com. There is also a Sidekiq instance (not included in the proxy) and a docs static web app as an accessory (also not used in the proxy).

In this job we see:

  Finished all in 84.4 seconds
  ERROR (SSHKit::Command::Failed): Exception while executing on host staging-app-2.internal.domain.com: docker exit status: 1
docker stdout: Nothing written
docker stderr: Error: target failed to become healthy

So I took a look at the logs on staging-app-2.internal.domain.com:

There is also:

  INFO [fc6e8fdc] Finished in 0.331 seconds with exit status 0 (successful).
 DEBUG [cc575326]   Error: target failed to become healthy
 ERROR Failed to boot web on staging-app-2.internal.domain.com

A kamal-proxy command:

  INFO [cc575326] Running docker exec kamal-proxy kamal-proxy deploy my_backend_app-web-staging --target="ad9e1df96129:3000" --host="api.domain.com" --host="staging-app-1.internal.domain.com" --host="staging-app-2.internal.domain.com" --deploy-timeout="30s" --drain-timeout="30s" --health-check-interval="2s" --health-check-timeout="30s" --health-check-path="/health" --target-timeout="30s" --buffer-requests --buffer-responses --forward-headers --log-request-header="Cache-Control" --log-request-header="Last-Modified" --log-request-header="User-Agent" on staging-app-2.internal.domain.com

  INFO [50a9baf6] Running docker exec kamal-proxy kamal-proxy deploy my_backend_app-web-staging --target="10bc75d5c4c6:3000" --host="api.domain.com" --host="staging-app-1.internal.domain.com" --host="staging-app-2.internal.domain.com" --deploy-timeout="30s" --drain-timeout="30s" --health-check-interval="2s" --health-check-timeout="30s" --health-check-path="/health" --target-timeout="30s" --buffer-requests --buffer-responses --forward-headers --log-request-header="Cache-Control" --log-request-header="Last-Modified" --log-request-header="User-Agent" on staging-app-1.internal.domain.com

  INFO [50a9baf6] Finished in 26.606 seconds with exit status 0 (successful).
...

  INFO First web container is healthy on staging-app-1.internal.domain.com, booting any other roles

This is a docker run command:

INFO [3ebe7468] Running docker run --detach --restart unless-stopped --name my_backend_app-web-staging-staging --network kamal --hostname staging-app-2.internal.domain.com-07d2606e416b -e KAMAL_CONTAINER_NAME="app_backend_app-web-staging-staging" -e KAMAL_VERSION="staging" --env ENABLE_SIDEKIQ="false" --env RAILS_LOG_TO_STDOUT="true" --env PIDFILE="/tmp/server.pid" --env-file .kamal/apps/app_backend_app-staging/env/roles/web.env --log-opt max-size="10m" --label service="app_backend_app" --label role="web" --label destination="staging" myregistry.com/app_backend:staging on staging-app-2.internal.domain.com

After that, we stopped the container on staging-app-2.internal.domain.com, reran the job, and it worked.

Referring to this cc575326 hash, I assume it is the docker exec kamal-proxy kamal-proxy deploy command that is failing, and more specifically, it is assuming that my container is not healthy.

Do you want me to provide more logs from the failed job?

djmb commented 1 month ago

Thanks @kczpl!

If the deployments sometimes succeed, then your configuration sounds like it should be OK. Maybe the containers are taking too long to boot and hitting the deployment timeout (defaults to 30s)?

You can increase it by setting:

deploy_timeout: 60

You could also check the logs of the kamal-proxy and app containers after a failed deployment to see if there are any hints there.
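
For example, something like this (assuming the staging destination used above):

kamal proxy logs -d staging             # kamal-proxy's output around the failed deploy
kamal app logs -d staging --roles=web   # the app container's output while it was booting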

kczpl commented 1 month ago

Thank you very much @djmb

It seems like a good hint. I had considered timeouts and focused on tuning the health check timeouts, but apparently, I forgot about the global defaults. I have no idea how I could have overlooked this! 😄

(For anyone who reads this issue in the future, it's here).

I’ve made that change in one project that experienced this issue, and I’ll need a few development days to figure out whether it helps or not. I will give you a heads-up for sure :))

Rohland commented 1 month ago

I'm running into a similar issue, but only for a server where I'm not using the proxy.

Here is the kamal config (I just set the health-cmd to exit with 0 for testing):

proxy: false
deploy_timeout: 60

servers:
    workers:
        hosts: [ .. ]
        options:
            health-cmd: exit 0

If I run watch 'docker ps' on the host while deploying, I see the container become healthy after around 30 seconds. However, the deploy still fails with:

ERROR {"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2024-10-25T17:11:43.666173955Z","End":"2024-10-25T17:11:43.705803818Z","ExitCode":0,"Output":""},{"Start":"2024-10-25T17:12:13.706469904Z","End":"2024-10-25T17:12:13.741261852Z","ExitCode":0,"Output":""}]}
  INFO [c9188e98] Running docker container ls --all --filter name=^agent-workers-main$ --quiet | xargs docker stop on agent-241025-1e3a9
  INFO [c9188e98] Finished in 10.697 seconds with exit status 0 (successful).
Releasing the deploy lock...
  Finished all in 89.3 seconds
  ERROR (SSHKit::Command::Failed): Exception while executing on host agent-241025-1e3a9: docker exit status: 1
docker stdout: Nothing written
docker stderr: Error: target failed to become healthy

Any ideas?

Rohland commented 1 month ago

OK, just leaving this comment here for anyone else searching for this problem. I took a fresh look this morning and realised that proxy: false is meant to be configured in the relevant server role block. So my config should have been:

servers:
    workers:
        hosts: [ .. ]
        proxy: false
        options:
            health-cmd: exit 0

Seems so obvious now 🫣
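
If anyone else ends up debugging a proxy-less role like this, this is roughly how to see the same health data Kamal prints in the ERROR line above (the container name is a placeholder; use whatever docker ps shows for the role):

docker inspect --format '{{json .State.Health}}' <container-name>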

kczpl commented 4 weeks ago

If the deployments sometimes succeed, then your configuration sounds like it should be OK. Maybe the containers are taking too long to boot and hitting the deployment timeout (defaults to 30s)?

You can increase it by setting:

deploy_timeout: 60

Hello, everyone! I'd like to thank everyone involved in this discussion for helping me solve this issue. From what I've observed, the problem no longer occurs and deployments work just fine!

What I did was simply set:

readiness_delay: 60
deploy_timeout: 60
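
Both are root-level options in the deploy file, not nested under proxy. A minimal sketch of where they sit, assuming a config/deploy.staging.yml:

# config/deploy.staging.yml
deploy_timeout: 60    # how long Kamal waits for the new container to become healthy
readiness_delay: 60   # extra settle time after the container starts (as I understand it)

proxy:
  healthcheck:
    path: /health
    interval: 2
    timeout: 30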

One thing that still bothers me is the log message. Since the timeout was likely the reason, maybe the error message could be adjusted to indicate that the container was considered unhealthy because the timeout was exceeded?

thedumbtechguy commented 3 days ago

I had the same issue! Single container. I was using kamal remove and then deploying again.

The fix above worked for me.