Open kczpl opened 1 month ago
Do you experience "target unhealthy" issue for both frontend and backend deploys? How frequently does it occur; on every deployment or occasionally.
It occurs for both frontend and backend deployments, but not necessarily at the same time.
What they have in common is that it always happens like this:
I have checked the pipeline history, and it happens every workday, sometimes twice a day, for the backend app. A similar situation has occurred for the frontend app, but they haven't merged anything in a week. :))
Could you provide more context from the logs for your error? What command is it running that produces it?
Sure. I’ve attached some logs below.
For more context, in this example, the BE and FE containers are deployed on two machines with internal domains that point to each machine: staging-app-1.internal.domain.com and staging-app-2.internal.domain.com. There is also a Sidekiq instance (not included in the proxy) and a docs static web app as an accessory (also not used in the proxy).
In this job we see:
Finished all in 84.4 seconds
ERROR (SSHKit::Command::Failed): Exception while executing on host staging-app-2.internal.domain.com: docker exit status: 1
docker stdout: Nothing written
docker stderr: Error: target failed to become healthy
So I took a look at the logs on staging-app-2.internal.domain.com.
There is also:
INFO [fc6e8fdc] Finished in 0.331 seconds with exit status 0 (successful).
DEBUG [cc575326] Error: target failed to become healthy
ERROR Failed to boot web on staging-app-2.internal.domain.com
A kamal-proxy command:
INFO [cc575326] Running docker exec kamal-proxy kamal-proxy deploy my_backend_app-web-staging --target="ad9e1df96129:3000" --host="api.domain.com" --host="staging-app-1.internal.domain.com" --host="staging-app-2.internal.domain.com" --deploy-timeout="30s" --drain-timeout="30s" --health-check-interval="2s" --health-check-timeout="30s" --health-check-path="/health" --target-timeout="30s" --buffer-requests --buffer-responses --forward-headers --log-request-header="Cache-Control" --log-request-header="Last-Modified" --log-request-header="User-Agent" on staging-app-2.internal.domain.com
INFO [50a9baf6] Running docker exec kamal-proxy kamal-proxy deploy my_backend_app-web-staging --target="10bc75d5c4c6:3000" --host="api.domain.com" --host="staging-app-1.internal.domain.com" --host="staging-app-2.internal.domain.com" --deploy-timeout="30s" --drain-timeout="30s" --health-check-interval="2s" --health-check-timeout="30s" --health-check-path="/health" --target-timeout="30s" --buffer-requests --buffer-responses --forward-headers --log-request-header="Cache-Control" --log-request-header="Last-Modified" --log-request-header="User-Agent" on staging-app-1.internal.domain.com
INFO [50a9baf6] Finished in 26.606 seconds with exit status 0 (successful).
...
INFO First web container is healthy on staging-app-1.internal.domain.com, booting any other roles
This is the docker run command:
INFO [3ebe7468] Running docker run --detach --restart unless-stopped --name my_backend_app-web-staging-staging --network kamal --hostname staging-app-2.internal.domain.com-07d2606e416b -e KAMAL_CONTAINER_NAME="app_backend_app-web-staging-staging" -e KAMAL_VERSION="staging" --env ENABLE_SIDEKIQ="false" --env RAILS_LOG_TO_STDOUT="true" --env PIDFILE="/tmp/server.pid" --env-file .kamal/apps/app_backend_app-staging/env/roles/web.env --log-opt max-size="10m" --label service="app_backend_app" --label role="web" --label destination="staging" myregistry.com/app_backend:staging on staging-app-2.internal.domain.com
After that, we stopped the container on staging-app-2.internal.domain.com, reran the job, and it worked.
Judging by the cc575326 hash, I assume it is the docker exec kamal-proxy kamal-proxy deploy command that is failing; more specifically, the proxy decides that my container is not healthy.
Do you want me to provide more logs from the failed job?
Thanks @kczpl!
If the deployments sometimes succeed, then your configuration sounds like it should be OK. Maybe the containers are taking too long to boot and hitting the deployment timeout (defaults to 30s)?
You can increase it by setting:
deploy_timeout: 60
You could also check the logs of the kamal-proxy and app containers after a failed deployment to see if there are any hints there.
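For reference, deploy_timeout is a top-level setting in config/deploy.yml. A minimal sketch of where it sits (the service and image names below are taken from the logs above, but the excerpt as a whole is hypothetical):

```yaml
# config/deploy.yml -- hypothetical excerpt
service: my_backend_app
image: myregistry.com/app_backend

# How long Kamal waits for a new container to become healthy
# before giving up on the deploy (default: 30 seconds).
deploy_timeout: 60
```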
Thank you very much @djmb
It seems like a good hint. I had considered timeouts and focused on tuning the health check timeouts, but apparently, I forgot about the global defaults. I have no idea how I could have overlooked this! 😄
(For anyone who reads this issue in the future, it's here).
I’ve made that change in one project that experienced this issue, and I’ll need a few development days to figure out whether it helps or not. I will give you a heads-up for sure :))
I'm running into a similar issue, but only for a server where I'm not using the proxy.
Here is the kamal config (I just set the health-cmd to exit with 0 for testing):
proxy: false
deploy_timeout: 60
servers:
  workers:
    hosts: [ .. ]
    options:
      health-cmd: exit 0
If I run watch 'docker ps' on the host while deploying, I see the container become healthy after around 30s. However, the deploy still fails with:
ERROR {"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2024-10-25T17:11:43.666173955Z","End":"2024-10-25T17:11:43.705803818Z","ExitCode":0,"Output":""},{"Start":"2024-10-25T17:12:13.706469904Z","End":"2024-10-25T17:12:13.741261852Z","ExitCode":0,"Output":""}]}
INFO [c9188e98] Running docker container ls --all --filter name=^agent-workers-main$ --quiet | xargs docker stop on agent-241025-1e3a9
INFO [c9188e98] Finished in 10.697 seconds with exit status 0 (successful).
Releasing the deploy lock...
Finished all in 89.3 seconds
ERROR (SSHKit::Command::Failed): Exception while executing on host agent-241025-1e3a9: docker exit status: 1
docker stdout: Nothing written
docker stderr: Error: target failed to become healthy
Any ideas?
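For anyone debugging this: the JSON in the ERROR line above is Docker's health state for the container (the same shape that docker inspect prints for .State.Health). A small Ruby sketch, parsing a log like the one above to confirm that the checks themselves passed and to see how far apart they ran:

```ruby
require "json"
require "time"

# Health state as printed by Docker (same shape as the ERROR line above)
health_json = <<~JSON
  {"Status":"healthy","FailingStreak":0,"Log":[
    {"Start":"2024-10-25T17:11:43.666173955Z","End":"2024-10-25T17:11:43.705803818Z","ExitCode":0,"Output":""},
    {"Start":"2024-10-25T17:12:13.706469904Z","End":"2024-10-25T17:12:13.741261852Z","ExitCode":0,"Output":""}]}
JSON

health = JSON.parse(health_json)
checks = health["Log"]

# A check passed if its command exited 0
all_passed = checks.all? { |c| c["ExitCode"].zero? }

# Interval between the two recorded checks
gap = Time.parse(checks[1]["Start"]) - Time.parse(checks[0]["Start"])

puts "status: #{health["Status"]}, failing streak: #{health["FailingStreak"]}"
puts "all checks passed: #{all_passed}"
puts "seconds between checks: #{gap.round}"
```

With the timestamps above, both checks exit 0 and run roughly 30 seconds apart, so the container really was healthy; whatever made the deploy fail was not a failing container health check (the actual cause turns up a couple of comments later).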
OK, just leaving this comment here for anyone else searching for this problem. Took a fresh look this morning and realised that proxy: false is meant to be configured in the relevant server role block. So my config should have been:
servers:
  workers:
    hosts: [ .. ]
    proxy: false
    options:
      health-cmd: exit 0
Seems so obvious now 🫣
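To see the structural difference between the two configs, a small Ruby sketch parsing both variants (the "worker-1" hostname is a hypothetical stand-in; this only illustrates the YAML shapes, not Kamal's actual option resolution):

```ruby
require "yaml"

# Variant 1: `proxy: false` at the top level -- it is a sibling of `servers`,
# so the workers role itself carries no proxy setting.
top_level = YAML.safe_load(<<~YML)
  proxy: false
  deploy_timeout: 60
  servers:
    workers:
      hosts: [ "worker-1" ]
      options:
        health-cmd: exit 0
YML

# Variant 2: `proxy: false` inside the role block, which is what
# "run this role without the proxy" requires.
per_role = YAML.safe_load(<<~YML)
  servers:
    workers:
      hosts: [ "worker-1" ]
      proxy: false
      options:
        health-cmd: exit 0
YML

puts top_level["servers"]["workers"].key?("proxy")  # the role has no proxy key at all
puts per_role["servers"]["workers"]["proxy"]        # proxy explicitly disabled for this role
```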
Hello, everyone! I would like to thank everyone involved in this discussion for helping me solve this issue. As I've observed, this problem no longer occurs, and the deployment works just fine!
What I did was simply set:
readiness_delay: 60
deploy_timeout: 60
One thing that still bothers me is the log message. Since the timeout was likely the reason, maybe the error message could be adjusted to indicate that the container was deemed unhealthy because the timeout was exceeded?
I had the same issue! Single container. I was using kamal remove and then deploying again.
The fix above worked for me.
Hi! I switched a couple of my projects to Kamal. In some of them, especially in staging environments, I use multiple containers on one machine. Usually, we work with separate repositories for the frontend (FE) and backend (BE). On both, I use Kamal to deploy containers. Most of these applications are behind a load balancer, which also handles SSL termination.
The issue is that I experienced weird behavior during deployments. I postponed creating this issue because I couldn't find a common reason for it. Generally, sometimes deployments stop working, showing that the health check doesn't pass. When I stop the container and rerun the deployment in my CI/CD pipeline, it works again.
The error looks like this:
Additionally, I ran the deployment in verbose mode, and none of the containers returned a status of "unhealthy."
In my humble opinion, deploying multiple containers on one machine is a common use case. Having investigated this issue for a while, I can say that the load balancer layer works just fine and the containers are healthy. I assume the issue lies somewhere in Kamal's proxy and the way Kamal handles health checks.
I would love some hints or advice, or maybe there's something I’m doing wrong when defining health checks. Perhaps someone has successfully run such an architecture and can share the solution.
Those are my configs:
Backend app:
Frontend app:
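The attached configs are not reproduced here. For readers, a hedged sketch of the kind of per-app proxy block such a setup typically uses (the hostname and health-check values are taken from the deploy logs above; ssl: false is an assumption, since TLS is terminated at the load balancer):

```yaml
# Hypothetical excerpt of the backend app's config/deploy.yml
proxy:
  ssl: false            # the external load balancer terminates SSL
  host: api.domain.com
  healthcheck:
    path: /health
    interval: 2
    timeout: 30
```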
I deploy containers from the CI/CD pipeline using:
That's how I build containers in my GitLab CI; maybe it is important in this case:
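The actual build and deploy jobs are attached above. As a generic reference, a hedged sketch of what a Kamal deploy job in GitLab CI often looks like (job name, image, and the staging destination are assumptions based on the logs above, not the author's actual pipeline):

```yaml
# Hypothetical .gitlab-ci.yml excerpt
deploy-staging:
  stage: deploy
  image: ruby:3.3
  script:
    - gem install kamal
    - kamal deploy --destination staging
```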