frenkel opened 1 week ago
@frenkel, have you seen the following issues?
I wonder if some of the proposed workarounds could be applied to your situation.
This kind of issue often boils down to configuration or something in the runtime environment rather than an actual bug in kamal-proxy or Kamal itself. So it's worth pursuing the avenue of adjusting the configuration first before trying other things.
Thanks for the reply, but it works fine in the other 14 out of 15 deploys. I use thruster and port 80, so no need to change the app_port.
I am having a hard time reproducing this locally, even with a container that is deliberately slow (starts slowly and staggers its replies to the /up endpoint).
Can you share some more details, if possible, about your setup, application and deployment, etc. Anything custom? Anything special or bespoke that we might not be handling correctly?
I also wonder if there would be a way for you to build (or come up with) a minimal reproducer that triggers this issue reliably. If not, then no worries. We can think of something...
Thanks for looking into this. I hadn’t thought about a timeout as the docs say it is 30 seconds and the first 200 is after 17 seconds. Do you think I need to increase it above 30 seconds?
The application is nothing special. Just Rails 7.2 with postgresql, puma and thruster. The server is a staging server that doesn’t receive much traffic, just one or two test users. We did experience it on a production server as well though.
Is the “context canceled” error normal? And the “Error: null”?
This is my config:
deploy.staging.yml:

```yaml
servers:
  web:
    hosts:
      - staging.example.com
  jobs:
    hosts:
      - staging.example.com
    cmd: "bin/jobs"

proxy:
  ssl: true
  hosts:
    - beheer.example.dev
    - inloggen.example.dev
    - development.example.dev
    - development0.example.dev
    - development1.example.dev
    - development2.example.dev
    - development3.example.dev
    - development4.example.dev
    - development5.example.dev
    - development6.example.dev
    - development7.example.dev
    - development8.example.dev
    - development9.example.dev
    - signup.example.dev
    - demo.example.dev

env:
  clear:
    RAILS_ENV: staging
    ASSUME_SSL: true
    RAILS_MAX_THREADS: 3
    WEB_CONCURRENCY: 2

ssh:
  user: staging
```
deploy.yml:

```yaml
service: platform
image: example-com/platform

logging:
  driver: journald

servers: [] # overridden in destination specific files

registry:
  server: example-container-registry.com
  username: platform
  password:
    - KAMAL_REGISTRY_PASSWORD

builder:
  arch: amd64
  cache:
    type: gha
    options: mode=max
    image: kamal-app-build-cache

# Inject ENV variables into containers (secrets come from .kamal/secrets).
env:
  clear:
    TZ: Europe/Amsterdam
  secret:
    - DATABASE_URL

aliases:
  console: app exec --interactive --reuse "bin/rails console"
  shell: app exec --interactive --reuse "bash"
  logs: app logs -f
  dbc: app exec --interactive --reuse "bin/rails dbconsole"

volumes:
  - "app_storage:/rails/storage"
  - "app_cache:/rails/tmp/cache" # for session storage
  - "app_credentials:/rails/config/credentials"

asset_path: /rails/public/assets
require_destination: true

accessories:
  db:
    image: postgres:15
    roles:
      - web
    port: "127.0.0.1:5432:5432"
    env:
      secret:
        - POSTGRES_PASSWORD
        - POSTGRES_USER
        - POSTGRES_DB
    volumes:
      - "db_data:/var/lib/postgresql/data"
```
@frenkel, thank you for more details! A few more questions...
What operating system and kernel is the host running? The contents of /etc/os-release, plus the output of cat /proc/version and /proc/cmdline, would be useful, as would the output of docker info and docker version.
That said, looking at the system logs under the system-wide /var/log directory and the output of the dmesg command is also a good idea. Perhaps there is some signal there.
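For reference, a quick way to gather all of the host details mentioned above in one go (paths are the standard Linux locations; adjust if your distro differs, and some commands may need root):

```shell
# Host and kernel details
uname -a
cat /etc/os-release      # distro and version
cat /proc/version        # kernel build
cat /proc/cmdline        # kernel boot parameters

# Container runtime details
docker version           # client and daemon versions
docker info              # storage driver, cgroups, logging driver

# Recent daemon and kernel messages
journalctl -u docker --since "1 hour ago" --no-pager
dmesg | tail -n 50
```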
You could try doubling the timeout as an experiment, to see what the outcome would be. As you have a staging environment handy, there should hopefully be no harm done in doing so. You know, for science!
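Concretely, that would mean raising Kamal's overall deployment health check deadline in the destination config. A sketch, assuming the root-level `deploy_timeout` option (in seconds) from the Kamal docs; double-check the option name against your Kamal version:

```yaml
# deploy.staging.yml (excerpt) — raise the deadline from the default 30s
# to 60s as an experiment:
deploy_timeout: 60
```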
See for reference:
The "context canceled" error we are seeing is a timeout and comes from Thruster, per:
Some of the other logs also come from Thruster and originate from the request logging middleware:
The upstream wasn't available for a short time per:
2024-11-22T08:19:10.7616212Z 2024-11-22T08:18:41.619040000Z {"time":"2024-11-22T09:18:41.618760816+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"dial tcp 127.0.0.1:3000: connect: connection refused"}
Then, the listener eventually became ready to listen for an incoming connection, but it did not respond to the health check in a timely manner, hence the timeout.
There are some 502s, too, but I am not sure about these. Perhaps the upstream (backend) is not ready yet.
Where possible, you could enable the debug log level and then run deployments to your staging environment, perhaps several times, with the aim of triggering the issue.
If you can't reproduce this with repeated deployments of the same release/version, then perhaps the issue only appears with a new version? Possibly, something that runs as part of a new release (e.g., an expensive migration, extra scripts and tasks, etc.) delays the service from starting up in a timely manner.
@kevinmcconnell, I am not sure if this issue belongs to kamal-proxy, and maybe it would be better to move it under the main Kamal project?
Thanks for the help, but I don’t see how any of this affects two http code 200s as being ignored.
This is just Debian stable with Docker from the repositories. There is nothing of note in the logs. Debug level logging won't help here, as it only adds SQL queries, which I won't be able to post. dmesg output is kernel messages, so it has no useful content for a container that doesn't get marked as healthy.
@frenkel, I am trying to understand what happens on your machine.
Without a reliable reproducer, we can only speculate, especially since I cannot reproduce this issue locally.
Yes, I know what dmesg is for.
Of course, I understand. Some of those just seemed so unrelated. I will try to experiment and research it some more. It seems a successful deploy doesn’t log the healthcheck requests. Any idea how to see them?
@frenkel could you share the logs from the proxy container, from the start of the deployment through to the failure? The proxy should be logging its activity, and it will include the point where it finds the target to be healthy (if it does). An example of proxy logs for a successful deployment:
{"time":"2024-11-26T08:47:33.832254938Z","level":"INFO","msg":"Deploying","service":"s1","hosts":null,"target":"example-web-1","tls":false}
{"time":"2024-11-26T08:47:33.833367109Z","level":"INFO","msg":"Healthcheck succeeded","status":200}
{"time":"2024-11-26T08:47:33.833382878Z","level":"INFO","msg":"Target health updated","target":"example-web-1","success":true,"state":"healthy"}
{"time":"2024-11-26T08:47:33.833398898Z","level":"INFO","msg":"Deployed","service":"s1","hosts":null,"target":"example-web-1"}
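(To pull those logs yourself, something like the following should work. "kamal-proxy" is an assumption here for the default proxy container name, and the `kamal proxy logs` subcommand assumes a Kamal 2.x CLI; check `docker ps` and `kamal proxy --help` if yours differs.)

```shell
# Straight from Docker, covering the deployment window:
docker logs kamal-proxy --since 2h --timestamps 2>&1 | tail -n 200

# Or via Kamal itself, for the staging destination:
kamal proxy logs -d staging 2>&1 | tail -n 200
```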
The "target failed to become healthy" error should only happen if the timeout elapses before the proxy receives a healthy response, and my guess is that's what's happening here. But the proxy logs should confirm if that's the case. (The successful /up responses are a bit odd, but I wonder if the healthcheck ran past the end of the timeout; I'll look into that.)
The "context cancelled" errors are, as @kwilczynski says, due to the target failing to respond to a healthcheck request in time. So those are expected in this case. The default timeout for each individual healthcheck request is 5s, so those healthcheck requests are being cancelled after 5s and retried. Retrying should continue until the overall deployment timeout elapses (30s by default).
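The retry behaviour described above can be sketched as a simplified model (plain Python, not kamal-proxy's actual Go implementation; the 1s pause between retries is an assumption, only the 5s per-attempt and 30s overall defaults come from the thread):

```python
import time

def wait_until_healthy(check, attempt_timeout=5.0, deploy_timeout=30.0,
                       now=time.monotonic, sleep=time.sleep):
    """Retry `check` until it reports healthy or the overall deadline passes.

    `check(timeout)` should return True if the target answered 200 within
    `timeout` seconds; a slow or refused probe returns False (in the real
    proxy, a slow probe surfaces as a "context canceled" attempt).
    """
    deadline = now() + deploy_timeout
    while now() < deadline:
        remaining = deadline - now()
        # Each probe is capped at 5s, but never past the overall deadline.
        if check(min(attempt_timeout, remaining)):
            return True   # healthy within the deployment window
        sleep(1.0)        # pause between retries (assumed interval)
    return False          # "target failed to become healthy"
```

This also illustrates Kevin's point: a target that only turns healthy after 17s still succeeds, because the individual 5s cancellations are retried until the 30s deadline.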
Apparently the log setting of my Kamal config is not passed to kamal-proxy, so the logs of the kamal-proxy container use Docker's default json-file logging, which seems to have been cut off at the end of November 23rd. Earlier logs are removed and I cannot see the lines you asked for. As soon as it occurs again I'll get those logs and post them here.
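One way to keep more log history around in the meantime, independent of Kamal's own log settings, is to configure rotation for Docker's default json-file driver on the host, in /etc/docker/daemon.json (standard Docker daemon configuration; restart the daemon afterwards, and note it only applies to newly created containers):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}
```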
By the way: should kamal-proxy honor the log settings in kamal configuration file? If so, should I file a bug there?
As soon as it occurs again I'll get those logs and post them here.
Sounds good, thanks! I think it's likely that when you get the failure, it's because the app took too long to become healthy. But when we have those logs we'll be able to confirm.
I've also found that the logs do indeed sometimes show some additional healthchecks after the timeout elapses. Those checks are fairly harmless (their result is ignored) but they make it confusing to debug situations like this one. I'll have a fix for that in place shortly.
By the way: should kamal-proxy honor the log settings in kamal configuration file? If so, should I file a bug there?
It's worth checking on the Kamal project about this. I think it makes sense for the log settings to apply everywhere, however it's possible for a single proxy container to be shared by multiple apps, so there may be some complications to consider. So it's worth bringing up the question I think.
We've been using Kamal with kamal-proxy for a few weeks now and overall it works great. We do have a random case (roughly 1 in 15 deploys) where the deployment fails because the new version doesn't become healthy according to Kamal, although the logs do show a correct status code 200 for the healthcheck. Attached are the logs of such a failure:
What could be the cause of this and how can we prevent it?