basecamp / kamal-proxy

Lightweight proxy server for Kamal
https://kamal-deploy.org/
MIT License

Randomly fails to become healthy #71

Open frenkel opened 1 week ago

frenkel commented 1 week ago

We've been using Kamal with kamal-proxy for a few weeks now and overall it works great. We do have a random case (roughly 1 in 15) where the deployment fails because the new version doesn't become healthy according to Kamal, although the logs do show a correct 200 status code for the healthcheck. Attached are the logs of such a failure:

2024-11-22T08:19:10.7613515Z 2024-11-22T08:18:40.620061000Z {"time":"2024-11-22T09:18:40.619937363+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":0,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52158","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7616212Z 2024-11-22T08:18:41.619040000Z {"time":"2024-11-22T09:18:41.618760816+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"dial tcp 127.0.0.1:3000: connect: connection refused"}
2024-11-22T08:19:10.7618864Z 2024-11-22T08:18:41.632961000Z {"time":"2024-11-22T09:18:41.629230642+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":10,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52166","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7620828Z 2024-11-22T08:18:42.275092000Z [21] Puma starting in cluster mode...
2024-11-22T08:19:10.7621369Z 2024-11-22T08:18:42.276457000Z [21] * Puma version: 6.4.3 (ruby 3.1.6-p260) ("The Eagle of Durango")
2024-11-22T08:19:10.7621866Z 2024-11-22T08:18:42.277440000Z [21] *  Min threads: 3
2024-11-22T08:19:10.7622224Z 2024-11-22T08:18:42.277496000Z [21] *  Max threads: 3
2024-11-22T08:19:10.7622593Z 2024-11-22T08:18:42.277526000Z [21] *  Environment: staging
2024-11-22T08:19:10.7622974Z 2024-11-22T08:18:42.278793000Z [21] *   Master PID: 21
2024-11-22T08:19:10.7623347Z 2024-11-22T08:18:42.279629000Z [21] *      Workers: 2
2024-11-22T08:19:10.7623991Z 2024-11-22T08:18:42.280388000Z [21] *     Restarts: (✔) hot (✖) phased
2024-11-22T08:19:10.7624421Z 2024-11-22T08:18:42.281572000Z [21] * Preloading application
2024-11-22T08:19:10.7624867Z 2024-11-22T08:18:42.281619000Z [21] * Listening on http://0.0.0.0:3000
2024-11-22T08:19:10.7625292Z 2024-11-22T08:18:42.281651000Z [21] Use Ctrl-C to stop
2024-11-22T08:19:10.7625738Z 2024-11-22T08:18:42.312462000Z [21] - Worker 0 (PID: 36) booted in 0.02s, phase: 0
2024-11-22T08:19:10.7626247Z 2024-11-22T08:18:42.328429000Z [21] - Worker 1 (PID: 37) booted in 0.03s, phase: 0
2024-11-22T08:19:10.7626948Z 2024-11-22T08:18:47.622613000Z {"time":"2024-11-22T09:18:47.620309603+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"context canceled"}
2024-11-22T08:19:10.7628306Z 2024-11-22T08:18:47.623511000Z {"time":"2024-11-22T09:18:47.620907601+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":5001,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52172","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7629858Z 2024-11-22T08:18:53.380502000Z {"time":"2024-11-22T09:18:53.379131964+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"context canceled"}
2024-11-22T08:19:10.7631211Z 2024-11-22T08:18:53.381686000Z {"time":"2024-11-22T09:18:53.379478313+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":5755,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:52184","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7632528Z 2024-11-22T08:18:58.388605000Z {"time":"2024-11-22T09:18:58.383575978+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"context canceled"}
2024-11-22T08:19:10.7633872Z 2024-11-22T08:18:58.389712000Z {"time":"2024-11-22T09:18:58.383848059+01:00","level":"INFO","msg":"Request","path":"/up","status":502,"dur":4999,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":626,"resp_content_type":"text/html","remote_addr":"172.18.0.3:46734","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7635860Z 2024-11-22T08:18:59.716690000Z {"time":"2024-11-22T09:18:59.716474635+01:00","level":"INFO","msg":"Request","path":"/up","status":200,"dur":1331,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":73,"resp_content_type":"text/html; charset=utf-8","remote_addr":"172.18.0.3:46748","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:10.7637841Z 2024-11-22T08:18:59.729143000Z {"time":"2024-11-22T09:18:59.728924097+01:00","level":"INFO","msg":"Request","path":"/up","status":200,"dur":9,"method":"GET","req_content_length":0,"req_content_type":"","resp_content_length":73,"resp_content_type":"text/html; charset=utf-8","remote_addr":"172.18.0.3:46750","user_agent":"Go-http-client/1.1","cache":"miss","query":""}
2024-11-22T08:19:19.7084400Z   INFO [fdc801ec] Running docker container ls --all --filter name=^platform-web-staging-18258ef278ae34862e4efeb855abf1960a3a0e37$ --quiet | xargs docker inspect --format '{{json .State.Health}}' on staging.toolsfactory.nl
2024-11-22T08:19:19.7086005Z   INFO [fdc801ec] Finished in 0.453 seconds with exit status 0 (successful).
2024-11-22T08:19:19.7086594Z  ERROR null
2024-11-22T08:19:19.7087661Z   INFO [bb6b6f8e] Running docker container ls --all --filter name=^platform-web-staging-18258ef278ae34862e4efeb855abf1960a3a0e37$ --quiet | xargs docker stop on staging.toolsfactory.nl
2024-11-22T08:19:19.7088948Z   INFO [bb6b6f8e] Finished in 8.501 seconds with exit status 0 (successful).
2024-11-22T08:19:19.7089717Z Releasing the deploy lock...
2024-11-22T08:19:20.0281473Z   Finished all in 68.9 seconds
2024-11-22T08:19:20.0305461Z   ERROR (SSHKit::Command::Failed): Exception while executing on host staging.toolsfactory.nl: docker exit status: 1
2024-11-22T08:19:20.0306650Z docker stdout: Nothing written
2024-11-22T08:19:20.0307308Z docker stderr: Error: target failed to become healthy

What could be the cause of this and how can we prevent it?

kwilczynski commented 5 days ago

@frenkel, have you seen the following issues?

I wonder if some of the proposed workarounds could be applied to your situation.

More often than not, this kind of issue boils down to configuration or something in the runtime environment rather than an actual bug in kamal-proxy or Kamal itself, so it's worth adjusting the configuration first before trying other things.

frenkel commented 5 days ago

Thanks for the reply, but it works fine in the other 14 out of 15 deploys. I use thruster and port 80, so no need to change the app_port.

kwilczynski commented 5 days ago

I am having a hard time reproducing this locally, even with a container that is deliberately slow (starts slowly and staggers its replies to the /up endpoint).

Can you share some more details, if possible, about your setup, application, deployment, etc.? Anything custom? Anything special or bespoke that we might not be handling correctly?

I also wonder if there would be a way for you to build (or come up with) a minimal reproducer that triggers this issue reliably. If not, then no worries. We can think of something...

frenkel commented 5 days ago

Thanks for looking into this. I hadn't thought about a timeout, as the docs say it is 30 seconds and the first 200 comes after 17 seconds. Do you think I need to increase it above 30 seconds?

The application is nothing special. Just Rails 7.2 with postgresql, puma and thruster. The server is a staging server that doesn’t receive much traffic, just one or two test users. We did experience it on a production server as well though.

Is the "context canceled" error normal? And the "ERROR null"?

frenkel commented 4 days ago

This is my config:

deploy.staging.yml

servers:
  web:
    hosts:
      - staging.example.com
  jobs:
    hosts:
      - staging.example.com
    cmd: "bin/jobs"

proxy:
  ssl: true
  hosts:
    - beheer.example.dev
    - inloggen.example.dev
    - development.example.dev
    - development0.example.dev
    - development1.example.dev
    - development2.example.dev
    - development3.example.dev
    - development4.example.dev
    - development5.example.dev
    - development6.example.dev
    - development7.example.dev
    - development8.example.dev
    - development9.example.dev
    - signup.example.dev
    - demo.example.dev

env:
  clear:
    RAILS_ENV: staging
    ASSUME_SSL: true
    RAILS_MAX_THREADS: 3
    WEB_CONCURRENCY: 2

ssh:
  user: staging

deploy.yml

service: platform
image: example-com/platform
logging:
  driver: journald

servers: [] # overridden in destination specific files

registry:
  server: example-container-registry.com
  username: platform
  password:
    - KAMAL_REGISTRY_PASSWORD

builder:
  arch: amd64
  cache:
    type: gha
    options: mode=max
    image: kamal-app-build-cache

# Inject ENV variables into containers (secrets come from .kamal/secrets).
#
env:
  clear:
    TZ: Europe/Amsterdam
  secret:
    - DATABASE_URL

aliases:
  console: app exec --interactive --reuse "bin/rails console"
  shell: app exec --interactive --reuse "bash"
  logs: app logs -f
  dbc: app exec --interactive --reuse "bin/rails dbconsole"

volumes:
  - "app_storage:/rails/storage"
  - "app_cache:/rails/tmp/cache" # for session storage
  - "app_credentials:/rails/config/credentials"

asset_path: /rails/public/assets

require_destination: true

accessories:
  db:
    image: postgres:15
    roles:
      - web
    port: "127.0.0.1:5432:5432"
    env:
      secret:
        - POSTGRES_PASSWORD
        - POSTGRES_USER
        - POSTGRES_DB
    volumes:
      - "db_data:/var/lib/postgresql/data"

kwilczynski commented 4 days ago

@frenkel, thank you for more details! A few more questions...

That said, looking at the system logs under the system-wide /var/log directory and the output of the dmesg command is also a good idea. Perhaps there is some signal there.

You could try doubling the timeout as an experiment, to see what the outcome would be. As you have a staging environment handy, there should hopefully be no harm in doing so. You know, for science!
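
For example, a minimal sketch of what that experiment might look like in deploy.staging.yml; the deploy_timeout and proxy.healthcheck option names below are assumptions taken from the Kamal 2 configuration docs, so verify them against the Kamal version in use:

# deploy.staging.yml (excerpt, hypothetical values)
deploy_timeout: 60      # assumed option: overall window Kamal waits for the new container to boot and report healthy
proxy:
  healthcheck:
    interval: 1         # assumed option: seconds between health check attempts
    timeout: 60         # assumed option: how long the proxy waits for the target to answer health checks

Which of these actually controls the "target failed to become healthy" window may differ between versions, so treat this purely as an experiment.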

See for reference:

The context canceled error we are seeing is a timeout and comes from Thruster, per:

Some of the other logs also come from Thruster and originate from the request logging middleware:

The upstream wasn't available for a short time per:

2024-11-22T08:19:10.7616212Z 2024-11-22T08:18:41.619040000Z {"time":"2024-11-22T09:18:41.618760816+01:00","level":"INFO","msg":"Unable to proxy request","path":"/up","error":"dial tcp 127.0.0.1:3000: connect: connection refused"}

Then the listener eventually became ready to accept connections, but it did not respond to the health check in a timely manner, hence the timeout.

There are some 502s, too, but I am not sure about these. Perhaps the upstream (backend) is not ready yet.

Where possible, you could enable the debug log level and then run deployments to your staging environment, perhaps several times, with the aim of triggering the issue.
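
If that refers to the application's log level, one possible approach (a sketch, assuming config/environments/staging.rb keeps the Rails 7 default of reading the level from ENV["RAILS_LOG_LEVEL"]) would be to add it to the env block in deploy.staging.yml:

# deploy.staging.yml (excerpt, hypothetical)
env:
  clear:
    RAILS_ENV: staging
    RAILS_LOG_LEVEL: debug   # only takes effect if the staging environment file honors this variable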

If you can't reproduce this with repeated deployments of the same release/version, then perhaps the issue only appears when deploying a new version? Possibly something that runs as part of a new release (e.g., an expensive migration, extra scripts and tasks, etc.) keeps the service from starting up in a timely manner.

@kevinmcconnell, I am not sure if this issue belongs to kamal-proxy, and maybe it would be better to move it under the main Kamal project?

frenkel commented 4 days ago

Thanks for the help, but I don't see how any of this explains why two HTTP 200 responses were ignored.

This is just Debian stable with Docker from the repositories. There is nothing of note in the logs. Debug-level logging won't help here, as it only adds SQL queries, which I won't be able to post. The dmesg output is kernel messages, so it has no useful content for a container that doesn't get marked as healthy.

kwilczynski commented 4 days ago

@frenkel, I am trying to understand what happens on your machine.

Without a reliable reproducer, we can only speculate, especially since I cannot reproduce this issue locally.

Yes, I know what dmesg is for.

frenkel commented 4 days ago

Of course, I understand. Some of those just seemed so unrelated. I will try to experiment and research it some more. It seems a successful deploy doesn’t log the healthcheck requests. Any idea how to see them?

kevinmcconnell commented 3 days ago

@frenkel could you share the logs from the proxy container, from the start of the deployment through to the failure? The proxy should be logging its activity, including the point where it finds the target to be healthy (if it does). An example of proxy logs for a successful deployment:

{"time":"2024-11-26T08:47:33.832254938Z","level":"INFO","msg":"Deploying","service":"s1","hosts":null,"target":"example-web-1","tls":false}
{"time":"2024-11-26T08:47:33.833367109Z","level":"INFO","msg":"Healthcheck succeeded","status":200}
{"time":"2024-11-26T08:47:33.833382878Z","level":"INFO","msg":"Target health updated","target":"example-web-1","success":true,"state":"healthy"}
{"time":"2024-11-26T08:47:33.833398898Z","level":"INFO","msg":"Deployed","service":"s1","hosts":null,"target":"example-web-1"}

The "target failed to become healthy" error should only happen if the timeout elapses before the proxy receives a healthy response, and my guess is that's what's happening here. But the proxy logs should confirm if that's the case. (The successful /up responses are a bit odd, but I wonder if the healthcheck ran past the end of the timeout; I'll look into that.)

The "context cancelled" errors are, as @kwilczynski says, due to the target failing to respond to a healthcheck request in time. So those are expected in this case. The default timeout for each individual healthcheck request is 5s, so those healthcheck requests are being cancelled after 5s and retried. Retrying should continue until the overall deployment timeout elapses (30s by default).

frenkel commented 3 days ago

Apparently the logging settings in my Kamal config are not passed to kamal-proxy, so the kamal-proxy container uses the default Docker json-file logs, which seem to have been cut off at the end of November 23rd. Earlier logs are gone and I cannot see the lines you asked for. As soon as it occurs again I'll get those logs and post them here.

By the way: should kamal-proxy honor the log settings in the Kamal configuration file? If so, should I file a bug there?

kevinmcconnell commented 3 days ago

As soon as it occurs again I'll get those logs and post them here.

Sounds good, thanks! I think it's likely that when you get the failure, it's because the app took too long to become healthy. But when we have those logs we'll be able to confirm.

I've also found that the logs do indeed sometimes show some additional healthchecks after the timeout elapses. Those checks are fairly harmless (their result is ignored) but they make it confusing to debug situations like this one. I'll have a fix for that in place shortly.

By the way: should kamal-proxy honor the log settings in the Kamal configuration file? If so, should I file a bug there?

It's worth checking with the Kamal project about this. I think it makes sense for the log settings to apply everywhere; however, a single proxy container can be shared by multiple apps, so there may be some complications to consider. So it's worth raising the question, I think.