Closed: nikklavzar closed this issue 15 hours ago
I have noticed the same error. For releases we currently work around it as follows:
echo "RAILS_MASTER_KEY=$RAILS_STAGING_KEY" >> .env
kamal env push -d staging
kamal deploy -d staging
@nikklavzar, @timfsw - what are the healthcheck commands for your containers?
You can find it with:
$ docker inspect <container_id> --format '{{ .Config.Healthcheck }}'
{[CMD-SHELL (curl -f http://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1s 0s 0s 0}
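The default `{{ .Config.Healthcheck }}` template prints Go's struct formatting (the `{[CMD-SHELL ...] 1s 0s 0s 0}` shape above), which is awkward to parse. Asking for `{{json .Config.Healthcheck}}` gives structured output instead. A self-contained sketch, using a captured sample in place of a live container:

```shell
# On a live host you would run:
#   docker inspect <container_id> --format '{{json .Config.Healthcheck}}'
# Here a captured sample stands in for that output so the snippet runs anywhere.
hc='{"Test":["CMD-SHELL","(curl -f http://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)"],"Interval":1000000000}'

# Pull out just the healthcheck command (the second Test element) with sed:
printf '%s\n' "$hc" | sed -e 's/.*"CMD-SHELL","//' -e 's/"\],.*//'
```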
@nikklavzar, @timfsw - what are the healthcheck commands for your containers?
You can find it with:
$ docker inspect <container_id> --format '{{ .Config.Healthcheck }}'
{[CMD-SHELL (curl -f http://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1s 0s 0s 0}
Backend:
{[CMD-SHELL (curl -f http://localhost:8000/up/ || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1m30s 0s 0s 0s 0}
Frontend:
{[CMD-SHELL (/frontend/healthcheck.sh) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1m30s 0s 0s 0s 0}
The /frontend/healthcheck.sh script just does exit 0 at the moment.
@nikklavzar, @timfsw - what are the healthcheck commands for your containers?
You can find it with:
$ docker inspect <container_id> --format '{{ .Config.Healthcheck }}'
{[CMD-SHELL (curl -f http://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1s 0s 0s 0}
docker inspect 341b1aae4c19 --format '{{ .Config.Healthcheck }}'
{[CMD-SHELL (true) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 10s 0s 0s 0s 0}
docker inspect 47be2532bbc2 --format '{{ .Config.Healthcheck }}'
{[CMD-SHELL (true) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 10s 0s 0s 0s 0}
CONTAINER ID COMMAND CREATED STATUS PORTS
341b1aae4c19 "/app/bin/docker-ent…" About a minute ago Up About a minute (healthy) 3000/tcp
47be2532bbc2 "/app/bin/docker-ent…" 7 minutes ago Up 7 minutes (unhealthy) 3000/tcp
[
{
"Path": "/app/bin/docker-entrypoint",
"Args": [
"thrust",
"bin/rails",
"server"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 1214642,
"ExitCode": 0,
"Error": "",
"StartedAt": "2024-09-02T20:55:16.329829975Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": {
"Status": "unhealthy",
"FailingStreak": 45,
"Log": [
{
"Start": "2024-09-02T23:08:10.870066811+02:00",
"End": "2024-09-02T23:08:10.928903169+02:00",
"ExitCode": 1,
"Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
},
{
"Start": "2024-09-02T23:08:20.92943557+02:00",
"End": "2024-09-02T23:08:20.975688301+02:00",
"ExitCode": 1,
"Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
},
{
"Start": "2024-09-02T23:08:30.97686694+02:00",
"End": "2024-09-02T23:08:31.032613692+02:00",
"ExitCode": 1,
"Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
},
{
"Start": "2024-09-02T23:08:41.034067208+02:00",
"End": "2024-09-02T23:08:41.0832081+02:00",
"ExitCode": 1,
"Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
},
{
"Start": "2024-09-02T23:08:51.083956303+02:00",
"End": "2024-09-02T23:08:51.143450034+02:00",
"ExitCode": 1,
"Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
}
]
}
},
"HostConfig": {
"Binds": [
"/home/ubuntu/.kamal/cords/staging-f6ad33cf1447e80c1168ff33a153ace1:/tmp/kamal-cord"
],
"ContainerIDFile": "",
"LogConfig": {
"Type": "json-file",
"Config": {
"env": "os,customer",
"labels": "production_status",
"max-file": "3",
"max-size": "10m"
}
},
"NetworkMode": "staging",
"PortBindings": {},
"RestartPolicy": {
"Name": "unless-stopped",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"ConsoleSize": [
0,
0
],
"CapAdd": null,
"CapDrop": null,
"CgroupnsMode": "private",
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": null,
"GroupAdd": null,
"IpcMode": "private",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": false,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": null,
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 67108864,
"Runtime": "runc",
"Isolation": "",
"CpuShares": 0,
"Memory": 0,
"NanoCpus": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": [],
"BlkioDeviceReadBps": [],
"BlkioDeviceWriteBps": [],
"BlkioDeviceReadIOps": [],
"BlkioDeviceWriteIOps": [],
"CpuPeriod": 0,
"CpuQuota": 0,
"CpuRealtimePeriod": 0,
"CpuRealtimeRuntime": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DeviceCgroupRules": null,
"DeviceRequests": null,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": null,
"OomKillDisable": null,
"PidsLimit": null,
"Ulimits": [],
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0,
"MaskedPaths": [
"/proc/asound",
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware",
"/sys/devices/virtual/powercap"
],
"ReadonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
},
"Mounts": [
{
"Type": "bind",
"Source": "/home/ubuntu/.kamal/cords/staging-f6ad33cf1447e80c1168ff33a153ace1",
"Destination": "/tmp/kamal-cord",
"Mode": "",
"RW": true,
"Propagation": "rprivate"
}
],
"Config": {
"Healthcheck": {
"Test": [
"CMD-SHELL",
"(true) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)"
],
"Interval": 10000000000
},
"Entrypoint": [
"/app/bin/docker-entrypoint"
]
},
}
]
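For context on the failing probes in the log above: as I understand it, Kamal bind-mounts a per-container "cord" directory (the `Binds` entry) and appends a `stat /tmp/kamal-cord/cord` check to the app healthcheck; during a deploy it removes the cord file on the host so the old container's probes start failing and Docker eventually flips it to unhealthy. A local simulation of that flip (paths illustrative, no Docker needed):

```shell
# Simulate Kamal's cord mechanism: health depends on a host-mounted file.
corddir=$(mktemp -d)
touch "$corddir/cord"

healthcheck() {
  # Mirrors the container's CMD-SHELL: (app check) && (cord check)
  (true) && (stat "$corddir/cord" > /dev/null 2>&1 || exit 1)
}

healthcheck && echo "healthy"     # cord present -> probe passes

rm "$corddir/cord"                # roughly what Kamal does to the old container's cord
healthcheck || echo "unhealthy"   # probe fails until Docker's retries mark it unhealthy
```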
healthcheck:
cmd: true
interval: "10s"
max_attempts: 6
Never mind, my containers just seem to take a long time to be marked unhealthy. I increased the healthcheck attempts to 30 and the deploy succeeded on attempt 22.
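For reference, the bump described above would look roughly like this in the kamal 1.x healthcheck config (values illustrative):

```yaml
healthcheck:
  cmd: "true"
  interval: "10s"
  max_attempts: 30   # raised from 6 so kamal polls longer before giving up
```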
@nikklavzar Sounds like something is holding a process open and not allowing a shutdown to occur. It's been forever since I had this issue, but I seem to remember that happening back in the old Passenger days.
What is the backend?
What I didn't realise, but makes sense now, is that the healthcheck interval also affects this: with the interval set to 1 minute, in the worst case the container could be registered as unhealthy up to a minute after it actually became unhealthy.
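A back-of-the-envelope sketch of that worst case (assuming Docker's default of 3 retries when none is configured; numbers are illustrative):

```shell
# Worst-case detection latency: the container fails right after a probe, so
# nearly a full interval passes before the first failing probe, and Docker
# only flips the status after `retries` consecutive failures.
interval=60   # healthcheck interval in seconds (1 minute, as in the comment above)
retries=3     # Docker's default Retries when unset
worst_case=$((interval * retries))
echo "${worst_case}s"   # -> 180s before Docker reports unhealthy
```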
When running kamal deploy, the deploy fails while checking whether the old container has become unhealthy after booting the new one:
container not unhealthy (healthy), retrying in 3s (attempt 3/15)...
This only happens on Traefik-enabled roles. If I manually stop the old container, the deploy proceeds normally. If not, all attempts fail and the command exits.
What could cause the old container not to be marked as unhealthy?
I'm on version 1.8.1, but this problem has been occurring for several versions.