basecamp / kamal

Deploy web apps anywhere.
https://kamal-deploy.org
MIT License

Old containers never stop on deploy (container not unhealthy) #915

Closed nikklavzar closed 15 hours ago

nikklavzar commented 2 weeks ago

When running kamal deploy, the deploy fails while checking whether the old container has become unhealthy after the new container boots: container not unhealthy (healthy), retrying in 3s (attempt 3/15)...

This only happens on Traefik-enabled roles. If I manually stop the old container, the deploy process proceeds normally. If not, all attempts fail and the command exits.

What could be preventing the old container from being marked as unhealthy?

I'm on version 1.8.1, but I have been seeing this problem for several versions.

timfsw commented 2 weeks ago

I have noticed the same error. We are currently proceeding as follows for the release:

echo "RAILS_MASTER_KEY=$RAILS_STAGING_KEY" >> .env
kamal env push -d staging
kamal deploy -d staging

djmb commented 1 week ago

@nikklavzar, @timfsw - what are the healthcheck commands for your containers?

You can find it with:

$ docker inspect <container_id> --format '{{ .Config.Healthcheck }}'
{[CMD-SHELL (curl -f http://localhost:3000/up -m 10 || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1s 0s 0s 0}
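
The output above is an unlabelled struct dump, so it can be hard to tell which duration is which. A small sketch that prints the same configuration as labelled JSON, followed by the health status Docker currently reports, might look like this. It relies on the json template function and the .State.Health.Status field of docker inspect; the show_health name and the DOCKER override variable are illustrative assumptions and not part of Kamal.

```shell
# Print the healthcheck configuration (as labelled JSON) and the current
# health status of one container.
# DOCKER is a hypothetical override so the function can be exercised without
# a running daemon; it defaults to the real docker CLI.
show_health() {
  container=$1
  "${DOCKER:-docker}" inspect "$container" --format '{{ json .Config.Healthcheck }}'
  "${DOCKER:-docker}" inspect "$container" --format '{{ .State.Health.Status }}'
}
```

Usage would be show_health <container_id>. If the container has no healthcheck configured, the second command is expected to fail because there is no health state to read.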

nikklavzar commented 1 week ago

Backend: {[CMD-SHELL (curl -f http://localhost:8000/up/ || exit 1) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1m30s 0s 0s 0s 0}

Frontend: {[CMD-SHELL (/frontend/healthcheck.sh) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 1m30s 0s 0s 0s 0}

The /frontend/healthcheck.sh script just does exit 0 at the moment.

timfsw commented 1 week ago

new container

docker inspect 341b1aae4c19 --format '{{ .Config.Healthcheck }}'

{[CMD-SHELL (true) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 10s 0s 0s 0s 0}

old container

docker inspect 47be2532bbc2 --format '{{ .Config.Healthcheck }}'

{[CMD-SHELL (true) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)] 10s 0s 0s 0s 0}

docker output

CONTAINER ID   COMMAND                  CREATED              STATUS                        PORTS   
341b1aae4c19   "/app/bin/docker-ent…"   About a minute ago   Up About a minute (healthy)   3000/tcp
47be2532bbc2   "/app/bin/docker-ent…"   7 minutes ago        Up 7 minutes (unhealthy)      3000/tcp
[
    {
        "Path": "/app/bin/docker-entrypoint",
        "Args": [
            "thrust",
            "bin/rails",
            "server"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 1214642,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2024-09-02T20:55:16.329829975Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 45,
                "Log": [
                    {
                        "Start": "2024-09-02T23:08:10.870066811+02:00",
                        "End": "2024-09-02T23:08:10.928903169+02:00",
                        "ExitCode": 1,
                        "Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
                    },
                    {
                        "Start": "2024-09-02T23:08:20.92943557+02:00",
                        "End": "2024-09-02T23:08:20.975688301+02:00",
                        "ExitCode": 1,
                        "Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
                    },
                    {
                        "Start": "2024-09-02T23:08:30.97686694+02:00",
                        "End": "2024-09-02T23:08:31.032613692+02:00",
                        "ExitCode": 1,
                        "Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
                    },
                    {
                        "Start": "2024-09-02T23:08:41.034067208+02:00",
                        "End": "2024-09-02T23:08:41.0832081+02:00",
                        "ExitCode": 1,
                        "Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
                    },
                    {
                        "Start": "2024-09-02T23:08:51.083956303+02:00",
                        "End": "2024-09-02T23:08:51.143450034+02:00",
                        "ExitCode": 1,
                        "Output": "stat: can't stat '/tmp/kamal-cord/cord': No such file or directory\n"
                    }
                ]
            }
        },
        "HostConfig": {
            "Binds": [
                "/home/ubuntu/.kamal/cords/staging-f6ad33cf1447e80c1168ff33a153ace1:/tmp/kamal-cord"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {
                    "env": "os,customer",
                    "labels": "production_status",
                    "max-file": "3",
                    "max-size": "10m"
                }
            },
            "NetworkMode": "staging",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "unless-stopped",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "ConsoleSize": [
                0,
                0
            ],
            "CapAdd": null,
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": [],
            "BlkioDeviceWriteBps": [],
            "BlkioDeviceReadIOps": [],
            "BlkioDeviceWriteIOps": [],
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": null,
            "PidsLimit": null,
            "Ulimits": [],
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware",
                "/sys/devices/virtual/powercap"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/home/ubuntu/.kamal/cords/staging-f6ad33cf1447e80c1168ff33a153ace1",
                "Destination": "/tmp/kamal-cord",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Healthcheck": {
                "Test": [
                    "CMD-SHELL",
                    "(true) && (stat /tmp/kamal-cord/cord > /dev/null || exit 1)"
                ],
                "Interval": 10000000000
            },
            "Entrypoint": [
                "/app/bin/docker-entrypoint"
            ]
        }
    }
]

kamal config

healthcheck:
  cmd: true
  interval: "10s"
  max_attempts: 6
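
The inspect output above shows why the old container is unhealthy: its healthcheck only passes while /tmp/kamal-cord/cord exists, that path is a bind mount of a host directory under .kamal/cords, and the log reports the file as missing. The sketch below reproduces that behaviour without Docker, using a temporary directory as a stand-in for the cord mount (the directory and variable names are assumptions for illustration only).

```shell
# Stand-in for the bind-mounted cord directory (hypothetical temporary path).
cord_dir=$(mktemp -d)
touch "$cord_dir/cord"

# Same shape as the healthcheck in the inspect output:
# (app check) && (cord check). stderr is silenced here to keep output clean.
healthcheck() {
  (true) && (stat "$cord_dir/cord" > /dev/null 2>&1 || exit 1)
}

healthcheck && before=healthy || before=unhealthy

# Simulate the cord file disappearing, as seen in the old container's log.
rm "$cord_dir/cord"
healthcheck && after=healthy || after=unhealthy

echo "before: $before, after: $after"   # prints "before: healthy, after: unhealthy"
rm -r "$cord_dir"
```

Even with cmd: true, the check fails as soon as the cord file is gone, which matches the stat errors in the log above.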

nikklavzar commented 1 week ago

Never mind, my containers seem to need a long time to be marked unhealthy. I increased the healthcheck attempts to 30 and the deploy succeeded on attempt 22.
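
For anyone who wants to watch this happen outside of a deploy, here is a rough sketch of a polling loop around docker inspect. It is not Kamal's implementation: the function name wait_for_unhealthy, its defaults, and the DOCKER override variable are assumptions for illustration. Only the .State.Health.Status field is taken from the inspect output earlier in the thread.

```shell
# wait_for_unhealthy CONTAINER [MAX_ATTEMPTS] [PAUSE_SECONDS]
# Returns 0 once the container reports "unhealthy", 1 if attempts run out.
# DOCKER is a hypothetical override; it defaults to the real docker CLI.
wait_for_unhealthy() {
  container=$1
  max_attempts=${2:-30}
  pause=${3:-3}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$("${DOCKER:-docker}" inspect "$container" --format '{{ .State.Health.Status }}')
    if [ "$status" = "unhealthy" ]; then
      echo "container unhealthy after $attempt attempt(s)"
      return 0
    fi
    echo "container not unhealthy ($status), retrying in ${pause}s (attempt $attempt/$max_attempts)"
    sleep "$pause"
    attempt=$((attempt + 1))
  done
  return 1
}
```

Calling wait_for_unhealthy 47be2532bbc2 30 3 would poll the old container from the earlier output up to 30 times, 3 seconds apart.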

Cluster444 commented 4 days ago

@nikklavzar Sounds like something is holding a process open and preventing the shutdown. It's been a long time since I had this issue, but I seem to remember it happening in the old Passenger days.

What is the backend?

nikklavzar commented 13 hours ago

What I didn't realise, but which makes sense now, is that the healthcheck interval also affects this. With a healthcheck interval of 1 minute, in the worst case the container would only be registered as unhealthy a minute after it actually became unhealthy.
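
A back-of-the-envelope way to estimate that delay, as a sketch: Docker only marks a container unhealthy after a number of consecutive failed checks. Its documented default is 3 retries, which is assumed to apply here because the inspect output shows the retries value as 0. Ignoring check duration and timeouts, the upper bound is then roughly the interval multiplied by the retries. The function name is hypothetical.

```shell
# worst_case_unhealthy_delay INTERVAL_SECONDS [RETRIES]
# Rough upper bound, in seconds, between a check starting to fail and Docker
# reporting the container as unhealthy. Ignores check duration and timeouts.
worst_case_unhealthy_delay() {
  interval=$1
  retries=${2:-3}   # assumption: Docker's documented default of 3 retries
  echo $((interval * retries))
}

worst_case_unhealthy_delay 10   # 10s interval from the kamal config: prints 30
worst_case_unhealthy_delay 90   # 1m30s interval from the Backend/Frontend output: prints 270
```

Under these assumptions a 1m30s interval can take several minutes to register, so a short interval on Traefik-enabled roles should keep the wait during deploys small.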