docker-archive / compose-cli

Easily run your Compose application to the cloud with compose-cli
Apache License 2.0

Healthcheck for ECS fails when part of the db server initialisation process stops and restarts #2202

Closed ahmed2m closed 1 year ago

ahmed2m commented 1 year ago

Description

I have a docker-compose setup that includes a Postgres database service. Part of the Postgres server's initialisation process is stopping and then restarting the server.

Below is part of a normal Postgres log; I briefly saw fragments of it in the ECS logs on the console before compose-cli deleted everything.

local-postgres9.5 | LOG:  received fast shutdown request
local-postgres9.5 | LOG:  aborting any active transactions
local-postgres9.5 | LOG:  autovacuum launcher shutting down
local-postgres9.5 | LOG:  shutting down
local-postgres9.5 | waiting for server to shut down....LOG:  database system is shut down
local-postgres9.5 |  done
local-postgres9.5 | server stopped
local-postgres9.5 |
local-postgres9.5 | PostgreSQL init process complete; ready for start up.
local-postgres9.5 |
local-postgres9.5 | LOG:  database system was shut down at 2016-05-16 16:51:55 UTC
local-postgres9.5 | LOG:  MultiXact member wraparound protections are now enabled
local-postgres9.5 | LOG:  database system is ready to accept connections

(Log excerpt taken from this SO question, as I wasn't fast enough to copy it from the console.)

Steps to reproduce the issue:

  1. Run `docker compose --project-name name --file file.yml up` with a docker-compose file similar to mine.

Describe the results you received: After the DbService is created successfully, and while three other services are still being created, the health check for DbService fails. Everything is then deleted, and the error I get is: `DbService ServiceSchedulerInitiated: Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:us-east-1:[...])`

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker-compose --version: Docker Compose version 2.12.2

Output of docker version:

Client:
 Cloud integration: v1.0.29
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.19.2
 Git commit:        baeda1f82a
 Built:             Thu Oct 27 21:30:31 2022
 OS/Arch:           linux/amd64
 Context:           myecscontext
 Experimental:      true

Server:
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.2
  Git commit:       3056208812
  Built:            Thu Oct 27 21:29:34 2022
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          v1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f.m
 runc:
  Version:          1.1.4
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker context show:

[
    {
        "Name": "myecscontext",
        "Metadata": {
            "Type": "ecs"
        },
        "Endpoints": {
            "docker": {
                "SkipTLSVerify": false
            },
            "ecs": {
                "Profile": "default"
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/home/ahmed/.docker/contexts/meta/8bb20fb47c4248774ab660063891d2332facb5997d914f61b3bdfeb471eaddba",
            "TLSPath": "/home/ahmed/.docker/contexts/tls/8bb20fb47c4248774ab660063891d2332facb5997d914f61b3bdfeb471eaddba"
        }
    }
]

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., v2.12.2)

Server:
 Containers: 15
  Running: 0
  Paused: 0
  Stopped: 15
 Images: 87
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f.m
 runc version: 
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.17.9-1-MANJARO
 Operating System: Manjaro Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.06GiB
 Name: probook
 ID: WFF4:4RZ7:KAVK:OCJC:BUIH:CNV6:4LT2:E3QF:YU6D:YQJN:5ZFA:S3B2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS ECS, Azure ACI, local, etc.): My docker-compose file generates about 88 tasks, and some of the services take a while to start up.

version: "3"

networks:
  the-network:
    driver: bridge

services:
  db:
    image: postgres:14.5
    env_file:
      - .env
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - postgres_backups:/backups
    networks:
      - the-network
    ports:
      - "5435:5435"

  redis:
    image: redis:5.0

  api: &api
    image: example.com:5050/repo/api:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    build:
      context: backend
    env_file:
      - .env
    environment:
      DJANGO_SETTINGS_MODULE: "api.settings.testing_env"
    volumes:
      - media_storage:/storage/django_media/:rw
      - static_storage:/storage/django_static/:rw
    networks:
      - the-network
    ports:
      - "8000:8000"
    depends_on:
      - db
      - redis

  celeryworker:
    <<: *api
    image: example.com:5050/repo/celeryworker:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    ports:
      - "8001:8001"
    command: bash -c "cd api;
      poetry run celery -A api worker -l info"
    depends_on:
      - api

  celerybeat:
    <<: *api
    image: example.com:5050/repo/celerybeat:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    networks:
      - the-network
    ports:
      - "8002:8002"
    command: bash -c "cd api;
      poetry run celery -A api beat -l info --pidfile=''"
    depends_on:
      - api

  frontend:
    image: example.com:5050/repo/frontend:${VERSION}
    x-aws-pull_credentials: arn:aws:secretsmanager:us-east-1:[...]
    build:
      context: frontend
      dockerfile: Dockerfile.prod
    networks:
      - the-network
    ports:
      - "3000:3000"
    env_file:
      - .env

  nginx:
    image: nginx:latest
    build:
      context: nginx
      args:
        API_INTERNAL_HOST: api:8000
        WEB_INTERNAL_HOST: frontend:3000
        API_HOSTNAME: ${API_HOSTNAME}
        WEB_HOSTNAME: ${WEB_HOSTNAME}
    command: '/bin/sh -c ''while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g "daemon off;"'''
    depends_on:
      - api
      - frontend
    volumes:
      - static_storage:/storage/django_static/:rw
      - media_storage:/storage/django_media/:rw
    env_file:
      - .env
    networks:
      - the-network
    ports:
      - "80:80"

volumes:
  postgres_data: {}
  postgres_backups: {}
  static_storage: {}
  media_storage: {}

NOTE: I tried to disable the health check to confirm that everything would work after a successful deployment:

    healthcheck:
      disable: true

That generated `HealthCheck: {}` in the DbTaskDefinition, but then I got this error:

Resource handler returned message: "Invalid request provided: Create TaskDefinition: You must specify a health check command for container 'db' (Service: AmazonECS; Status Code: 400; Error Code: ClientException; Request ID: 865908ff-bb32-4b54-b541-6c5df3cdb0c5; Proxy: null)" (RequestToken: 94984544-ffd6-d54f-bf7a-b942b6ba606a, HandlerErrorCode: InvalidRequest)
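For context on why the empty object is rejected: in an ECS container definition, the `HealthCheck` object, when present, must carry a command. A minimal valid shape (the values below are illustrative, not taken from this deployment) looks like:

```json
"healthCheck": {
    "command": ["CMD-SHELL", "pg_isready -d $POSTGRES_DB || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
}
```

So `disable: true` in the Compose file, which serialises to an empty `HealthCheck: {}`, produces exactly the 400 above.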
ahmed2m commented 1 year ago

If anyone lands here wondering: it's related to the image being a Postgres container — a health check command needs to be defined explicitly, for example:

    healthcheck:
      # CMD-SHELL takes a single shell string, not separate array elements
      test: ["CMD-SHELL", "pg_isready -d ${POSTGRES_DB}"]
      interval: 30s
      timeout: 60s
      retries: 5
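Building on that, here is a sketch of wiring the same check into the `db` service so that dependents wait for it. The `depends_on` long form below is standard Compose syntax (this fragment is illustrative, not from the original thread); whether the ECS integration maps the condition to an ECS container dependency is worth verifying for your version:

```yaml
services:
  db:
    image: postgres:14.5
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d ${POSTGRES_DB}"]
      interval: 30s
      timeout: 60s
      retries: 5

  api:
    depends_on:
      db:
        condition: service_healthy   # start api only once db reports healthy
```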