Postgres restarting when archive_command fails

sumanmichael commented 1 year ago

Issue Description

Current Behavior: When the archive command fails, the PostgreSQL database/container restarts intermittently. Note: This behavior is not observed in bare metal/non-Docker installations of PostgreSQL.

Expected Behavior: The PostgreSQL database/container should handle failures of the archive command gracefully, without causing intermittent restarts.

Steps to Reproduce

In docker-compose.yml, add flags to postgres command archive_mode=on, archive_command=/bin/false (here, /bin/false is used to simulate any failing command) and log_min_messages=DEBUG5 (for verbose logs)

version: "3.8"
services:
  pgdb:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - 5433:5432
    command:
      - "postgres"
      - "-c"
      - "archive_mode=on"
      - "-c"
      - "archive_command=/bin/false"
      - "-c"
      - "log_min_messages=DEBUG5"

Now, Start the PostgreSQL container using Docker Compose.

Logs:

In logs, I see archiver process shutting down is followed by a database system is shut down and then server stopped (marked below lines with ---> )

2023-06-29 19:31:09 2023-06-29 14:01:09.885 UTC [42] DEBUG:  executing archive command "/bin/false"
2023-06-29 19:31:09 2023-06-29 14:01:09.886 UTC [42] LOG:  archive command failed with exit code 1
2023-06-29 19:31:09 2023-06-29 14:01:09.886 UTC [42] DETAIL:  The failed archive command was: /bin/false
2023-06-29 19:31:10 2023-06-29 14:01:10.891 UTC [42] DEBUG:  executing archive command "/bin/false"
2023-06-29 19:31:10 2023-06-29 14:01:10.897 UTC [42] LOG:  archive command failed with exit code 1
2023-06-29 19:31:10 2023-06-29 14:01:10.897 UTC [42] DETAIL:  The failed archive command was: /bin/false
2023-06-29 19:31:11 .2023-06-29 14:01:11.900 UTC [42] DEBUG:  executing archive command "/bin/false"
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] LOG:  archive command failed with exit code 1
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DETAIL:  The failed archive command was: /bin/false
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] WARNING:  archiving write-ahead log file "000000010000000000000001" failed too many times, will try again later
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  shmem_exit(0): 5 before_shmem_exit callbacks to make
---> 2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  archiver process shutting down
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  proc_exit(0): 1 callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  exit(0)
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.901 UTC [42] DEBUG:  proc_exit(-1): 0 callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.903 UTC [36] DEBUG:  reaping dead processes
2023-06-29 19:31:11 2023-06-29 14:01:11.903 UTC [36] DEBUG:  shmem_exit(0): 0 before_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.903 UTC [36] DEBUG:  shmem_exit(0): 5 on_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.903 UTC [36] DEBUG:  cleaning up orphaned dynamic shared memory with ID 3179138182
2023-06-29 19:31:11 2023-06-29 14:01:11.903 UTC [36] DEBUG:  cleaning up dynamic shared memory control segment with ID 1646515230
2023-06-29 19:31:11 2023-06-29 14:01:11.908 UTC [36] DEBUG:  proc_exit(0): 2 callbacks to make
---> 2023-06-29 19:31:11 2023-06-29 14:01:11.908 UTC [36] LOG:  database system is shut down
2023-06-29 19:31:11 2023-06-29 14:01:11.908 UTC [36] DEBUG:  exit(0)
2023-06-29 19:31:11 2023-06-29 14:01:11.908 UTC [36] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.908 UTC [36] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2023-06-29 19:31:11 2023-06-29 14:01:11.908 UTC [36] DEBUG:  proc_exit(-1): 0 callbacks to make
2023-06-29 19:31:11  done
---> 2023-06-29 19:31:11 server stopped

Complete Logs are available here: pgdb-docker.log

Additional Information: Docker Engine Version: 24.0.2 Docker Compose Version: v2.18.1

tianon commented 1 year ago

That server stopped line is actually unrelated to the archiver -- just a couple more lines in your log you'll see PostgreSQL init process complete; ready for start up. (and the server starting back up again, after which it does not shut down again). :sweat_smile:

(I used docker run --name postgres -it --rm --env POSTGRES_PASSWORD=foo postgres -c archive_mode=on -c archive_command=/bin/false -c log_min_messages=DEBUG5 and let it run for quite a while without issue beyond the expected constant complaints about the archive command failing before finally shutting it down myself :sweat_smile:)

sumanmichael commented 1 year ago

Thank you @tianon for taking the time to test and provide feedback.

If the server stopped line is unrelated to the archiver, I’m curious to know what is causing it. It is concerning because it results in the existing connections being dropped. I’ve observed that it typically occurs once, usually during the first or second set of failed archive commands.

To clarify, while the database does restart and recover, my main concern is the brief interruption in existing connections during this process.

I appreciate your assistance, and I want to mention that I am still able to reproduce this scenario. Please let me know if there’s any additional information or insights you can provide regarding this issue.

yosifkit commented 1 year ago

The server stopped message is caused by the entrypoint and its initialization logic. It is the one printing PostgreSQL init process complete; ready for start up.: https://github.com/docker-library/postgres/blob/cba2a05c03706daf5f9a66b93a447540b62df063/docker-entrypoint.sh#L332-L336

On first start (when the database hasn't yet been initialized), the entrypoint script goes through these steps (the _main function):

setup filesystem (like make sure the $PGDATA directory exists; fix permissions if possible)
drop down to postgres user (if container was started as root)
run initdb (postgres binary to initialize database files)
add $POSTGRES_HOST_AUTH_METHOD (or default password_encryption) to pg_hba.conf
start a temporary postgres server (that only listens on a unix socket and localhost, so no external connections)
create $POSTGRES_DB database
run user's scripts and sql files from /docker-entrypoint-initdb.d/
stop the temporary postgres server
exec (so that postgres is pid 1 of the container) to run PostgreSQL server, but this time listening for remote connections

Only steps 1, 2 and 9 are done if the database files already exist.

Because the temporary server only listens internally and only happens on first initialization, it cannot affect any external connections. External connections to postgres cannot happen until initialization is fully complete.

Side note: you will eventually need a volume for /var/lib/postgresql/data so that you can control the lifecycle of your database files.

sumanmichael commented 1 year ago

Thanks @yosifkit..! This helped me to debug my case further.

In my case, it's actually crashing due to pgbackrest archive-push ... command being called as archive_command in async mode. I've changed to sync mode. It worked.

Closing the issue.

docker-library / postgres