EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0

Remote DB container issues #968

Open ggefgreh opened 1 month ago

ggefgreh commented 1 month ago

Hi, I think this is more a question (skill issue) than a problem with this tool.

Config

It is possible that I misunderstood the documentation, but I think it is possible to configure barman like this:

configuration_files_directory = /etc/barman.d
barman_home = /var/lib/barman
log_file = /var/log/barman/barman.log
log_level = DEBUG
compression = gzip
immediate_checkpoint = true
basebackup_retry_times = 3
basebackup_retry_sleep = 30
check_timeout = 60
last_backup_maximum_age = 1 DAYS

Server config:

[server1]
ssh_command = ssh postgres@192.168.76.10 -p 5422
conninfo = host=192.168.76.10 port=5432 user=dbadmin dbname=my-db
streaming_conninfo = host=192.168.76.10 port=5432 user=streaming_barman
backup_method = rsync
reuse_backup = link
backup_options = concurrent_backup
streaming_archiver = on
slot_name = barman
create_slot = auto
parallel_jobs = 2

Problem

The main problem I see is that I cannot perform a backup, but it seems like the underlying problem is a misconfiguration or something else with WAL. Also, when I execute the check command, it looks like it crashes the database, which goes into recovery mode (and then recovers itself). The same happens when trying to perform a backup, but I don't see anything wrong in the postgres container logs, only this line when trying a backup or check:

LOG: server process (PID 1858) exited with exit code 255

Here are the results of check:

$ barman check server1
Server server1:
	WAL archive: FAILED (please make sure WAL shipping is setup)
	PostgreSQL: OK
	superuser or standard user with backup privileges: OK
	PostgreSQL streaming: OK
	wal_level: OK
	replication slot: OK
	directories: OK
	retention policy settings: OK
	backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
	backup minimum size: OK (0 B)
	wal maximum age: OK (no last_wal_maximum_age provided)
	wal size: OK (0 B)
	compression settings: OK
	failed backups: FAILED (there are 7 failed backups)
	minimum redundancy requirements: OK (have 0 backups, expected at least 0)
	ssh: OK (PostgreSQL server)
	systemid coherence: OK
	pg_receivexlog: OK
	pg_receivexlog compatible: OK
	receive-wal running: OK
	archiver errors: OK

The backup command reports either that the backup cannot be performed or that the database is in recovery.

P.S.

I have already spent too much time on this. Before this I tried pgbackrest, but I gave up on that. I would be grateful even for a pointer toward further steps. I am fairly new to postgres and have never performed any backup with it.

martinmarques commented 1 month ago

As you are trying to set up streaming of WALs, you should set archiver = off so it doesn't expect WALs to come in via archive_command.
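For reference, the streaming-only setup described above would look roughly like this in the server section (a sketch with option names from the Barman documentation, not a complete config):

```ini
; ship WALs via streaming replication only
streaming_archiver = on
slot_name = barman
create_slot = auto
; do not expect WALs to arrive via archive_command
archiver = off
```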

Regarding the postgres server crashing, you'll have to find more information inside the container, probably increasing the log level or checking for core dumps. I would also check if you can connect directly to the database and run queries on it. barman check just connects via psycopg2 (normal libpq connection) and runs check queries. I can't think of a reason why it would crash.
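To sanity-check connectivity independently of barman, something like this from the Barman host might help (parameters copied from the conninfo/streaming_conninfo above; a sketch, not verified against this setup):

```shell
# plain libpq connection, like the one barman check uses
psql "host=192.168.76.10 port=5432 user=dbadmin dbname=my-db" -c "SELECT version();"

# physical replication connection, like the one pg_receivewal uses
psql "host=192.168.76.10 port=5432 user=streaming_barman replication=yes" -c "IDENTIFY_SYSTEM;"
```

If either of these alone reproduces the crash, the problem lies in the server/container rather than in barman.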

ggefgreh commented 1 month ago

Okay, I recreated the slot in the meantime and all of a sudden the WAL archive row is gone from the check output, but I added archiver = off as you advised, just to be sure. From time to time, after backup attempts, when I get "database system is in recovery", it comes back as FAILED. Mainly, now I am getting this from check:

$ barman check server1
Server server1:
	PostgreSQL: OK
	superuser or standard user with backup privileges: OK
	PostgreSQL streaming: OK
	wal_level: OK
	replication slot: OK
	directories: OK
	retention policy settings: OK
	backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
	backup minimum size: OK (0 B)
	wal maximum age: OK (no last_wal_maximum_age provided)
	wal size: OK (0 B)
	compression settings: OK
	failed backups: FAILED (there are 1 failed backups)
	minimum redundancy requirements: FAILED (have 0 backups, expected at least 3)
	ssh: OK (PostgreSQL server)
	systemid coherence: OK
	pg_receivexlog: OK
	pg_receivexlog compatible: OK
	receive-wal running: OK
	archiver errors: OK

So it seems good from that perspective.

To address what you told me to check:

I think it is due to there being no WALs in the wals dir, because I see this in the barman logs:

barman.wal_archiver INFO: No xlog segments found from streaming for server1

martinmarques commented 1 month ago

To get WALs into the wals directory, run the following SQL on the postgres node as a superuser: SELECT pg_switch_wal();. Then, on the barman server, run barman archive-wal or wait for the next barman cron run. That should produce a WAL switch, and the WAL should get archived into the wals directory. That .partial file is from the streaming of WALs. I don't think the crash has anything to do with no WALs being available in the wals directory.
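Spelled out, the round trip above could look like this (the barman_home path is taken from the config earlier in the thread; prompts and usernames are illustrative):

```shell
# on the postgres node, as a superuser: close the current WAL segment
psql -U postgres -c "SELECT pg_switch_wal();"

# on the barman server: archive completed segments now instead of waiting for cron
barman archive-wal server1

# verify that the segment landed in the wals directory
ls /var/lib/barman/server1/wals/
```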

ggefgreh commented 1 month ago

Okay, I did what you said and it managed to start a backup, but the backup was interrupted by a connection termination that could not be re-established. I can't provide logs and output because I already cleaned them up; it was a while ago and I hoped it wasn't a one-time event, but it was. I was looking into setting a longer keep-alive time, but I am not sure whether that is a postgres setting or a system ssh setting.

For now, I enabled logs for connections and disconnections and there this happens when performing a backup:

LOCATION:  set_authn_id, auth.c:371
[11297] LOG:  00000: replication connection authorized: user=streaming_barman
[11297] LOCATION:  PerformAuthentication, postinit.c:296
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[11297] LOG:  00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION:  exec_replication_command, walsender.c:1771
[11297] STATEMENT:  IDENTIFY_SYSTEM
[1] LOG:  00000: server process (PID 11309) exited with exit code 255
[1] LOCATION:  LogChildExit, postmaster.c:3737
[1] LOG:  00000: terminating any other active server processes
[1] LOCATION:  HandleChildCrash, postmaster.c:3504
[11297] LOG:  00000: disconnection: session time: 0:00:00.170 user=streaming_barman database= host=192.168.76.14 port=34568
[11297] LOCATION:  log_disconnections, postgres.c:5029
[11311] LOG:  00000: connection received: host=192.168.76.14 port=34584
[11311] LOCATION:  BackendInitialize, postmaster.c:4400
[11311] FATAL:  57P03: the database system is in recovery mode
[11311] LOCATION:  ProcessStartupPacket, postmaster.c:2410
[1] LOG:  00000: all server processes terminated; reinitializing
[1] LOCATION:  PostmasterStateMachine, postmaster.c:4018
[11312] LOG:  00000: database system was interrupted; last known up at 2024-07-26 17:59:57 UTC
[11312] LOCATION:  StartupXLOG, xlog.c:4962
[11315] LOG:  00000: connection received: host=192.168.76.14 port=34590
[11315] LOCATION:  BackendInitialize, postmaster.c:4400
[11315] FATAL:  57P03: the database system is in recovery mode
LOCATION:  ProcessStartupPacket, postmaster.c:2410
LOG:  00000: database system was not properly shut down; automatic recovery in progress
[11312] LOCATION:  InitWalRecovery, xlogrecovery.c:896
[11312] LOG:  00000: redo starts at 3/720000D8
[11312] LOCATION:  PerformWalRecovery, xlogrecovery.c:1659
[11312] LOG:  00000: invalid record length at 3/7200A300: wanted 24, got 0
[11312] LOCATION:  ReadRecord, xlogrecovery.c:3082
[11312] LOG:  00000: redo done at 3/7200A2C8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
[11312] LOCATION:  PerformWalRecovery, xlogrecovery.c:1796
[11313] LOG:  00000: checkpoint starting: end-of-recovery immediate wait
[11313] LOCATION:  LogCheckpointStart, xlog.c:6093
[11313] LOG:  00000: checkpoint complete: wrote 47 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=40 kB, estimate=40 kB
[11313] LOCATION:  LogCheckpointEnd, xlog.c:6174
[1] LOG:  00000: database system is ready to accept connections

As far as I can read it (and I don't really know what it should look like), it seems like a streaming problem (the timing matches the backup command execution). I also didn't observe any abnormality in CPU/disk usage or anything else while performing a backup, only all DB-related processes dying during the crash.

Is it normal that it has so many IDENTIFY_SYSTEM executions?

And of course, it is not related to an empty wals directory, because it is no longer empty. I will try to explore some more logs, and maybe start over from the beginning and check the configuration for the 99th time.

martinmarques commented 1 month ago

Is it normal that it has so many IDENTIFY_SYSTEM executions?

Those are streaming connections, where pg_receivewal asks the server to identify itself before it starts streaming:

https://www.postgresql.org/docs/current/protocol-replication.html

I can't find anything useful in those logs. You'll probably have to increase the log_level in Postgres.
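One way to do that, assuming you can edit postgresql.conf inside the container, is to raise the logging verbosity with standard Postgres settings (revert them after diagnosing):

```ini
# postgresql.conf - temporary debugging verbosity
log_min_messages = debug1        # more detail from backends and the postmaster
log_error_verbosity = verbose    # include SQLSTATE and source file/line
log_connections = on
log_disconnections = on
```

A core dump from the crashing backend (check ulimit -c and the container's core pattern) would pinpoint the cause faster than logs.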

My suggestion is to bring this to the pgsql-general mailing list or the Postgres Slack/Discord channels.