Open ggefgreh opened 1 month ago
As you are trying to set up streaming of WALs, you should set archiver = off
so it doesn't expect WALs to come in via archive_command
.
Regarding the postgres server crashing, you'll have to find more information inside the container, probably increasing the log level or checking for core dumps. I would also check if you can connect directly to the database and run queries on it. barman check
just connects via psycopg2 (normal libpq connection) and runs check queries. I can't think of a reason why it would crash.
okay, I recreated the slot in the meantime and all of a sudden WAL archive
row is gone from the check
, but I added that archiver = off
as you advised, just to be sure. From time to time, after backup
tries, when getting that database system is in recovery it comes back as FAILED.
Mainly, now I am getting this in check
:
$ barman check server1
Server server1:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
backup minimum size: OK (0 B)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (0 B)
compression settings: OK
failed backups: FAILED (there are 1 failed backups)
minimum redundancy requirements: FAILED (have 0 backups, expected at least 3)
ssh: OK (PostgreSQL server)
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
So it seems good from that perspective.
To address what you told me to check:
In terms of interesting things, WALs are not being saved properly on a backup server - I mean there is a file .partial
but the directory wals
is empty, however, I found something in the logs:
LOG: 00000: server process (PID 2471) exited with exit code 255
LOG: 00000: terminating any other active server processes
LOCATION: HandleChildCrash, postmaster.c:3504
LOG: 00000: all server processes terminated; reinitializing
LOCATION: PostmasterStateMachine, postmaster.c:4018
LOG: 00000: database system was interrupted; last known up at <DATETIME>
LOCATION: StartupXLOG, xlog.c:4962
LOG: 00000: database system was not properly shut down; automatic recovery in progress
LOCATION: InitWalRecovery, xlogrecovery.c:896
LOG: 00000: redo starts at 3/63329DC0
LOCATION: PerformWalRecovery, xlogrecovery.c:1659
LOG: 00000: unexpected pageaddr 3/5E38C000 in log segment 000000010000000300000063, offset 3719168
LOCATION: ReadRecord, xlogrecovery.c:3082
LOG: 00000: redo done at 3/6338BF78 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
LOCATION: PerformWalRecovery, xlogrecovery.c:1796
LOG: 00000: checkpoint starting: end-of-recovery immediate wait
LOCATION: LogCheckpointStart, xlog.c:6093
When trying to perform a backup
with result:
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check qursorrtls'
I think it is due to no WALs in wals
dir. Because I see this in barman logs:
barman.wal_archiver INFO: No xlog segments found from streaming for server1
To get WALs in the wals
directory run the following SQL on the postgres node as a superuser: SELECT pg_switch_wal();
. Then on the barman server run barman archive-wal
or wait for the next barman cron
to run. That should produce a WAL switch and the WAL would get archived in the wals
directory.
That .partial
file is from the streaming of WALs.
I don't think the crash has anything to do with no WALs available in the wals
directory
Okay, I did what you said and it achieved starting a backup but it had been interrupted due to connection termination which couldn't be reconnected. I can't provide logs and output, because I already cleaned them up - it was a while ago and I hoped it wasn't one-time event, but it was. I was looking for setting up longer keep-alive
time, but I am not sure if it is related to postgres
or system ssh
settings.
For now, I enabled logs for connections and disconnections and there this happens when performing a backup
:
LOCATION: set_authn_id, auth.c:371
[11297] LOG: 00000: replication connection authorized: user=streaming_barman
[11297] LOCATION: PerformAuthentication, postinit.c:296
[11297] LOG: 00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION: exec_replication_command, walsender.c:1771
[11297] STATEMENT: IDENTIFY_SYSTEM
[11297] LOG: 00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION: exec_replication_command, walsender.c:1771
[11297] STATEMENT: IDENTIFY_SYSTEM
[11297] LOG: 00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION: exec_replication_command, walsender.c:1771
[11297] STATEMENT: IDENTIFY_SYSTEM
[11297] LOG: 00000: received replication command: IDENTIFY_SYSTEM
[11297] LOCATION: exec_replication_command, walsender.c:1771
[11297] STATEMENT: IDENTIFY_SYSTEM
[1] LOG: 00000: server process (PID 11309) exited with exit code 255
[1] LOCATION: LogChildExit, postmaster.c:3737
[1] LOG: 00000: terminating any other active server processes
[1] LOCATION: HandleChildCrash, postmaster.c:3504
[11297] LOG: 00000: disconnection: session time: 0:00:00.170 user=streaming_barman database= host=192.168.76.14 port=34568
[11297] LOCATION: log_disconnections, postgres.c:5029
[11311] LOG: 00000: connection received: host=192.168.76.14 port=34584
[11311] LOCATION: BackendInitialize, postmaster.c:4400
[11311] FATAL: 57P03: the database system is in recovery mode
[11311] LOCATION: ProcessStartupPacket, postmaster.c:2410
[1] LOG: 00000: all server processes terminated; reinitializing
[1] LOCATION: PostmasterStateMachine, postmaster.c:4018
[11312] LOG: 00000: database system was interrupted; last known up at 2024-07-26 17:59:57 UTC
[11312] LOCATION: StartupXLOG, xlog.c:4962
[11315] LOG: 00000: connection received: host=192.168.76.14 port=34590
[11315] LOCATION: BackendInitialize, postmaster.c:4400
[11315] FATAL: 57P03: the database system is in recovery mode
LOCATION: ProcessStartupPacket, postmaster.c:2410
LOG: 00000: database system was not properly shut down; automatic recovery in progress
[11312] LOCATION: InitWalRecovery, xlogrecovery.c:896
[11312] LOG: 00000: redo starts at 3/720000D8
[11312] LOCATION: PerformWalRecovery, xlogrecovery.c:1659
[11312] LOG: 00000: invalid record length at 3/7200A300: wanted 24, got 0
[11312] LOCATION: ReadRecord, xlogrecovery.c:3082
[11312] LOG: 00000: redo done at 3/7200A2C8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
[11312] LOCATION: PerformWalRecovery, xlogrecovery.c:1796
[11313] LOG: 00000: checkpoint starting: end-of-recovery immediate wait
[11313] LOCATION: LogCheckpointStart, xlog.c:6093
[11313] LOG: 00000: checkpoint complete: wrote 47 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=40 kB, estimate=40 kB
[11313] LOCATION: LogCheckpointEnd, xlog.c:6174
[1] LOG: 00000: database system is ready to accept connections
As far as I can read and I don't know what it looks like, I would see it seems like a streaming problem (the time is right for backup
command execution). I also didn't observe any abnormality in CPU/Disk/Usage while performing a backup
or, anything else. Only dying all DB-related processes while crashing.
Is it normal that it has so many IDENTIFY_SYSTEM
executions?
And of course, it is not related to empty wals
directory, because now it is not empty. I will try to explore some more logs and maybe once again go from the beginning and check the configuration for the 99th time.
Is it normal that it has so many IDENTIFY_SYSTEM executions?
Those are streaming connections where pg_receivewal
requests the server to identify before it starts streaming
https://www.postgresql.org/docs/current/protocol-replication.html
I can't find anything useful in those logs. You'll probably have to increase the log_level in Postgres.
My suggestion is to bring this to the pgsql-general mailing list or the Postgres Slack/Discord channels.
Hi, I think this is more the question (skill issue) than a problem of this tool.
Config
It is possible that I missunderstood the documentation, but I think it is possible to configure barman like that:
Servers
Files
Global barman:
_[barman] barman_user = barman
configuration_files_directory = /etc/barman.d barman_home = /var/lib/barman
log_file = /var/log/barman/barman.log log_level = DEBUG compression = gzip immediate_checkpoint = true basebackup_retry_times = 3 basebackup_retry_sleep = 30 check_timeout = 60 last_backup_maximumage = 1 DAYS
Server config:
[server1] ssh_command = ssh postgres@192.168.76.10 -p 5422 conninfo = host=192.168.76.10 port=5432 user=dbadmin dbname=my-db streaming_conninfo = host=192.168.76.10 port=5432 user=streaming_barman
backup_method = rsync reuse_backup = link backup_options = concurrent_backup
streaming_archiver = on slot_name = barman create_slot = auto
parallel_jobs = 2
Problem
The main problem I see is I cannot perform a backup, but it seems like underlying problem is missconfiguration or something else with WAL. Also when I execute the
check
command, it looks like it crashes database and it is going in recovery mode (recover itself then). It also happens when trying to performbackup
, but I don't see anything wrong in postgres container logs, only that:LOG: server process (PID 1858) exited with exit code 255
, whe trying thatbackup
orcheck
.Here are the results of
check
$ barman check server1 Server server1: WAL archive: FAILED (please make sure WAL shipping is setup) PostgreSQL: OK superuser or standard user with backup privileges: OK PostgreSQL streaming: OK wal_level: OK replication slot: OK directories: OK retention policy settings: OK backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups) backup minimum size: OK (0 B) wal maximum age: OK (no last_wal_maximum_age provided) wal size: OK (0 B) compression settings: OK failed backups: FAILED (there are 7 failed backups) minimum redundancy requirements: OK (have 0 backups, expected at least 0) ssh: OK (PostgreSQL server) systemid coherence: OK pg_receivexlog: OK pg_receivexlog compatible: OK receive-wal running: OK archiver errors: OKThe results of
backup
gives either information that it cannot be performed or that database is in recovery.P.S.
I already spend too much time on this. Before that I tried
pgbackrest
but I gave up on that. I will be gratefull for even the direction of further steps. I am fairly new to postgres and never performed any backup with it.ssh
paswordlessly between them - I can