EnterpriseDB / barman

Barman - Backup and Recovery Manager for PostgreSQL
https://www.pgbarman.org/
GNU General Public License v3.0

Barman can't perform backup (3.9 upgrade issue?) #856

Closed sergei-maertens closed 1 year ago

sergei-maertens commented 1 year ago

Hi - my barman installation no longer appears to work properly after I updated packages on my OS.

I received the following cron output causing me to investigate:

ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check srv1-regex-it-nl-pg14'

My environment:

root@backups:~# apt-cache show barman
Package: barman
Version: 3.9.0-1.pgdg20.04+1
Architecture: all
Maintainer: Marco Nenciarini <mnencia@debian.org>
Installed-Size: 81
Depends: python3-barman (= 3.9.0-1.pgdg20.04+1), python3-pkg-resources, adduser, rsync, python3:any
Recommends: openssh-server, openssh-client, postgresql-client
Suggests: barman-cli, repmgr
Homepage: https://www.pgbarman.org
Priority: optional
Section: database
Filename: pool/main/b/barman/barman_3.9.0-1.pgdg20.04+1_all.deb
Size: 46212
SHA256: a87592b0ef1d2ff8d262c7568f8f5e270e6830e2f9b5ae3b6c9d9defcf200c38
SHA1: 6506861d28267416578fa5d1f1197a90468d3f5b
MD5sum: 64764e1179b7fad2bbd21727c4742ecd
Description-en: Backup and Recovery Manager for PostgreSQL
 Barman (Backup and Recovery Manager) is an open-source
 administration tool for disaster recovery of PostgreSQL
 servers written in Python.
 .
 It allows your organization to perform remote backups of
 multiple servers in business critical environments to
 reduce risk and help DBAs during the recovery phase.
 .
 Barman is distributed under GNU GPL 3 and maintained
 by 2ndQuadrant.
 .
 This package provides barman binary.
Description-md5: d22bbe67949a3c9d16fae95cbb531954

I updated my system (apt-get dist-upgrade) over the weekend and that pulled in barman 3.9.0, which was released two weeks ago.

When I run barman status all, the command hangs partway through its output too:

root@backups:~# barman status all
Server geralt-modelbrouwers-nl-pg13:
    Description: PostgreSQL Database (Streaming-Only)
    Active: True
    Disabled: False

I had a working setup before, with 3 clusters being managed by barman.

Do you have any pointers on what could be wrong or what else I can investigate?

mikewallace1979 commented 1 year ago

Hi @sergei-maertens - which Barman version are you upgrading from?

Could you also run barman check srv1-regex-it-nl-pg14 and barman diagnose and post the output here, after removing any sensitive information from it?

Are there any errors in the barman.log file?

sergei-maertens commented 1 year ago

hi @mikewallace1979 - thanks for getting back so quickly

> which Barman version are you upgrading from?

Checked my apt logs, and this was an upgrade from 3.6 to 3.9:

barman:amd64 (3.6.0-1.pgdg20.04+1, 3.9.0-1.pgdg20.04+1)

> Could you also run barman check srv1-regex-it-nl-pg14 and barman diagnose and post the output here, after removing any sensitive information from it?

I had already run the check, and it hangs in the same way:

root@backups:~# barman check srv1-regex-it-nl-pg14
Server srv1-regex-it-nl-pg14:

(Hitting CTRL+C doesn't have an immediate effect with any of these commands either, so I have to open a new SSH connection.)

barman diagnose also hangs and does not provide any output.

> Are there any errors in the barman.log file?

I'm seeing some new suspicious records now actually:

2023-10-16 12:58:02,664 [120211] barman.wal_archiver INFO: No xlog segments found from streaming for pluksla-regex-it-nl-pg15.
2023-10-16 12:58:38,814 [85504] barman.command_wrappers INFO: geralt-modelbrouwers-nl-pg13: pg_receivewal: finished segment at 1B/B9000000 (timeline 1)
2023-10-16 12:59:01,866 [120232] barman.wal_archiver INFO: No xlog segments found from streaming for pluksla-regex-it-nl-pg15.
2023-10-16 12:59:02,002 [120231] barman.wal_archiver INFO: Found 1 xlog segments from streaming for geralt-modelbrouwers-nl-pg13. Archive all segments in one run.
2023-10-16 12:59:02,002 [120231] barman.wal_archiver INFO: Archiving segment 1 of 1 from streaming: geralt-modelbrouwers-nl-pg13/000000010000001B000000B8
2023-10-16 12:59:02,023 [120233] barman.wal_archiver INFO: No xlog segments found from streaming for srv1-regex-it-nl-pg14.
...
2023-10-16 13:00:08,422 [120204] barman.cli ERROR: Process interrupted by user (KeyboardInterrupt)
2023-10-16 13:01:02,475 [120396] barman.wal_archiver INFO: No xlog segments found from streaming for pluksla-regex-it-nl-pg15.
...
2023-10-16 13:05:02,382 [120484] barman.wal_archiver INFO: No xlog segments found from streaming for srv1-regex-it-nl-pg14.
2023-10-16 13:05:40,190 [120453] barman.server INFO: Check command timed out executing 'PostgreSQL' check
2023-10-16 13:05:40,190 [120453] barman.server ERROR: Check 'check timeout' failed for server 'srv1-regex-it-nl-pg14'
2023-10-16 13:05:40,192 [120453] barman.server ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check srv1-regex-it-nl-pg14'
...
2023-10-16 14:13:36,994 [122211] Command WARNING: No LSB modules are available.
2023-10-16 14:13:37,028 [122211] Command WARNING: Python 2.7.18
2023-10-16 14:13:37,056 [122211] Command WARNING: OpenSSH_8.2p1 Ubuntu-4ubuntu0.9, OpenSSL 1.1.1f  31 Mar 2020

The ellipses stand for omitted "No xlog segments found..." records, which I believe are normal behaviour since there's not a lot of activity on these databases.

edit: the timeout made me check if I can open a telnet connection and I see it's trying to connect over ipv6. Over the weekend I set up DNS for ipv6 so that is probably affecting things - and the remote PG server firewall only allows ipv4. So the problem is most likely on my end :grimacing:

edit2: pg_isready is fine though, and doesn't appear to try to connect over ipv6:

root@backups:~# pg_isready -p 5432 -h srv1.regex-it.nl
srv1.regex-it.nl:5432 - accepting connections

mikewallace1979 commented 1 year ago

The timeout while connecting to PostgreSQL does seem to be the most likely reason for the failure.

Barman connects to PostgreSQL through the psycopg2 library, which pg_isready does not use. There is a report of psycopg2 waiting 127 seconds when attempting an ipv6 connection before falling back to ipv4. I can't find any reference to this in the psycopg2 github repo, but if this is how psycopg2 behaves then it would explain why the default Barman check timeout of 30 seconds is exceeded in your setup, where ipv6 is firewalled and ipv4 isn't.

I haven't been able to verify this report but it might be worth taking a closer look at the ipv6/ipv4 hypothesis - as a test you could try replacing the hostname with the IPv4 address in the conninfo string in the Barman config.
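
To check the ipv6/ipv4 hypothesis independently of Barman and psycopg2, a small probe can attempt a TCP connection to every address the resolver returns and time each attempt. This is only a sketch using the Python standard library; the function name `probe` and the 5-second timeout are illustrative choices, not part of Barman.

```python
import socket
import time

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def probe(host, port, timeout=5.0):
    """Try a TCP connect to each address the resolver returns for host.

    Returns a list of (family, address, connected, seconds) tuples in
    resolver order. `probe` is a hypothetical helper, not a Barman API.
    """
    results = []
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        name = FAMILY_NAMES.get(family, str(family))
        start = time.monotonic()
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            connected = True
        except OSError:
            # Covers refused connections, timeouts and unsupported families.
            connected = False
        results.append((name, sockaddr[0], connected, time.monotonic() - start))
    return results
```

Calling `probe("srv1.regex-it.nl", 5432)` and printing each tuple would show whether an IPv6 address is tried and how long that attempt takes. If the firewall silently drops IPv6 traffic, that row would be expected to fail only once the timeout has elapsed.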

sergei-maertens commented 1 year ago

Yes, that was indeed my plan. I have to get back to $dayjob now, will report back later. Luckily I'm familiar with psycopg2 so I can dig around too to check if there's a way to force ipv4 if the cause is confirmed.

sergei-maertens commented 1 year ago

Confirmed that it works as expected with the IPv4 address used directly as the host. I can't find a (documented) way to force ipv4 from a host name, so the resolution will be to either configure the server with the IP address or open up the firewall to accept ipv6 connections.

psycopg2 uses libpq under the hood from what I've gathered, and that doesn't seem to document any such options. Either way, out of scope for barman, I'd say.
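
A variation on hard-coding the address: libpq documents a `hostaddr` connection keyword that takes a numeric IP address and skips the host name lookup. It does not force ipv4 for a host name by itself, but a script can resolve the IPv4 address and pin it in the conninfo string. The sketch below assumes that Barman passes `hostaddr` through to libpq unchanged, which has not been verified here. The helper names, the user `barman` and the database `postgres` are placeholders.

```python
import socket


def ipv4_addresses(host):
    """Return the IPv4 addresses the system resolver reports for host.

    Duplicates are dropped and resolver order is kept.
    """
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    addresses = []
    for *_unused, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def pinned_conninfo(host, user, dbname, port=5432):
    """Build a conninfo string that pins the first IPv4 address via hostaddr.

    Assumes the values contain no spaces or quotes, so no escaping is done.
    """
    addr = ipv4_addresses(host)[0]
    return f"host={host} hostaddr={addr} port={port} user={user} dbname={dbname}"
```

For example, `pinned_conninfo("srv1.regex-it.nl", "barman", "postgres")` would produce a string that keeps the host name while connecting to the resolved IPv4 address. The pinned address has to be regenerated if the server's IP changes, so this has the same maintenance cost as writing the IP into the config by hand.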