EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

CRITICAL (node "foo" (ID: 2) is not attached to expected upstream node "bar" (ID: 1)) repmgr-16 #838

Open dirtyworks opened 10 months ago

dirtyworks commented 10 months ago

Adding an old node back into the cluster fails, even though the join process reports no obvious errors. Both nodes are idle sandbox machines with plenty of CPU and memory. The environment is RHEL 8 with SELinux enabled (this is mandatory), with PostgreSQL 16 and repmgr-16 installed with yum from the postgresql.org repos. The steps below are listed in the order they were executed.

  1. rm -rf /var/lib/pgsql/16/data/*

  2. rm -rf /var/lib/pgsql/16/wal/*

    • /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf standby clone --upstream-conninfo 'host=bar port=15432 dbname=repmgr user=superduper passfile=/var/lib/pgsql/.pgpass sslmode=prefer sslcert=/etc/ssl/certs/host.crt sslkey=/var/lib/pgsql/key.pem sslrootcert=/etc/ssl/certs/ca-bundle.crt' -d 'bar port=15432 dbname=repmgr user=superduper passfile=/var/lib/pgsql/.pgpass sslmode=prefer sslcert=/etc/ssl/certs/host.crt sslkey=/var/lib/pgsql/key.pem sslrootcert=/etc/ssl/certs/ca-bundle.crt' -v -L DEBUG
      NOTICE: standby clone (using pg_basebackup) complete
      NOTICE: you can now start your PostgreSQL server
      HINT: for example: /usr/pgsql-16/bin/pg_ctl start -D /var/lib/pgsql/16/data
      DEBUG: get_node_record():
      SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 2
      DEBUG: get_node_record(): no record found for node 2
      HINT: after starting the server, you need to register this standby with "repmgr standby register"

      I am logging all queries on bar, and no INSERT into repmgr.nodes is issued during cloning.
      I assume the empty result from the last get_node_record() is expected because "standby register" hasn't been run yet. But maybe some part of the cloning process failed because foo isn't registered in the repmgr.nodes table? If that's the case, there is no indication that the cloning process failed.

  3. /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf node service --action start

  4. /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf standby register -d 'host=bar port=15432 dbname=repmgr user=superduper passfile=/var/lib/pgsql/.pgpass sslmode=prefer sslcert=/etc/ssl/certs/host.crt sslkey=/var/lib/pgsql/key.pem sslrootcert=/etc/ssl/certs/ca-bundle.crt' -v -L DEBUG --upstream-node-id=1

    ERROR: this node does not appear to be attached to upstream node "bar" (ID: 1)

    I can force the command and it registers, but the node is still not connected.
    Somewhere on the internet, someone reported that "standby follow" worked for them. It didn't work here, even though it also exited successfully.

  5. /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf standby follow -d 'host=bar port=15432 dbname=repmgr user=superduper passfile=/var/lib/pgsql/.pgpass sslmode=prefer sslcert=/etc/ssl/certs/host.crt sslkey=/var/lib/pgsql/key.pem sslrootcert=/etc/ssl/certs/ca-bundle.crt' -v -L DEBUG --upstream-node-id=1

    WARNING: node "foo" not found in "pg_stat_replication"
    DEBUG: sleeping 30 of max 30 seconds waiting for standby to attach to primary
    NOTICE: STANDBY FOLLOW successful
  6. /usr/pgsql-16/bin/repmgr -f /etc/repmgr/16/repmgr.conf node check

    Upstream connection: CRITICAL (node "foo" (ID: 2) is not attached to expected upstream node "bar" (ID: 1))
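
One detail that may be worth ruling out: in the clone step, the -d value starts with `bar` rather than `host=bar`, unlike the --upstream-conninfo value and the later commands. libpq keyword/value connection strings are made of `keyword=value` pairs, so a bare word stands out. This may only be a transcription slip in this report. The sketch below is a hypothetical helper (`conninfo_bad_tokens` is not part of repmgr) that lists tokens lacking a `keyword=` prefix. It assumes values contain no spaces or quotes, which holds for the strings used in the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of repmgr: print every whitespace-separated
# token of a conninfo string that is not of the form keyword=value.
# Assumption: values contain no spaces or quotes.
conninfo_bad_tokens() {
  local token
  local -a tokens
  read -r -a tokens <<< "$1"
  for token in "${tokens[@]}"; do
    case "$token" in
      ?*=*) ;;                      # looks like keyword=value
      *) printf '%s\n' "$token" ;;  # bare word, e.g. a host without "host="
    esac
  done
}

# Shortened versions of the two -d values used in the steps above.
conninfo_bad_tokens 'bar port=15432 dbname=repmgr user=superduper'       # prints: bar
conninfo_bad_tokens 'host=bar port=15432 dbname=repmgr user=superduper'  # prints nothing
```

An empty result only means every token has a keyword; it does not validate the keywords or values themselves.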

No amount of restarting repmgr or PostgreSQL on either node changes the outcome. Deleting and re-cloning the primary doesn't change the results either.
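
Since the warning refers to pg_stat_replication, comparing that view on bar with pg_stat_wal_receiver on foo would show whether a WAL receiver is connected at all. A minimal sketch follows: `replication_check_sql` is a hypothetical helper that only prints the query for each side. The psql invocations in the comments are assumptions based on the connection settings in the steps above, including the guess that foo listens on the same port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a diagnostic query for one side of the
# replication link. It only emits SQL and does not connect anywhere.
replication_check_sql() {
  case "$1" in
    primary)
      # One row per connected WAL sender; an attached standby should appear here.
      printf '%s\n' "SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;"
      ;;
    standby)
      # At most one row; no row means no WAL receiver process is running.
      printf '%s\n' "SELECT status, sender_host, sender_port, slot_name FROM pg_stat_wal_receiver;"
      ;;
    *)
      echo "usage: replication_check_sql primary|standby" >&2
      return 1
      ;;
  esac
}

# Show the queries. To run them (assumed connection settings, not verified):
#   psql -h bar -p 15432 -U superduper -d repmgr -c "$(replication_check_sql primary)"
#   psql -h foo -p 15432 -U superduper -d repmgr -c "$(replication_check_sql standby)"
replication_check_sql primary
replication_check_sql standby
```

If the standby query returns no row, the PostgreSQL log on foo is the next place to look for the reason the WAL receiver did not start or could not connect.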