EnterpriseDB / edb-ansible

Ansible code for deploying EDB Postgres database clusters and related products.

Cannot setup a 2-node cluster with Patroni #611

Closed pierreforstmann closed 11 months ago

pierreforstmann commented 11 months ago

Hello,

I'm trying to build a 2-node cluster on RHEL 8.8.

I run: ansible-playbook install.yml -i inventory.yml

with:

$ cat install.yml 
---
- hosts: all
  name: Patroni postgres cluster deployment playbook
  become: true
  any_errors_fatal: true
  gather_facts: true

  collections:
    - edb_devops.edb_postgres

  pre_tasks:
    - name: Initialize the user defined variables
      ansible.builtin.set_fact:
        pg_version: 14
        enable_edb_repo: false
        pg_type: PG
        disable_logging: false
        use_hostname: true

  roles:
    - role: setup_repo
      when: "'setup_repo' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
    - role: install_dbserver
      when: "'install_dbserver' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
    - role: setup_etcd
      when: "'setup_etcd' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
    - role: setup_patroni
      when: "'setup_patroni' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"

and

$ cat inventory.yml 
---
all:
  children:
    primary:
      hosts:
        cn2:
          ansible_host: 192.168.121.199
          private_ip: 192.168.121.199 
          etcd: true
          etcd_cluster_name: 'patroni-etcd'
    standby:
      hosts:
        cn3:
          ansible_host: 192.168.121.70 
          private_ip: 192.168.121.70 
          upstream_node_private_ip: 192.168.121.199 
          replication_type: asynchronous
          etcd: true
          etcd_cluster_name: 'patroni-etcd'

The playbook fails with:

TASK [edb_devops.edb_postgres.setup_patroni : Pause for few seconds for postgres to be available] **************************************************************************
Pausing for 30 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [cn2]

TASK [edb_devops.edb_postgres.setup_patroni : Copy the postgresql.conf.template to the server] *****************************************************************************
skipping: [cn3]
ok: [cn2]

TASK [edb_devops.edb_postgres.setup_patroni : Restart postgres after change of postgresql.auto.conf] ***********************************************************************
skipping: [cn3]
changed: [cn2]

TASK [Generate the pg_superuser_password] **********************************************************************************************************************************
skipping: [cn2]

TASK [edb_devops.edb_postgres.setup_patroni : Set pg_superuser_password] ***************************************************************************************************
skipping: [cn2]

TASK [Set postgres superuser's database cluster password] ******************************************************************************************************************

TASK [edb_devops.edb_postgres.manage_dbserver : Check database is in read/write mode] **************************************************************************************
fatal: [cn2]: FAILED! => {"changed": false, "msg": "unable to connect to database: connection to server on socket \"/var/run/postgresql/.s.PGSQL.5432\" failed: No such file or directory\n\tIs the server running locally and accepting connections on that socket?\n"}

NO MORE HOSTS LEFT *********************************************************************************************************************************************************

PLAY RECAP *****************************************************************************************************************************************************************
cn2                        : ok=134  changed=12   unreachable=0    failed=1    skipped=96   rescued=0    ignored=0   
cn3                        : ok=108  changed=8    unreachable=0    failed=0    skipped=95   rescued=0    ignored=0   

On the primary node, it looks like initdb has not been run:

[root@cn2 ~]# find  /var/lib/pgsql/14
/var/lib/pgsql/14
/var/lib/pgsql/14/backups
/var/lib/pgsql/14/data
/var/lib/pgsql/14/main
/var/lib/pgsql/14/main/data
/var/lib/pgsql/14/main/data/postgresql.auto.conf

Could you please tell me what is wrong here?

Thanks.

pierreforstmann commented 11 months ago

Hello,

After removing the failed installations on cluster nodes cn2 and cn3 and restarting the playbook, I see different behaviour.

I get:


TASK [Run query to check if replication was setup correctly on primary] *****************************************************************************************************************

TASK [edb_devops.edb_postgres.manage_dbserver : Execute sql scripts] ********************************************************************************************************************
skipping: [cn2]

TASK [edb_devops.edb_postgres.manage_dbserver : Execute query] **************************************************************************************************************************
ok: [cn2 -> 192.168.121.199] => (item={'query': 'Select application_name from pg_stat_replication', 'db': 'postgres'})

TASK [edb_devops.edb_postgres.setup_patroni : Set patroni_stat_query_result with sql_query_output] **************************************************************************************
ok: [cn2]
ok: [cn3]

TASK [edb_devops.edb_postgres.setup_patroni : Check that replication was successful on primary] *****************************************************************************************
fatal: [cn2]: FAILED! => {
    "assertion": "patroni_stat_query_result.results[0].query_result|length == patroni_standby_list|length",
    "changed": false,
    "evaluated_to": false,
    "msg": "Replication was not successful on primary"
}

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************

PLAY RECAP ******************************************************************************************************************************************************************************
cn2                        : ok=202  changed=47   unreachable=0    failed=1    skipped=126  rescued=0    ignored=0   
cn3                        : ok=128  changed=30   unreachable=0    failed=0    skipped=117  rescued=0    ignored=0   

Now on the primary, the instance is running and Patroni reports:

[root@cn2 ~]# patronictl -c /etc/patroni/cn2.yml list

+ Cluster: main (7299048075194060416) ------------------+----+-----------+
| Member | Host            | Role    | State            | TL | Lag in MB |
+--------+-----------------+---------+------------------+----+-----------+
| cn2    | 192.168.121.199 | Leader  | running          |  1 |           |
| cn3    | 192.168.121.70  | Replica | creating replica |    |   unknown |
+--------+-----------------+---------+------------------+----+-----------+
[root@cn2 ~]# 

On the standby node, I can see that pg_basebackup is still running and using a lot of CPU:

[root@cn3 ~]# ps -fu postgres
UID          PID    PPID  C STIME TTY          TIME CMD
postgres   71306       1  0 11:14 ?        00:00:08 /usr/bin/etcd --config-file /etc/etcd/etcd-3.5.7.conf
postgres   75006       1  0 11:15 ?        00:00:01 /usr/libexec/platform-python /usr/local/bin/patroni /etc/patroni/cn3.yml
postgres   75034   75006 13 11:15 ?        00:02:42 /usr/pgsql-14/bin/pg_basebackup --pgdata=/var/lib/pgsql/14/main/data -X stream --dbname=dbname=postgres user=repu
[root@cn3 ~]# 

But nothing has been created:

[root@cn3 ~]# find /var/lib/pgsql/
/var/lib/pgsql/
/var/lib/pgsql/14
/var/lib/pgsql/14/main
/var/lib/pgsql/14/main/data
/var/lib/pgsql/.pgpass
[root@cn3 ~]# 

On the standby node, journalctl -xe reports lots of:

Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:

Note that the primary instance has only the three default databases:

[postgres@cn2 ~]$ psql
psql (14.9)
Type "help" for help.

postgres=# \l+
                                                                    List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   |  Size   | Tablespace |                Description                 
-----------+----------+----------+-------------+-------------+-----------------------+---------+------------+--------------------------------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 8553 kB | pg_default | default administrative connection database
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +| 8401 kB | pg_default | unmodifiable empty database
           |          |          |             |             | postgres=CTc/postgres |         |            | 
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +| 8401 kB | pg_default | default template for new databases
           |          |          |             |             | postgres=CTc/postgres |         |            | 
(3 rows)

postgres=# 

Could you please tell me what is wrong here?

Thanks.

hannahms commented 11 months ago

Hi there,

I am not entirely sure what is going on here, but I can attempt to assist. Firstly, I would ensure that the variable use_patroni is set to true, which can be included in the set_fact pre-task; without it, initdb may not run as expected. I would also go through the variables within /defaults/main.yml, especially pg_superuser_password and pg_replication_user_password, given the repeated password prompts in the journalctl output. Finally, I would urge you to deploy a 3-node cluster with Patroni: a two-node etcd cluster cannot tolerate the loss of a node and may fail to maintain quorum, which may be contributing to the failure. This is particularly important during the pg_basebackup process and is likely a factor in the issues seen.
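A minimal sketch of those suggestions applied to the pre_tasks from the original report (the two password values are placeholders I've added for illustration, not defaults shipped by the collection — in practice they should come from Ansible Vault):

```yaml
  pre_tasks:
    - name: Initialize the user defined variables
      ansible.builtin.set_fact:
        pg_version: 14
        enable_edb_repo: false
        pg_type: PG
        disable_logging: false
        use_hostname: true
        # Ensure Patroni handles bootstrap (initdb) of the cluster
        use_patroni: true
        # Placeholder secrets -- replace with vaulted values
        pg_superuser_password: "CHANGE_ME"
        pg_replication_user_password: "CHANGE_ME"
```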

pierreforstmann commented 11 months ago

Hello,

I have tried with the following setup:

$ cat inventory.yml
---
all:
  children:
    primary:
      hosts:
        cn1:
          ansible_host: 192.168.121.238
          private_ip: 192.168.121.238
          etcd: true
          etcd_cluster_name: 'patroni-etcd'
    standby:
      hosts:
        cn2:
          ansible_host: 192.168.121.199
          private_ip: 192.168.121.199
          upstream_node_private_ip: 192.168.121.238
          replication_type: asynchronous
          etcd: true
          etcd_cluster_name: 'patroni-etcd'
        cn3:
          ansible_host: 192.168.121.70
          private_ip: 192.168.121.70
          upstream_node_private_ip: 192.168.121.238
          replication_type: asynchronous
          etcd: true
          etcd_cluster_name: 'patroni-etcd'

and

$ cat install.yml
---
- hosts: all
  name: Patroni postgres cluster deployment playbook
  become: true
  any_errors_fatal: true
  gather_facts: true

  collections:
    - edb_devops.edb_postgres

  pre_tasks:
    - name: Initialize the user defined variables
      ansible.builtin.set_fact:
        pg_version: 14
        enable_edb_repo: false
        pg_type: PG
        use_patroni: true
        disable_logging: false
        use_hostname: true

  roles:
    - role: setup_repo
      when: "'setup_repo' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
    - role: install_dbserver
      when: "'install_dbserver' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
    - role: setup_etcd
      when: "'setup_etcd' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
    - role: setup_patroni
      when: "'setup_patroni' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"

The playbook fails with the same error message:

TASK [edb_devops.edb_postgres.setup_patroni : Check that replication was successful on primary] ***
fatal: [cn1]: FAILED! => {
    "assertion": "patroni_stat_query_result.results[0].query_result|length == patroni_standby_list|length",
    "changed": false,
    "evaluated_to": false,
    "msg": "Replication was not successful on primary"
}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
cn1                        : ok=203  changed=69   unreachable=0    failed=1    skipped=125  rescued=0    ignored=0
cn2                        : ok=129  changed=36   unreachable=0    failed=0    skipped=116  rescued=0    ignored=0
cn3                        : ok=129  changed=37   unreachable=0    failed=0    skipped=116  rescued=0    ignored=0

The cluster status is in a similar state:

$ patronictl -c /etc/patroni/cn1.yml list
+ Cluster: main (7303908701948097507) ------------------+----+-----------+
| Member | Host            | Role    | State            | TL | Lag in MB |
+--------+-----------------+---------+------------------+----+-----------+
| cn1    | 192.168.121.238 | Leader  | running          |  1 |           |
| cn2    | 192.168.121.199 | Replica | creating replica |    |   unknown |
| cn3    | 192.168.121.70  | Replica | creating replica |    |   unknown |
+--------+-----------------+---------+------------------+----+-----------+
$
pierreforstmann commented 11 months ago

Hello,

I have found that the pg_basebackup issue is caused by the home directory of the postgres Linux user account.

Because the setup_etcd role created this user account, its home directory is /home/postgres, but the other roles expect the postgres user's home directory to be /var/lib/pgsql (pg_basebackup loops prompting for a password because it does not find .pgpass in /home/postgres).
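The mismatch can be confirmed on each node with a small diagnostic play (a sketch, not part of the collection; index 4 of the getent passwd entry is the home directory field):

```yaml
---
- name: Show the postgres account's home directory on every node
  hosts: all
  become: true
  tasks:
    - name: Look up the postgres entry in the passwd database
      ansible.builtin.getent:
        database: passwd
        key: postgres

    - name: Report the home directory (should be /var/lib/pgsql, not /home/postgres)
      ansible.builtin.debug:
        msg: "postgres home is {{ getent_passwd['postgres'][4] }}"
```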

I managed to have a working installation with the following patch:

$ git diff roles/setup_etcd/tasks/etcd_user_group.yml
diff --git a/roles/setup_etcd/tasks/etcd_user_group.yml b/roles/setup_etcd/tasks/etcd_user_group.yml
index fdb8968..0b3c3b8 100644
--- a/roles/setup_etcd/tasks/etcd_user_group.yml
+++ b/roles/setup_etcd/tasks/etcd_user_group.yml
@@ -9,4 +9,7 @@
   ansible.builtin.user:
     name: "{{ etcd_owner }}"
     group: "{{ etcd_group }}"
+# BEGIN PATCH
+    home: "/var/lib/pgsql"
+# END PATCH
   become: true

With this patch, running:

$  ansible-playbook install.yml -i inventory.yml

ends with:

PLAY RECAP *********************************************************************
cn1                        : ok=204  changed=49   unreachable=0    failed=0    skipped=129  rescued=0    ignored=0   
cn2                        : ok=132  changed=30   unreachable=0    failed=0    skipped=118  rescued=0    ignored=0   
cn3                        : ok=132  changed=30   unreachable=0    failed=0    skipped=118  rescued=0    ignored=0   

and the cluster looks OK:

# patronictl -c /etc/patroni/cn1.yml list
+ Cluster: main (7304291771927171212) -----------+----+-----------+
| Member | Host            | Role    | State     | TL | Lag in MB |
+--------+-----------------+---------+-----------+----+-----------+
| cn1    | 192.168.121.238 | Leader  | running   |  1 |           |
| cn2    | 192.168.121.199 | Replica | streaming |  1 |         0 |
| cn3    | 192.168.121.70  | Replica | streaming |  1 |         0 |
+--------+-----------------+---------+-----------+----+-----------+
[root@cn1 ~]# 

Thanks

hannahms commented 11 months ago

Hi Pierre,

Thanks for finding this bug! I will be sure to fix this within the setup_etcd role and enhance the pg_basebackup configuration within the setup_patroni role.

Thank you!