Hello,
After removing the failed installations on cluster nodes cn2 and cn3 and restarting the playbook, I see a different behaviour.
I get:
TASK [Run query to check if replication was setup correctly on primary] *****************************************************************************************************************
TASK [edb_devops.edb_postgres.manage_dbserver : Execute sql scripts] ********************************************************************************************************************
skipping: [cn2]
TASK [edb_devops.edb_postgres.manage_dbserver : Execute query] **************************************************************************************************************************
ok: [cn2 -> 192.168.121.199] => (item={'query': 'Select application_name from pg_stat_replication', 'db': 'postgres'})
TASK [edb_devops.edb_postgres.setup_patroni : Set patroni_stat_query_result with sql_query_output] **************************************************************************************
ok: [cn2]
ok: [cn3]
TASK [edb_devops.edb_postgres.setup_patroni : Check that replication was successful on primary] *****************************************************************************************
fatal: [cn2]: FAILED! => {
"assertion": "patroni_stat_query_result.results[0].query_result|length == patroni_standby_list|length",
"changed": false,
"evaluated_to": false,
"msg": "Replication was not successful on primary"
}
NO MORE HOSTS LEFT **********************************************************************************************************************************************************************
PLAY RECAP ******************************************************************************************************************************************************************************
cn2 : ok=202 changed=47 unreachable=0 failed=1 skipped=126 rescued=0 ignored=0
cn3 : ok=128 changed=30 unreachable=0 failed=0 skipped=117 rescued=0 ignored=0
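For reference, the failed assertion compares the number of rows returned by the replication query on the primary with the number of standbys; the same query can be run manually on the primary (the column list here is just illustrative):
[postgres@cn2 ~]$ psql -d postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"
An empty result at this point is consistent with the failed assertion.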
Now, on the primary node, the instance is running and Patroni reports:
[root@cn2 ~]# patronictl -c /etc/patroni/cn2.yml list
+ Cluster: main (7299048075194060416) ------------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+-----------------+---------+------------------+----+-----------+
| cn2 | 192.168.121.199 | Leader | running | 1 | |
| cn3 | 192.168.121.70 | Replica | creating replica | | unknown |
+--------+-----------------+---------+------------------+----+-----------+
[root@cn2 ~]#
On the standby node, I can see that pg_basebackup is still running and using a lot of CPU:
[root@cn3 ~]# ps -fu postgres
UID PID PPID C STIME TTY TIME CMD
postgres 71306 1 0 11:14 ? 00:00:08 /usr/bin/etcd --config-file /etc/etcd/etcd-3.5.7.conf
postgres 75006 1 0 11:15 ? 00:00:01 /usr/libexec/platform-python /usr/local/bin/patroni /etc/patroni/cn3.yml
postgres 75034 75006 13 11:15 ? 00:02:42 /usr/pgsql-14/bin/pg_basebackup --pgdata=/var/lib/pgsql/14/main/data -X stream --dbname=dbname=postgres user=repu
[root@cn3 ~]#
But nothing has been created:
[root@cn3 ~]# find /var/lib/pgsql/
/var/lib/pgsql/
/var/lib/pgsql/14
/var/lib/pgsql/14/main
/var/lib/pgsql/14/main/data
/var/lib/pgsql/.pgpass
[root@cn3 ~]#
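Since this is PostgreSQL 14, the primary also exposes base-backup progress, which can help tell whether the copy is actually advancing; a quick, illustrative check:
[postgres@cn2 ~]$ psql -d postgres -c "SELECT pid, phase, backup_streamed, backup_total FROM pg_stat_progress_basebackup;"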
On the standby node, journalctl -xe reports lots of:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
Nov 08 11:38:41 cn3 patroni[75034]: Password:
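The repeated prompts suggest pg_basebackup is not finding a password non-interactively; libpq reads .pgpass from the connecting OS user's home directory (or from the file named by PGPASSFILE). The prompt can be reproduced from the standby with a manual connection (the replication user name is truncated in the ps output above, so a placeholder is used here):
[postgres@cn3 ~]$ psql "host=192.168.121.199 dbname=postgres user=REPLICATION_USER" -c "SELECT 1"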
Note that the primary instance has only the 3 default databases:
[postgres@cn2 ~]$ psql
psql (14.9)
Type "help" for help.
postgres=# \l+
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
-----------+----------+----------+-------------+-------------+-----------------------+---------+------------+--------------------------------------------
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | | 8553 kB | pg_default | default administrative connection database
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +| 8401 kB | pg_default | unmodifiable empty database
| | | | | postgres=CTc/postgres | | |
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +| 8401 kB | pg_default | default template for new databases
| | | | | postgres=CTc/postgres | | |
(3 rows)
postgres=#
Could you please tell me what is wrong here?
Thanks.
Hi there,
I am not entirely sure what is going on here, but I can attempt to assist.
Firstly, I would ensure that the variable use_patroni is set to true, which can be included in the set_fact pre-tasks. Without this set, there may be some issues with initdb running.
I would also go through the variables within defaults/main.yml, especially pg_superuser_password and pg_replication_user_password, given the journalctl reports.
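For example, both can be overridden alongside the other user-defined variables in the set_fact pre-task (the values below are placeholders, not the collection's defaults):
pre_tasks:
  - name: Initialize the user defined variables
    ansible.builtin.set_fact:
      pg_version: 14
      pg_type: PG
      use_patroni: true
      pg_superuser_password: "CHANGE_ME_superuser"           # placeholder
      pg_replication_user_password: "CHANGE_ME_replication"  # placeholder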
I would also urge you to deploy a three-node cluster with Patroni: an etcd cluster with only two nodes cannot reach quorum, which may be contributing to the failure. This is particularly important during the pg_basebackup process and is likely a factor in the issues seen.
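As a quick sanity check of the etcd cluster itself, membership and quorum can be inspected with etcdctl (the endpoint URL and port below are illustrative):
$ ETCDCTL_API=3 etcdctl --endpoints=http://192.168.121.199:2379 member list
$ ETCDCTL_API=3 etcdctl --endpoints=http://192.168.121.199:2379 endpoint status --cluster -w table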
Hello,
I have tried with the following setup:
$ cat inventory.yml
---
all:
children:
primary:
hosts:
cn1:
ansible_host: 192.168.121.238
private_ip: 192.168.121.238
etcd: true
etcd_cluster_name: 'patroni-etcd'
standby:
hosts:
cn2:
ansible_host: 192.168.121.199
private_ip: 192.168.121.199
upstream_node_private_ip: 192.168.121.238
replication_type: asynchronous
etcd: true
etcd_cluster_name: 'patroni-etcd'
cn3:
ansible_host: 192.168.121.70
private_ip: 192.168.121.70
upstream_node_private_ip: 192.168.121.238
replication_type: asynchronous
etcd: true
etcd_cluster_name: 'patroni-etcd'
and
$ cat install.yml
---
- hosts: all
name: Patroni postgres cluster deployment playbook
become: true
any_errors_fatal: true
gather_facts: true
collections:
- edb_devops.edb_postgres
pre_tasks:
- name: Initialize the user defined variables
ansible.builtin.set_fact:
pg_version: 14
enable_edb_repo: false
pg_type: PG
use_patroni: true
disable_logging: false
use_hostname: true
roles:
- role: setup_repo
when: "'setup_repo' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
- role: install_dbserver
when: "'install_dbserver' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
- role: setup_etcd
when: "'setup_etcd' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
- role: setup_patroni
when: "'setup_patroni' in lookup('edb_devops.edb_postgres.supported_roles', wantlist=True)"
The playbook fails with the same error message:
TASK [edb_devops.edb_postgres.setup_patroni : Check that replication was successful on primary] ***
fatal: [cn1]: FAILED! => {
"assertion": "patroni_stat_query_result.results[0].query_result|length == patroni_standby_list|length",
"changed": false,
"evaluated_to": false,
"msg": "Replication was not successful on primary"
}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
cn1 : ok=203 changed=69 unreachable=0 failed=1 skipped=125 rescued=0 ignored=0
cn2 : ok=129 changed=36 unreachable=0 failed=0 skipped=116 rescued=0 ignored=0
cn3 : ok=129 changed=37 unreachable=0 failed=0 skipped=116 rescued=0 ignored=0
The cluster status is in a similar state:
$ patronictl -c /etc/patroni/cn1.yml list
+ Cluster: main (7303908701948097507) ------------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+-----------------+---------+------------------+----+-----------+
| cn1 | 192.168.121.238 | Leader | running | 1 | |
| cn2 | 192.168.121.199 | Replica | creating replica | | unknown |
| cn3 | 192.168.121.70 | Replica | creating replica | | unknown |
+--------+-----------------+---------+------------------+----+-----------+
$
Hello,
I have found that the pg_basebackup issue is caused by the home directory of the postgres Linux user account.
Because the setup_etcd role created this user account, its home directory is /home/postgres, but the other roles expect the postgres home directory to be /var/lib/pgsql (pg_basebackup loops because it does not find .pgpass in /home/postgres).
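This is easy to confirm on the standby by comparing the home directory registered for the account with the location where .pgpass was actually written:
[root@cn3 ~]# getent passwd postgres | cut -d: -f6
/home/postgres
[root@cn3 ~]# ls -l /var/lib/pgsql/.pgpass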
I managed to get a working installation with the following patch:
$ git diff roles/setup_etcd/tasks/etcd_user_group.yml
diff --git a/roles/setup_etcd/tasks/etcd_user_group.yml b/roles/setup_etcd/tasks/etcd_user_group.yml
index fdb8968..0b3c3b8 100644
--- a/roles/setup_etcd/tasks/etcd_user_group.yml
+++ b/roles/setup_etcd/tasks/etcd_user_group.yml
@@ -9,4 +9,7 @@
ansible.builtin.user:
name: "{{ etcd_owner }}"
group: "{{ etcd_group }}"
+# BEGIN PATCH
+ home: "/var/lib/pgsql"
+# END PATCH
become: true
With this patch, the run:
$ ansible-playbook install.yml -i inventory.yml
ends with:
PLAY RECAP *********************************************************************
cn1 : ok=204 changed=49 unreachable=0 failed=0 skipped=129 rescued=0 ignored=0
cn2 : ok=132 changed=30 unreachable=0 failed=0 skipped=118 rescued=0 ignored=0
cn3 : ok=132 changed=30 unreachable=0 failed=0 skipped=118 rescued=0 ignored=0
and the cluster looks OK:
# patronictl -c /etc/patroni/cn1.yml list
+ Cluster: main (7304291771927171212) -----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+-----------------+---------+-----------+----+-----------+
| cn1 | 192.168.121.238 | Leader | running | 1 | |
| cn2 | 192.168.121.199 | Replica | streaming | 1 | 0 |
| cn3 | 192.168.121.70 | Replica | streaming | 1 | 0 |
+--------+-----------------+---------+-----------+----+-----------+
[root@cn1 ~]#
Thanks
Hi Pierre,
Thanks for finding this bug. I will be sure to fix this within the setup_etcd role and enhance the pg_basebackup configuration within the setup_patroni role.
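As one possible enhancement, Patroni can also be pointed at an explicit pgpass path in its YAML configuration, so that pg_basebackup no longer depends on the OS account's home directory; a minimal sketch (the path and user name are placeholders, not the role's current template):
postgresql:
  pgpass: /var/lib/pgsql/.pgpass_patroni   # explicit path instead of relying on ~/.pgpass
  authentication:
    replication:
      username: REPLICATION_USER           # placeholder
      password: CHANGE_ME                  # placeholder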
Thank you!
Hello,
I'm trying to build a 2-node cluster on RHEL 8.8.
I run:
ansible-playbook install.yml -i inventory.yml
with:
and
The playbook fails with:
On the primary node, it looks like initdb has not been run:
Could you please tell me what is wrong here?
Thanks.