YanChii / ansible-role-postgres-ha

Create postgresql HA auto-failover cluster using pcs, pacemaker and PAF
Apache License 2.0
33 stars 22 forks source link

PostgreSQL 11, PAF 2.2.1 + geo-patch = playbook succeeds but master flip flops #25

Closed tamer-hassan closed 5 years ago

tamer-hassan commented 5 years ago

on fresh run on CentOS 7.

If I managed to catch the log from the start (as soon as the db data is initialized on master)

[root@prometheus-dev01 ~]# tail -f /var/lib/pgsql/11/data/log/postgresql-Wed.log 
2019-02-13 00:59:19.503 UTC [28138] LOG:  database system was shut down at 2019-02-13 00:59:15 UTC
2019-02-13 00:59:19.508 UTC [28136] LOG:  database system is ready to accept connections
2019-02-13 00:59:24.375 UTC [28136] LOG:  received SIGHUP, reloading configuration files
2019-02-13 00:59:32.239 UTC [28136] LOG:  received fast shutdown request
2019-02-13 00:59:32.240 UTC [28136] LOG:  aborting any active transactions
2019-02-13 00:59:32.241 UTC [28136] LOG:  background worker "logical replication launcher" (PID 28144) exited with exit code 1
2019-02-13 00:59:32.242 UTC [28139] LOG:  shutting down
2019-02-13 00:59:32.423 UTC [28136] LOG:  database system is shut down

2019-02-13 00:59:40.780 UTC [29677] LOG:  database system was shut down at 2019-02-13 00:59:32 UTC
2019-02-13 00:59:40.781 UTC [29677] LOG:  entering standby mode
2019-02-13 00:59:40.783 UTC [29677] LOG:  consistent recovery state reached at 0/3000098
2019-02-13 00:59:40.783 UTC [29677] LOG:  invalid record length at 0/3000098: wanted 24, got 0
2019-02-13 00:59:40.784 UTC [29673] LOG:  database system is ready to accept read only connections
2019-02-13 00:59:43.847 UTC [29682] FATAL:  could not connect to the primary server: could not connect to server: No route to host
                Is the server running on host "192.168.30.183" and accepting
                TCP/IP connections on port 5432?
2019-02-13 00:59:44.670 UTC [29677] LOG:  received promote request
2019-02-13 00:59:44.670 UTC [29779] FATAL:  terminating walreceiver process due to administrator command
2019-02-13 00:59:44.670 UTC [29677] LOG:  redo is not required
2019-02-13 00:59:44.763 UTC [29677] LOG:  selected new timeline ID: 2
2019-02-13 00:59:45.045 UTC [29677] LOG:  archive recovery complete
2019-02-13 00:59:45.256 UTC [29673] LOG:  database system is ready to accept connections

2019-02-13 01:00:01.894 UTC [29673] LOG:  received fast shutdown request
2019-02-13 01:00:01.894 UTC [29673] LOG:  aborting any active transactions
2019-02-13 01:00:01.896 UTC [29673] LOG:  background worker "logical replication launcher" (PID 29846) exited with exit code 1
2019-02-13 01:00:01.896 UTC [29679] LOG:  shutting down
2019-02-13 01:00:01.916 UTC [29673] LOG:  database system is shut down
2019-02-13 01:00:02.042 UTC [30825] LOG:  database system was shut down at 2019-02-13 01:00:01 UTC
2019-02-13 01:00:02.043 UTC [30825] LOG:  entering standby mode
2019-02-13 01:00:02.045 UTC [30825] LOG:  consistent recovery state reached at 0/30002C0
2019-02-13 01:00:02.045 UTC [30825] LOG:  invalid record length at 0/30002C0: wanted 24, got 0
2019-02-13 01:00:02.046 UTC [30823] LOG:  database system is ready to accept read only connections
2019-02-13 01:00:02.049 UTC [30830] FATAL:  pg_hba.conf rejects replication connection for host "192.168.30.181", user "replicator", SSL off
2019-02-13 01:00:02.049 UTC [30829] FATAL:  could not connect to the primary server: FATAL:  pg_hba.conf rejects replication connection for host "192.168.30.181", user "replicator", SSL off
2019-02-13 01:00:02.053 UTC [30832] FATAL:  pg_hba.conf rejects replication connection for host "192.168.30.181", user "replicator", SSL off
2019-02-13 01:00:02.053 UTC [30831] FATAL:  could not connect to the primary server: FATAL:  pg_hba.conf rejects replication connection for host "192.168.30.181", user "replicator", SSL off
2019-02-13 01:00:02.438 UTC [30823] LOG:  received fast shutdown request
2019-02-13 01:00:02.451 UTC [30823] LOG:  aborting any active transactions
2019-02-13 01:00:06.922 UTC [30884] FATAL:  the database system is shutting down
2019-02-13 01:00:07.457 UTC [30826] LOG:  shutting down
2019-02-13 01:00:07.462 UTC [30823] LOG:  database system is shut down
2019-02-13 01:00:07.848 UTC [30911] LOG:  database system was shut down in recovery at 2019-02-13 01:00:07 UTC
2019-02-13 01:00:07.848 UTC [30911] LOG:  entering standby mode
tamer-hassan commented 5 years ago

While following the waves of log lines in /var/log/cluster/corosync.log , I noticed an interesting warning, because the hostname was truncated ("prometheus" instead of "prometheus-dev01" or "-dev02")

pgsqlms(postgres)[21947]:       Feb 13 02:44:32  WARNING: "prometheus" is not connected to the primary

Traced it down to the regex pattern line in "sub _get_all_cluster_nodes", particularly this one: https://github.com/YanChii/ansible-role-postgres-ha/blob/master/files/pgsqlms-2.2.1-geo-patched#L843

$_ =~ m/\d\s([a-zA-Z0-9_]+).*$/;

Underscore is not valid char in hostnames, so I removed the underscore, and added the missing dot "." and dash "-" , since hostnames can also have dashes or dots; "example-host.name.com" :) so the correct line of code would look like:

$_ =~ m/\d\s([a-zA-Z0-9\-\.]+).*$/;

reran playbook on clean VMs and problem solved.

YanChii commented 5 years ago

Thank you @tamer-hassan! I didn't catch this one. Fixed in master. Jan