YanChii / ansible-role-postgres-ha

Create postgresql HA auto-failover cluster using pcs, pacemaker and PAF
Apache License 2.0

Planning to use Ubuntu #30

Closed readysetawesome closed 3 years ago

readysetawesome commented 3 years ago

I want to adapt this role to work on an Ubuntu distro. I know little about the system-level differences so I'm not sure how hard this will be. Can you provide any guidance?

YanChii commented 3 years ago

Hi @readysetawesome

In theory, it shouldn't be that hard. A quick search tells me that Ubuntu uses the pcs command for pacemaker (unlike OpenSuse).

Base your work on this branch, where I've migrated to the newer pcs library.

First, watch for when: ansible_distribution statements and add the appropriate Ubuntu match/command. This fact also drives several decisions later in the role.
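
A quick way to see what values these facts report on an Ubuntu box (the host name is just an example):

# dump the distribution-related facts for one host
$ ansible node01 -m setup -a 'filter=ansible_distribution*'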

The rest is just running the role inside some test env and fixing the errors as they pop up.

In the first run, you can omit some advanced options such as the ring or mcast settings.

Jan

readysetawesome commented 3 years ago

Thanks for sharing your thoughts, I will follow the advice.

I'm in the midst of building up and tearing down some environments repeatedly, so this is a good time to test it and figure it out.

readysetawesome commented 3 years ago

After some adjustments, it is mostly working for me with postgres-12 on Ubuntu 20.04. I'm using the latest PAF 2.3.0 & pgsqlms scripts. I have plugged in a custom script to manage the floating IP service provided by DigitalOcean, which works OK on its own (before the postgres-ha resources are added). Replication is up following a successful run of the role. The remaining mystery is my resource configuration, which seems to be a little off:

$ sudo pcs resource
  * pg-vip  (ocf::digitalocean:floatip):     Stopped
  * Clone Set: postgres-ha [postgres] (promotable):
    * postgres  (ocf::heartbeat:pgsqlms):    FAILED node01 (blocked)
    * postgres  (ocf::heartbeat:pgsqlms):    FAILED node02 (blocked)

Seems like I'm really close... I think I need to unpack some log data about how pacemaker is making its decisions. It says it is trying to stop postgres on both master and slave, and in both cases halts itself based on an unexpected state:

Mar 20 00:53:05 node02 pacemaker-controld[17532]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Mar 20 00:53:05 node02 pacemaker-schedulerd[17531]:  warning: Unexpected result (error: Unexpected state for instance "postgres" (returned 1)) was recorded for stop of postgres:0 on node02 at Mar 20 00:37:58 2021 
readysetawesome commented 3 years ago

May have found the issue here: the pgsqlms script has an incompatible default location for the pg socket file...

$ /usr/lib/postgresql/12/bin/pg_isready -h /tmp -p 5432 -d postgres -t 90
/tmp:5432 - no response

I validated my theory below, but need to confirm it with a playbook run as well.

$ /usr/lib/postgresql/12/bin/pg_isready -h /var/run/postgresql -p 5432 -d postgres -t 90
/var/run/postgresql:5432 - accepting connections
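
If that is confirmed, the clean fix is probably to point PAF at the Debian/Ubuntu socket directory through the pgsqlms pghost parameter rather than patching the script; roughly like this (the resource name "postgres" is just what my cluster uses):

# tell the pgsqlms agent to check /var/run/postgresql instead of /tmp
$ sudo pcs resource update postgres pghost=/var/run/postgresql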
readysetawesome commented 3 years ago

Fixed:

$ sudo pcs resource
  * pg-vip  (ocf::digitalocean:floatip):     Started pgsql-nyc1-01
  * Clone Set: postgres-ha [postgres] (promotable):
    * Masters: [ pgsql-nyc1-01 ]
    * Slaves: [ pgsql-nyc1-02 ]

so cool!

Thanks again for the tips. I will try to make my work reusable for others with a pull request at some point.

readysetawesome commented 3 years ago

Hmm, that was a fluke, unfortunately. There seems to be some issue with the timing/ordering of the resources starting: once the database sync is complete and all slaves are verified, pcs's postgres-ha master promotion immediately fails on both the primary and the secondary PGSQL nodes.

Can you explain how master selection is supposed to work for the postgres-ha resource? I.e., how does the pcs toolchain become aware of which postgres-ha node is master on the very first run (the step where we pcs enable/manage a brand new postgres cluster)?

I'm seeing this after several failed promotions, but I don't think it is the real error, rather a side effect of the root cause:

Failed Resource Actions:
  * postgres_promote_0 on pgsql-nyc1-01 'error' (1): call=30, status='complete', exitreason='Can not get current node LSN location', last-rc-change='2021-03-20 23:42:25Z', queued=0ms, exec=378ms
readysetawesome commented 3 years ago

I'm going to start over with the manage/enable resource tasks commented out and see if I can get a clearer picture of what's happening in syslog and the pacemaker logs.

readysetawesome commented 3 years ago

The latest PAF pgsqlms script (2.3.0) with postgres-12 now expects primary_conninfo in the pg config rather than in recovery.conf... so I added a lineinfile task to write that out. Somehow, PCS isn't getting the message:

Parameter "primary_conninfo" MUST contain 'application_name=pgsql-nyc1-02'. It is currently set to 'user=replicator passfile='/v'

I can see that my conf includes this line on the hosts, and verified that pcs is configured to use the same postgres conf. I have no idea where the pgsqlms script is getting this primary_conninfo from...
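
For reference, the line my lineinfile task writes out ends up looking roughly like this on the standby (values from my current setup):

$ grep primary_conninfo /etc/postgresql/12/main/postgresql.conf
primary_conninfo = 'host=174.138.110.252 port=5432 user=replicator application_name=pgsql-nyc1-02'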

YanChii commented 3 years ago

Hi, nice work so far. According to http://clusterlabs.github.io/PAF/Quick_Start-Debian-10-pcs.html, the primary_conninfo should indeed be in postgresql.conf. Maybe also try a pgsql11 cluster to see whether the old way with recovery.conf works fine. That way we'll know that the only issue is version 12 with the changed params.

Jan

readysetawesome commented 3 years ago

Yeah, I know it worked once after I removed the old recovery.conf template installation steps. After studying my host environment's network more carefully, I discovered that the hosts have multiple IPs and route a bit differently than what my pgsql replication config expects. At least that's what it seems to be after carefully inspecting the very first promotion with pcs resource debug-promote:

 >  stdout:     FATAL:  no pg_hba.conf entry for replication connection from host "10.10.0.2", user "replicator", SSL off

All my configs in the role were initially using the default/public IP interface (this is just how my inventory plugin works), so I tried a different interface address from my host's eth configs, in the 10.0.0.x space. That may have worked once by accident (as I mentioned above) when the interfaces were bound in a different order by my cloud provider. I'm going to try to rectify this by choosing the first IPv4 addr from a different bucket, ipv4_secondaries, provided by my inventory plugin + cloud provider and see if it gets routed consistently.
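
Whichever address ends up being used, the replication entry in pg_hba.conf has to match it; something along these lines (subnet and auth method are just placeholders for my setup):

# allow the replicator user to make replication connections from the private 10.x network
host    replication    replicator    10.10.0.0/16    md5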

readysetawesome commented 3 years ago

Man, the errors are just endless. The primary master goes into standby immediately every time, and I don't understand why. For pg12+, the only requirement to bring up a DB as master is to remove the standby.signal file in the pgdata directory, which I have done repeatedly in my hands-on attempts to get the cluster into a working state:

$ sudo pcs resource debug-start postgres
Operation start for postgres (ocf:heartbeat:pgsqlms) returned: 'ok' (0)
 >  stdout: /var/run/postgresql:5432 - no response
 >  stdout: pg_ctl: no server running
 >  stdout: waiting for server to start....2021-03-21 19:40:47.818 UTC [44642] LOG:  starting PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
 >  stdout: 2021-03-21 19:40:47.818 UTC [44642] LOG:  listening on IPv4 address "0.0.0.0", port 5432
 >  stdout: 2021-03-21 19:40:47.818 UTC [44642] LOG:  listening on IPv6 address "::", port 5432
 >  stdout: 2021-03-21 19:40:47.819 UTC [44642] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
 >  stdout: 2021-03-21 19:40:47.831 UTC [44643] LOG:  database system was shut down in recovery at 2021-03-21 19:38:50 UTC
 >  stdout: 2021-03-21 19:40:47.831 UTC [44643] LOG:  entering standby mode
 >  stdout: 2021-03-21 19:40:47.832 UTC [44643] LOG:  consistent recovery state reached at 0/30000D8
 >  stdout: 2021-03-21 19:40:47.832 UTC [44643] LOG:  invalid record length at 0/30000D8: wanted 24, got 0
 >  stdout: 2021-03-21 19:40:47.833 UTC [44642] LOG:  database system is ready to accept read only connections
 >  stdout: 2021-03-21 19:40:47.839 UTC [44647] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
 >  stdout:         Is the server running on host "174.138.110.252" and accepting
 >  stdout:         TCP/IP connections on port 5432?
 >  stdout: 2021-03-21 19:40:47.842 UTC [44648] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
 >  stdout:         Is the server running on host "174.138.110.252" and accepting
 >  stdout:         TCP/IP connections on port 5432?
 >  stdout:  done
 >  stdout: server started
 >  stdout: /var/run/postgresql:5432 - accepting connections
 >  stderr: Mar 21 19:40:47  INFO: Instance "postgres" started
readysetawesome commented 3 years ago

The problem lies somewhere in the pcs resource/constraints configuration or in the calls to the pgsqlms resource management script; that much is 100% certain. pcs resource commands always leave the database in this recovery state on the master (even if the recovery signal file is removed):

postgres: 12/main: startup   recovering 000000010000000000000003

While using the system service manager works fine:

pgsql@pgsql-nyc1-01:~$ sudo rm /var/lib/postgresql/12/main/standby.signal
pgsql@pgsql-nyc1-01:~$ sudo service postgresql start
pgsql@pgsql-nyc1-01:~$ ps aux | grep postgres | grep -v grep
postgres   50353  0.1  2.9 217156 29176 ?        Ss   19:51   0:00 /usr/lib/postgresql/12/bin/postgres -D /var/lib/postgresql/12/main -c config_file=/etc/postgresql/12/main/postgresql.conf
postgres   50355  0.0  0.4 217156  4360 ?        Ss   19:51   0:00 postgres: 12/main: checkpointer   
postgres   50356  0.0  0.5 217156  5828 ?        Ss   19:51   0:00 postgres: 12/main: background writer   
postgres   50357  0.0  0.4 217156  4360 ?        Ss   19:51   0:00 postgres: 12/main: walwriter   
postgres   50358  0.0  0.7 217696  7076 ?        Ss   19:51   0:00 postgres: 12/main: autovacuum launcher   
postgres   50359  0.0  0.4  71776  4976 ?        Ss   19:51   0:00 postgres: 12/main: stats collector   

How can I further debug the startup from pacemaker's side? Something is happening that I am not seeing.
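
For the record, these are the places I know to look so far, in case there is a better way:

# cluster state, including failed actions and inactive resources
$ sudo crm_mon --one-shot --inactive
# pacemaker's own log plus the corosync/pacemaker journal entries
$ sudo less /var/log/pacemaker/pacemaker.log
$ sudo journalctl -u pacemaker -u corosync --since "1 hour ago"
# run the resource agent by hand with verbose output
$ sudo pcs resource debug-start postgres --full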

My thinking is that I will next attempt to stop the role after creation of the postgres-ha resource, manually add the pcs constraints to the disabled resource one by one, and watch the state. This seems to be where things get into a bad state initially.

readysetawesome commented 3 years ago

Trying with all constraints disabled: I paused the role after adding the postgres-ha resource with the "--disabled" flag. But the cluster is still trying to do things with the resource before it is enabled:

  * postgres_stop_0 on pgsql-nyc1-02 'invalid parameter' (2): call=11, status='complete', exitreason='Parameter "primary_conninfo" MUST contain 'application_name=pgsql-nyc1-02'. It is currently set to 'user=replicator passfile='/v', last-rc-change='2021-03-21 20:13:33Z', queued=0ms, exec=144ms

I don't understand why a --disabled resource would start to affect cluster actions immediately.

YanChii commented 3 years ago

Disabling calls stop. And the stopping fails. So I think it's retrying it. Setting the primary_conninfo to the requested value could help.

All pg nodes start in recovery mode first; only after that is one of them promoted to master. Signal handling should be done by PAF itself (see the respective PAF commit that adds pgsql12 support).

readysetawesome commented 3 years ago

primary_conninfo is set correctly. Here I debug the setting using the command I took from PAF's management script:

$ /usr/lib/postgresql/12/bin/postgres -C primary_conninfo -D "/var/lib/postgresql/12/main" -c config_file=/etc/postgresql/12/main/postgresql.conf
2021-03-21 21:15:03.759 GMT [37317] LOG:  skipping missing configuration file "/var/lib/postgresql/12/main/postgresql.auto.conf"
port=5432 host=174.138.110.252 application_name=pgsql-nyc1-02 user=replicator 

It is almost comical how complicated this process is; I am chuckling a bit. It's like some kind of biblical test of true patience.

readysetawesome commented 3 years ago

Ah, different result when I run the command as the postgres user:

$ /usr/lib/postgresql/12/bin/postgres -C primary_conninfo -D "/var/lib/postgresql/12/main" -c config_file=/etc/postgresql/12/main/postgresql.conf
user=replicator passfile='/var/lib/postgresql/.pgpass' host=161.35.52.75 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any

I don't understand why it is different. There is only one postgresql.conf in existence on this machine, and it contains the correct value. Where on earth is the other value coming from?

I am stumped for now.

readysetawesome commented 3 years ago

LOL, if you look at the output above you'll see the first one is run by a user who doesn't have privileges on postgresql.auto.conf, which is overriding the setting my role adds to postgresql.conf.

I'm not going to be naive and say this is almost working, but again it seems like progress. Hopefully not another mirage!

(EDIT) I think it has to do with my installation of PG12: apt starts postgres on the hosts, and they generate their own auto.conf as part of the first start or something. I will try removing this conf after stopping the cluster and before the pcs_resource step.
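
Rather than deleting the whole file, clearing just the stray override should also work; something like this while the instance is still up (just a thought, not tested yet):

# remove the primary_conninfo override from postgresql.auto.conf
$ sudo -u postgres psql -c "ALTER SYSTEM RESET primary_conninfo;"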

readysetawesome commented 3 years ago

Still not working. I'm going in circles now; I've seen this error before:

Failed Resource Actions:
  * postgres_promote_0 on pgsql-nyc1-01 'error' (1): call=17, status='complete', exitreason='Can not get current node LSN location', last-rc-change='2021-03-21 22:12:09Z', queued=0ms, exec=374ms
$ sudo pcs resource debug-promote postgres
Operation promote for postgres (ocf:heartbeat:pgsqlms) returned: 'error' (1)
 >  stdout: /var/run/postgresql:5432 - accepting connections
 >  stderr: 2021-03-21 22:14:48.448 GMT [35591] LOG:  skipping missing configuration file "/var/lib/postgresql/12/main/postgresql.auto.conf"
 >  stderr: 2021-03-21 22:14:48.489 GMT [35594] LOG:  skipping missing configuration file "/var/lib/postgresql/12/main/postgresql.auto.conf"
 >  stderr: Could not query value of cancel_switchover-postgres: attribute does not exist
 >  stderr: Could not query value of recover_master-postgres: attribute does not exist
 >  stderr: Could not query value of nodes-postgres: attribute does not exist
 >  stderr: Could not query value of lsn_location-postgres: attribute does not exist
 >  stderr: ocf-exit-reason:Can not get current node LSN location
readysetawesome commented 3 years ago

UPDATE: So now the question I'm trying to answer is why the new master's lsn_location isn't being set by the pre-promote action.
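
I'm not sure how to inspect that attribute directly, so the next step is probably grepping the pacemaker log for PAF's notify/pre-promote activity:

# look for the pgsqlms pre-promote run and any attribute errors around it
$ sudo grep -iE 'pgsqlms|pre-promote|lsn_location' /var/log/pacemaker/pacemaker.log | tail -n 50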

YanChii commented 3 years ago

Well, this looks like an issue with PAF itself, not the role. Does setting the resource manually (without this role, according to PAF docs) make any difference?

readysetawesome commented 3 years ago

Thanks for the suggestion! I found a missing setting when I was looking through the PAF docs for the promotable option: I was missing notify=true at the very end. I have a working cluster for the second time now, but it came up a bit sloppy and promoted host 2 for some reason:

(screenshot: cluster status showing host 2 promoted as master)

At least the floating IP did follow the promotion :) which was cool. I will read further through the PAF docs; maybe I am missing some other detail. Or maybe with the latest versions of PAF I don't get to choose a master when two pgsql nodes are synchronized and promotable?
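
For anyone following along, the shape of the resource definition with that fix is roughly this, adapted from the PAF Debian quick start (paths, timeouts and the resource name are just what I'm using, not gospel):

# promotable pgsqlms clone; notify=true at the end is the piece I was missing
$ sudo pcs resource create postgres ocf:heartbeat:pgsqlms \
    bindir=/usr/lib/postgresql/12/bin \
    pgdata=/etc/postgresql/12/main \
    pghost=/var/run/postgresql \
    op start timeout=60s \
    op stop timeout=60s \
    op promote timeout=30s \
    op demote timeout=120s \
    op monitor interval=15s timeout=10s role="Master" \
    op monitor interval=16s timeout=10s role="Slave" \
    op notify timeout=60s \
    promotable notify=true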

I seriously appreciate all of your comments, helping to get me focused on the right pieces in this big ol' puzzle.

(EDIT2) There must be some kind of race condition; I re-ran the role on fresh hardware and got the expected outcome with node 1 as master:

(screenshot: cluster status showing node 1 promoted as master)

WOOHOO! Time to celebrate. Hopefully I can get this all cleaned up and make it useful with a pull request eventually.