YanChii / ansible-role-postgres-ha

Create postgresql HA auto-failover cluster using pcs, pacemaker and PAF
Apache License 2.0

How to replace a node using role's playbook? #28

Closed itxx00 closed 4 years ago

itxx00 commented 4 years ago

Hi, thanks for this role. I have a 3-node cluster. One node's OS has been reinstalled and its hard drive has been replaced with a new one. How can I recover this node with the playbooks? Thanks.

YanChii commented 4 years ago

Hi @itxx00

The role does not do this out of the box, but here's what you can do. Adding a node can disrupt running resources, so the database needs to be down while the node is added. To achieve that (and not lose the benefit of the actions the role performs before the node join), you have to temporarily edit the role so that it stops before the node add.

Add this task - meta: end_play here.

Now (let's say that pgha3 is the replaced node):

copy recovery template from other node, e.g. pgha1 (adjust path for pg version and cluster name)

scp pgha1:/var/lib/pgsql/10/data/recovery.conf.pgcluster.pcmk /var/lib/pgsql/10/recovery.conf.pgcluster.pcmk

change the application name to the actual node name

sed -i'' -e 's/application_name=pgha1/application_name=pgha3/' /var/lib/pgsql/10/recovery.conf.pgcluster.pcmk

refresh resource errors so the postgres slave starts:

pcs resource cleanup postgres-ha --node pgha3


now, finally, re-run the role
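
The manual steps above can be collected into one small script. This is only a sketch, not part of the role: the variable names, the `run` helper and the `DRY_RUN` switch are illustrative assumptions, and the paths are the ones from the example (PostgreSQL 10, cluster name pgcluster), so adjust them to your setup.

```shell
#!/bin/sh
# Sketch only: run on the replaced node (pgha3 in the example above).
# Variable names, run() and DRY_RUN are assumptions made for illustration.
set -eu

SRC_NODE="${SRC_NODE:-pgha1}"   # healthy node to copy the template from
NEW_NODE="${NEW_NODE:-pgha3}"   # node that was replaced
SRC_TEMPLATE="${SRC_TEMPLATE:-/var/lib/pgsql/10/data/recovery.conf.pgcluster.pcmk}"
TEMPLATE="${TEMPLATE:-/var/lib/pgsql/10/recovery.conf.pgcluster.pcmk}"
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, 0 = execute them

run() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

recover_node() {
    # 1. copy the recovery template from a healthy node
    run scp "${SRC_NODE}:${SRC_TEMPLATE}" "$TEMPLATE"
    # 2. point application_name at the replaced node
    run sed -i'' -e "s/application_name=${SRC_NODE}/application_name=${NEW_NODE}/" "$TEMPLATE"
    # 3. clear resource errors so the postgres slave starts
    run pcs resource cleanup postgres-ha --node "$NEW_NODE"
}

recover_node
```

With the default `DRY_RUN=1` the script only prints the three commands, so you can check them before running it again with `DRY_RUN=0`.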

Jan

itxx00 commented 4 years ago

Hi @YanChii, I followed the steps. After disabling postgres-ha, it seems that a node that already exists cannot be added to the cluster: (image)

YanChii commented 4 years ago

The old node must be removed before adding a new one with the same name.

pcs cluster node remove pgha3 --request-timeout=1
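
If you want to script this, a guard like the following avoids calling the remove when the name is already gone. It is a rough sketch under assumptions: the function names are made up, and it assumes that `crm_node -l` prints one node per line as `<id> <name> <state>`, which you should verify on your Pacemaker version.

```shell
#!/bin/sh
# Sketch only: remove the stale cluster entry if the name is still registered.
# Function names are hypothetical. The "<id> <name> <state>" line format of
# `crm_node -l` is an assumption, check it on your system first.
set -eu

node_is_known() {
    # $1 = node name; stdin = membership list, one "<id> <name> <state>" per line.
    # Exit status 0 if the name appears in the second column, 1 otherwise.
    awk -v n="$1" '$2 == n { found = 1 } END { exit !found }'
}

remove_stale_node() {
    # $1 = node name, e.g. pgha3
    if crm_node -l | node_is_known "$1"; then
        pcs cluster node remove "$1" --request-timeout=1
    else
        echo "node $1 is not in the membership list, nothing to remove"
    fi
}
```

Run `remove_stale_node pgha3` on one of the surviving nodes, before the role adds the reinstalled node back.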

I've also updated my post above.

Jan

itxx00 commented 4 years ago

After running pcs resource cleanup postgres-ha --node db01 and re-running the role, the pgsql service did not start on db01, and the playbook always fails at the task "check if slaves are connected". The tasks now look like this: (image)

YanChii commented 4 years ago

That is expected. Please read the instructions above again.

itxx00 commented 4 years ago

After finally re-running the role, it seems postgres still cannot start on db01 :-(

YanChii commented 4 years ago

Postgres must be up BEFORE the last role re-run. Do whatever is necessary to start it. A resource cleanup, clear, or refresh should be enough (careful, one of them restarts the master resource). Then maybe check the constraints (they should not be a problem when the node name is unchanged) or disable/enable the resource. Also check the main logs on the new server.

Jan
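
As a rough illustration of that sequence, something like the script below can be run on the replaced node. It is a sketch under assumptions: the function name, the `run` helper and the `DRY_RUN` switch are invented for the example, the `--node` form follows the cleanup command used earlier in this thread, and newer pcs versions may expect a different syntax.

```shell
#!/bin/sh
# Sketch only: get postgres running on the replaced node before the final
# role re-run. nudge_resource(), run() and DRY_RUN are illustrative assumptions.
set -eu

NODE="${NODE:-db01}"                 # replaced node
RESOURCE="${RESOURCE:-postgres-ha}"  # resource name used in this thread
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands, 0 = execute them

run() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

nudge_resource() {
    # $1 = cleanup | clear | refresh
    # Careful: per the comment above, one of these restarts the master resource.
    case "$1" in
        cleanup) run pcs resource cleanup "$RESOURCE" --node "$NODE" ;;
        clear)   run pcs resource clear "$RESOURCE" "$NODE" ;;
        refresh) run pcs resource refresh "$RESOURCE" ;;
        *)       echo "unknown action: $1" >&2; return 2 ;;
    esac
    # Then look at the overall state and at the constraints.
    run pcs status
    run pcs constraint
}

nudge_resource "${1:-cleanup}"
```

If the slave still does not start, `pcs resource disable postgres-ha` followed by `pcs resource enable postgres-ha` is the heavier option, but it stops the database on all nodes, so it is left out of the script on purpose.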

itxx00 commented 4 years ago

Thanks for your help, the lost node is back now.