itxx00 commented 4 years ago

Hi, thanks for this role. I have a 3-node cluster, and one node has had its OS reinstalled and its hard drive replaced with a new one. How can I recover this node with the playbooks? Thanks.

YanChii commented 4 years ago

Hi @itxx00

The role does not do this out of the box, but here's what you can do. Adding a node can disrupt running resources, so the database needs to be down while the node is added. To achieve that (and not lose the benefit of what the role does before the node join), you have to temporarily edit the role so that it stops before the node add.

Add this task - meta: end_play here.

Now (let's say that pgha3 is the replaced node):

run the role

when it exits, shut down the DB and add the node manually

pcs cluster node remove pgha3 --request-timeout=1
# outage start
pcs resource disable postgres-ha
# wait for stop
pcs cluster node add pgha3
ssh pgha3 pcs cluster start
pcs resource enable postgres-ha
# outage end
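
The "# wait for stop" step above can be scripted instead of watched by hand. Below is a minimal sketch of a generic retry helper, not something the role ships. The name `wait_until` and the check command you pass to it are assumptions; a suitable check could, for example, inspect the output of `pcs status` on your cluster.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...
# Runs CMD repeatedly until it succeeds, at most TRIES times,
# sleeping DELAY seconds after each failed attempt.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical usage between "disable" and "node add":
#   wait_until 30 2 db_is_stopped || echo "DB did not stop in time" >&2
# where db_is_stopped is a check you write for your own cluster.
```

Failing with a timeout rather than looping forever keeps the outage window bounded: if the resource does not stop, you find out and can investigate before adding the node.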

revert the role edits you made earlier
re-run the role
now it fails while waiting for all slaves to connect (but it gets further than before)

then on node 3 (pgha3):


yum install https://github.com/YanChii/ansible-role-postgres-ha/raw/master/files/resource-agents-paf-2.2.1-1.noarch.rpm

copy recovery template from other node, e.g. pgha1 (adjust path for pg version and cluster name)

scp pgha1:/var/lib/pgsql/10/data/recovery.conf.pgcluster.pcmk /var/lib/pgsql/10/recovery.conf.pgcluster.pcmk

change the application name to the actual node name

sed -i'' -e 's/application_name=pgha1/application_name=pgha3/' /var/lib/pgsql/10/recovery.conf.pgcluster.pcmk
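
If you do not want to depend on which node the template was copied from, the substitution can match any existing application name. This is a sketch of that variation, not the command from the post: the function name `set_app_name` is made up, and it assumes the template has the application name inside a quoted `primary_conninfo` value, so the name ends at a space or a single quote.

```shell
#!/bin/sh
# set_app_name FILE NEW
# Replaces whatever follows "application_name=" (up to a space or a
# single quote) with NEW, then checks that NEW is present in FILE.
set_app_name() {
    file=$1
    new=$2
    tmp=$(mktemp) || return 1
    if ! sed -e "s/application_name=[^ ']*/application_name=$new/" "$file" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Write back through the original file so its owner and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
    grep -q "application_name=$new" "$file"
}

# Hypothetical usage on the replaced node:
#   set_app_name /var/lib/pgsql/10/recovery.conf.pgcluster.pcmk pgha3
```

The final `grep` makes the function return non-zero when the file had no `application_name=` setting at all, which would otherwise go unnoticed.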

refresh resource errors so the postgres slave starts:

pcs resource cleanup postgres-ha --node pgha3


now finally re-run the role again

Jan

itxx00 commented 4 years ago

Hi @YanChii, I followed the steps, and after disabling postgres-ha it seems that a node which already exists cannot be added to the cluster.

YanChii commented 4 years ago

The old node must be removed before adding a new one with the same name.

pcs cluster node remove pgha3 --request-timeout=1

I've also updated my post above.

Jan

itxx00 commented 4 years ago

After pcs resource cleanup postgres-ha --node db01 and re-running the role, the pgsql service did not start up on db01, and the playbook always fails at "check if slaves are connected". Now the tasks look like:

YanChii commented 4 years ago

That is expected. Please read the instructions above again.

itxx00 commented 4 years ago

After the final re-run of the role, it seems postgres still cannot start on db01 :-(

YanChii commented 4 years ago

Postgres must be up BEFORE the last role re-run. Do what's necessary to start it. Resource cleanup, clear, or refresh should be enough (careful, one of them restarts the master resource). Then maybe check the constraints (should not be a problem when the node name is unchanged) or disable/enable the resource. Also check the main logs on the new server.

Jan
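
One way to confirm that postgres is up before the last re-run is to probe it with `pg_isready`, the standard PostgreSQL client tool. This is only a sketch: the function name `pg_state` and the default port 5432 are assumptions, and `pg_isready` may need its full path if the PostgreSQL binaries are not in `PATH` on your nodes.

```shell
#!/bin/sh
# pg_state HOST [PORT]
# Prints a short description of what pg_isready reports for HOST
# and returns pg_isready's exit status (0 means accepting connections).
pg_state() {
    host=$1
    port=${2:-5432}
    rc=0
    pg_isready -q -h "$host" -p "$port" || rc=$?
    case $rc in
        0) echo "accepting connections" ;;
        1) echo "rejecting connections (for example, still starting up)" ;;
        2) echo "no response" ;;
        *) echo "check could not be run (exit status $rc)" ;;
    esac
    return "$rc"
}

# Hypothetical usage, with the replaced node's name:
#   pg_state pgha3 && echo "safe to re-run the role"
```

This only shows that the server answers. Whether the slave is actually connected to the master is what the role's own check verifies on the final run.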

itxx00 commented 4 years ago

Thanks for your help, the lost node is back now.

YanChii / ansible-role-postgres-ha

How to replace a node using role's playbook? #28
