YanChii / ansible-role-postgres-ha

Create postgresql HA auto-failover cluster using pcs, pacemaker and PAF
Apache License 2.0
33 stars 22 forks

not able to run on centos 6 #1

Closed frank3427 closed 7 years ago

frank3427 commented 7 years ago

i know you said it should run on centos 7, and I am trying to get it to run on centos 6. None of the tasks that create or modify files are working, and it seems that any task that uses the postgres_ha_cluster_master_host variable does not work either.

any help would be greatly appreciated

frank

frank3427 commented 7 years ago

question,

why would this role not be able to edit or make changes on a centos 6 system? I find that very odd. If I manually perform the setup, I get further along. It's just strange that it does not write, edit, update, or create files on centos 6.

YanChii commented 7 years ago

Hi @frank3427,

The default clustering stack in centos 6 (cman + rgmanager) is very different from the one in centos 7 (corosync + pacemaker). It is not a stopper, but honestly I didn't think anyone would still be using centos 6 by this time. Anyway, it seems that you also have problems with other parts of the role.

Which tasks exactly are failing? Please also share the contents of your cluster.conf. Do other ansible roles/tasks (not from postgres-ha) have similar problems?

If you really want to have this role working on centos 6, we can do it together. What I need so far:

  1. A list of failed tasks with error reports.
  2. Cluster commands for creating the cluster and resources that are confirmed to work on centos 6 (you probably already use the corosync/pacemaker stack from the epel6 repo).
  3. A tester, because I've already migrated all my centos machines to 7.

Thanks for the report.

Jan

YanChii commented 7 years ago

For point 2: it seems that the centos 6 guide is quite similar to the c7 one: https://dalibo.github.io/PAF/Quick_Start-CentOS-6.html

frank3427 commented 7 years ago

yes, let's get this working on Centos 6. Currently I am stuck on the postgresql_sync.yml section of the role: from the init DB section down, nothing is being written on the master host. Both servers are fresh minimal installs. On Centos 6 you need to add libselinux-python to the installation, so I added a step in pre-task.yml to install it.
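The extra pre-task described above might look like this (a sketch; the task name and module invocation are my assumptions — only the package name comes from the comment above):

```yaml
# Hypothetical pre_tasks entry for CentOS 6 minimal installs: Ansible's
# file-manipulating modules (copy, template, lineinfile) need the SELinux
# python bindings installed on the managed host.
- name: "install libselinux-python (CentOS 6)"
  yum:
    name: libselinux-python
    state: present
```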

here is my run:

[root@AnsibleServer ~]# vi dbs3-postgres-ha.yml

- name: "install PG HA"
  hosts: dbs6
  become: yes
  gather_facts: true
  any_errors_fatal: true
  pre_tasks:
    - name: "disable firewall"
      service: "name=iptables state=stopped enabled=no"
  roles:
    - postgres-ha6

[root@AnsibleServer ~]# ansible-playbook --ask-pass dbs3-postgres-ha.yml SSH password:

PLAY [install PG HA] ***

TASK [Gathering Facts] ***** ok: [dbs03.prodea-int.net] ok: [dbs04.prodea-int.net]

TASK [disable firewall] **** ok: [dbs03.prodea-int.net] ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : debug] **** ok: [dbs03.prodea-int.net] => { "msg": "MASTER NODE SET TO dbs03.prodea-int.net" }

TASK [postgres-ha6 : verify postgres_ha_cluster_master_host] *** skipping: [dbs03.prodea-int.net] skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : yum] ** ok: [dbs03.prodea-int.net] ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : debug] **** ok: [dbs03.prodea-int.net] => { "msg": "cluster_members=[u'dbs03.prodea-int.net', u'dbs04.prodea-int.net']" }

TASK [postgres-ha6 : Build hosts file] ***** changed: [dbs03.prodea-int.net] => (item=dbs03.prodea-int.net) changed: [dbs04.prodea-int.net] => (item=dbs03.prodea-int.net) changed: [dbs03.prodea-int.net] => (item=dbs04.prodea-int.net) changed: [dbs04.prodea-int.net] => (item=dbs04.prodea-int.net)

TASK [postgres-ha6 : install cluster pkgs] ***** ok: [dbs03.prodea-int.net] => (item=[u'pcs', u'pacemaker', u'cman', u'ccs']) ok: [dbs04.prodea-int.net] => (item=[u'pcs', u'pacemaker', u'cman', u'ccs'])

TASK [postgres-ha6 : service pcsd start] *** ok: [dbs03.prodea-int.net] ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : setup hacluster password] ***** ok: [dbs04.prodea-int.net] ok: [dbs03.prodea-int.net]

TASK [postgres-ha6 : setup cluster auth] *** changed: [dbs04.prodea-int.net] changed: [dbs03.prodea-int.net]

TASK [postgres-ha6 : create cluster] *** skipping: [dbs04.prodea-int.net] changed: [dbs03.prodea-int.net]

TASK [postgres-ha6 : join cluster nodes] *** skipping: [dbs04.prodea-int.net] => (item=dbs03.prodea-int.net) failed: [dbs03.prodea-int.net] (item=dbs04.prodea-int.net) => {"changed": true, "cmd": "/bin/sh -c \"if ! grep -q 'ring0_addr[:] dbs04.prodea-int.net[\t ]$' /etc/corosync/corosync.conf; then pcs cluster node add dbs04.prodea-int.net; fi\"", "delta": "0:00:01.731010", "end": "2017-08-28 03:28:23.058813", "failed": true, "item": "dbs04.prodea-int.net", "rc": 1, "start": "2017-08-28 03:28:21.327803", "stderr": "grep: /etc/corosync/corosync.conf: No such file or directory\nError: Unable to add 'dbs04.prodea-int.net' to cluster: node is already in a cluster", "stderr_lines": ["grep: /etc/corosync/corosync.conf: No such file or directory", "Error: Unable to add 'dbs04.prodea-int.net' to cluster: node is already in a cluster"], "stdout": "", "stdout_lines": []}

TASK [postgres-ha6 : start cluster] **** changed: [dbs04.prodea-int.net]

TASK [postgres-ha6 : alter stonith settings] *** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : alter cluster policy settings] **** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : alter cluster transition settings] **** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : verify cluster configuration] ***** changed: [dbs04.prodea-int.net]

TASK [postgres-ha6 : enable cluster autostart] ***** changed: [dbs04.prodea-int.net]

TASK [postgres-ha6 : create virtual IP resource] *** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : import pg96 repo] ***** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : install epel-release] ***** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : install pg96] *** ok: [dbs04.prodea-int.net]

(from here down, no changes or steps are performed on the defined master host, dbs03.prodea-int.net)

TASK [postgres-ha6 : init DB dir on master if necessary] * skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : check if DB was synchronized before] ** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : alter clustering-related settings in postgresql.conf] ***** skipping: [dbs04.prodea-int.net] => (item={'key': u'hot_standby', 'value': u'on'}) skipping: [dbs04.prodea-int.net] => (item={'key': u'listen_addresses', 'value': u"'*'"}) skipping: [dbs04.prodea-int.net] => (item={'key': u'wal_level', 'value': u'hot_standby'}) skipping: [dbs04.prodea-int.net] => (item={'key': u'wal_log_hints', 'value': u'on'}) skipping: [dbs04.prodea-int.net] => (item={'key': u'max_wal_senders', 'value': u'2'}) skipping: [dbs04.prodea-int.net] => (item={'key': u'max_replication_slots', 'value': u'2'})

TASK [postgres-ha6 : alter DB ACL in pg_hba.conf] ** skipping: [dbs04.prodea-int.net] => (item=dbs04.prodea-int.net)

TASK [postgres-ha6 : alter DB replication ACL in pg_hba.conf] ** skipping: [dbs04.prodea-int.net] => (item=dbs04.prodea-int.net)

TASK [postgres-ha6 : setup DB cluster auth (master IP)] **** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : setup .pgpass replication auth for master IP] ***** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : setup .pgpass replication auth for other IPs] ***** ok: [dbs04.prodea-int.net] => (item=dbs04.prodea-int.net)

TASK [postgres-ha6 : check if master host "dbs03.prodea-int.net" is really a DB master] **** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : mark master DB] *** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : check if DB is running (failure is ok)] *** fatal: [dbs04.prodea-int.net]: FAILED! => {"changed": true, "cmd": "/usr/pgsql-9.6/bin/pg_ctl -D /var/lib/pgsql/9.6/data status", "delta": "0:00:00.025437", "end": "2017-08-28 03:28:40.713378", "failed": true, "rc": 4, "start": "2017-08-28 03:28:40.687941", "stderr": "pg_ctl: directory \"/var/lib/pgsql/9.6/data\" is not a database cluster directory", "stderr_lines": ["pg_ctl: directory \"/var/lib/pgsql/9.6/data\" is not a database cluster directory"], "stdout": "", "stdout_lines": []} ...ignoring

TASK [postgres-ha6 : check if DB is running in cluster (failure is OK)] **** fatal: [dbs04.prodea-int.net]: FAILED! => {"changed": true, "cmd": "pcs constraint location show resources \"postgres-ha\" | grep -q Enabled", "delta": "0:00:00.351822", "end": "2017-08-28 03:28:41.872381", "failed": true, "rc": 1, "start": "2017-08-28 03:28:41.520559", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []} ...ignoring

TASK [postgres-ha6 : start master DB if necessary (without cluster)] *** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : start master DB if necessary (in cluster)] **** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : setup DB replication auth] **** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : check if DB sync is required] ***** ok: [dbs04.prodea-int.net]

TASK [postgres-ha6 : stop slave DB] **** skipping: [dbs04.prodea-int.net]

TASK [postgres-ha6 : remove slave DB datadir before sync] ** changed: [dbs04.prodea-int.net]

TASK [postgres-ha6 : synchronize slave databases] ** fatal: [dbs04.prodea-int.net]: FAILED! => {"changed": true, "cmd": "/usr/pgsql-9.6/bin/pg_basebackup -h \"172.24.2.187\" -p 5432 -R -D \"/var/lib/pgsql/9.6/data\" -U \"replicator\" -v -P --xlog-method=stream", "delta": "0:00:00.029059", "end": "2017-08-28 03:28:44.872909", "failed": true, "rc": 1, "start": "2017-08-28 03:28:44.843850", "stderr": "pg_basebackup: could not connect to server: could not connect to server: Connection refused\n\tIs the server running on host \"172.24.2.187\" and accepting\n\tTCP/IP connections on port 5432?", "stderr_lines": ["pg_basebackup: could not connect to server: could not connect to server: Connection refused", "\tIs the server running on host \"172.24.2.187\" and accepting", "\tTCP/IP connections on port 5432?"], "stdout": "", "stdout_lines": []} to retry, use: --limit @/root/dbs3-postgres-ha.retry

PLAY RECAP ***** dbs03.prodea-int.net : ok=11 changed=3 unreachable=0 failed=1
dbs04.prodea-int.net : ok=25 changed=8 unreachable=0 failed=1

frank3427 commented 7 years ago

one of the issues is that I do not see a failure for the master node. When looking at the activity logs on the server, I am not seeing hits for the tasks, nor am I seeing any changes to files.

frank3427 commented 7 years ago

I am installing as user root, if that makes a difference.

YanChii commented 7 years ago

Hi @frank3427, there are two different errors:

  1. Ansible did not stop after the error on the master node (join cluster nodes), even though any_errors_fatal: true is set. I've seen this before, and it is probably a bug in ansible itself.
  2. The "join cluster nodes" task expects the corosync.conf file to be present. But it should search for cluster.conf on centos 6.

The differences between the pcs/corosync/pacemaker stacks in centos 6 and 7 are subtle but have strong consequences. I've modified the role and now it runs smoothly even on centos 6. Please try it from the centos6 branch and let me know if it runs for you as well. But there's still one very important issue: the postgres master is not properly promoted and stays in the slave position. The PAF developers maintain a separate version for the older corosync stack, but even with this version installed, it is not working out of the box for me.
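The centos 6 adaptation of the membership check can be sketched roughly like this (an illustration, not the role's actual task: on the cman stack the membership lives as clusternode entries in /etc/cluster/cluster.conf rather than in /etc/corosync/corosync.conf; here a temp file stands in for the real config so the snippet is self-contained):

```shell
#!/bin/sh
# Sketch of an idempotent "join cluster nodes" check for CentOS 6 (cman stack).
# Assumption: membership is recorded as <clusternode name="..."/> entries in
# /etc/cluster/cluster.conf instead of /etc/corosync/corosync.conf.
# For illustration, $conf points at a temp file listing one existing node.
conf=$(mktemp)
echo '<clusternode name="dbs03.prodea-int.net" nodeid="1"/>' > "$conf"

check_node() {
    # Add the node only if it is not already listed in the cluster config.
    if grep -q "clusternode name=\"$1\"" "$conf" 2>/dev/null; then
        echo "$1: already in cluster"
    else
        echo "$1: would run: pcs cluster node add $1"
    fi
}

check_node "dbs03.prodea-int.net"
check_node "dbs04.prodea-int.net"
```

This avoids the double failure seen in the log above (grep complaining about a missing corosync.conf, followed by pcs refusing to add a node that is already a member).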

@frank3427 can you please help me with debugging that? I've seen several issues about this on the PAF github; maybe they will give you hints on what to do. Thanks. Jan

frank3427 commented 7 years ago

I rebuilt the vms so we could have a fresh look, and I will rebuild as needed to confirm changes are working from fresh starts. Setting the hacluster password did not work on the centos 6 servers:

[root@localhost ~]# pcs cluster auth dbs03.prodea-int.net dbs04.prodea-int.net -u hacluster
Password:
Error: dbs03.prodea-int.net: Username and/or password is incorrect
Error: dbs04.prodea-int.net: Username and/or password is incorrect
[root@localhost ~]#

type=USER_AUTH msg=audit(1503956924.156:576): user pid=8599 uid=0 auid=0 ses=3 msg='op=PAM:authentication acct="hacluster" exe="/usr/bin/ruby" hostname=? addr=? terminal=? res=failed'

TASK [postgres-ha6 : setup hacluster password] ***** changed: [dbs04.prodea-int.net] changed: [dbs03.prodea-int.net]

TASK [postgres-ha6 : setup cluster auth] *** fatal: [dbs04.prodea-int.net]: FAILED! => {"changed": true, "cmd": "pcs cluster auth dbs03.prodea-int.net dbs04.prodea-int.net -u hacluster -p \"Pr0d3aOps\"", "delta": "0:00:06.191536", "end": "2017-08-28 17:13:38.570556", "failed": true, "rc": 1, "start": "2017-08-28 17:13:32.379020", "stderr": "Error: dbs03.prodea-int.net: Username and/or password is incorrect\nError: dbs04.prodea-int.net: Username and/or password is incorrect", "stderr_lines": ["Error: dbs03.prodea-int.net: Username and/or password is incorrect", "Error: dbs04.prodea-int.net: Username and/or password is incorrect"], "stdout": "", "stdout_lines": []} fatal: [dbs03.prodea-int.net]: FAILED! => {"changed": true, "cmd": "pcs cluster auth dbs03.prodea-int.net dbs04.prodea-int.net -u hacluster -p \"Pr0d3aOps\"", "delta": "0:00:06.202676", "end": "2017-08-28 17:13:38.584373", "failed": true, "rc": 1, "start": "2017-08-28 17:13:32.381697", "stderr": "Error: dbs03.prodea-int.net: Username and/or password is incorrect\nError: dbs04.prodea-int.net: Username and/or password is incorrect", "stderr_lines": ["Error: dbs03.prodea-int.net: Username and/or password is incorrect", "Error: dbs04.prodea-int.net: Username and/or password is incorrect"], "stdout": "", "stdout_lines": []} to retry, use: --limit @/root/dbs3-postgres-ha.retry

PLAY RECAP ***** dbs03.prodea-int.net : ok=9 changed=5 unreachable=0 failed=1
dbs04.prodea-int.net : ok=7 changed=5 unreachable=0 failed=1

[root@AnsibleServer ~]#

so to get past this I manually reset the password on each server, and then:

[root@localhost ~]# pcs cluster auth dbs03.prodea-int.net dbs04.prodea-int.net -u hacluster
Password:
dbs03.prodea-int.net: Authorized
dbs04.prodea-int.net: Authorized

so I think the method for generating the hash is not working. I used the following: openssl passwd -1 -salt xyz (password), and that gets me past this issue and on to the next one.
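The workaround described above can be reproduced like this (a sketch; the salt "xyz" and the password "secret" are placeholders — the point is that the result must carry the $1$&lt;salt&gt;$ MD5-crypt prefix before it is pushed to the hacluster account on the nodes):

```shell
#!/bin/sh
# Generate an MD5-crypt password hash as described above
# (salt "xyz" and password "secret" are placeholders).
hash=$(openssl passwd -1 -salt xyz secret)
echo "$hash"

# A hash usable by usermod -p (or Ansible's user module password parameter)
# must have the $1$<salt>$ prefix; verify the format before using it.
case "$hash" in
    '$1$xyz$'*) echo "hash format ok" ;;
    *)          echo "unexpected hash format" ;;
esac
```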

frank3427 commented 7 years ago

we are now back to the postgres initdb task.

trying

tail -f /var/log/messages on the master host (dbs03) shows no activity after:

Aug 28 22:58:02 localhost ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=pcs cluster auth dbs03.prodea-int.net dbs04.prodea-int.net -u hacluster -p "Pr0d3aOps" removes=None creates=None chdir=None
Aug 28 22:58:04 localhost ansible-command: Invoked with creates=/etc/corosync/corosync.conf executable=None _uses_shell=True _raw_params=pcs cluster --force setup --name pgcluster "dbs03.prodea-int.net" removes=None warn=True chdir=None
Aug 28 22:58:13 localhost ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=/bin/sh -c "if ! grep -q 'ring0_addr[:] dbs04.prodea-int.net[\t ]$' /etc/corosync/corosync.conf; then pcs cluster node add dbs04.prodea-int.net; fi" removes=None creates=None chdir=None

YanChii commented 7 years ago

All outputs look exactly the same as before, so you are still running the old version of the role. Are you sure you have switched the git branch to centos6? The "setup cluster auth" task should not run anymore on centos 6.
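Switching and verifying the branch can be checked like this (shown on a throwaway local repo so the snippet is self-contained; in practice you would run only the last two commands inside your clone of ansible-role-postgres-ha):

```shell
#!/bin/sh
set -e
# Illustrative only: a throwaway repo stands in for a clone of the role,
# with a local "centos6" branch playing the part of the remote one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "init"
git branch centos6

# The actual steps to run in the role directory:
git checkout -q centos6
git rev-parse --abbrev-ref HEAD   # prints the branch currently checked out
```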