ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0
493 stars 582 forks source link

PSQL cluster failed without promoting secondary to master #1321

Open dancica opened 5 years ago

dancica commented 5 years ago

Hi, Can you help me with troubleshooting postgres pacemaker cluster failure? Today cluster failed without promoting secondary to master. At the same time appeared ldap time out. Here are the logs, master was stopped by pacemaker at 10:03:40 AM UTC. Thank you in advance.

corosync.log

Apr 17 10:03:34 master crmd[12481]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1) Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1) Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1) Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1) Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1) Apr 17 10:03:34 master pengine[12480]: notice: Recover PGSQL:1 (Slave secondary) Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3461: /var/lib/pacemaker/pengine/pe-input-58.bz2 Apr 17 10:03:34 master pengine[12480]: notice: On loss of CCM Quorum: Ignore Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1) Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1) Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1) Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1) Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1) Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1) Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1) Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary) Apr 17 10:03:34 master pengine[12480]: notice: Calculated Transition 3462: /var/lib/pacemaker/pengine/pe-input-59.bz2 Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000 process (PID 32372) timed out Apr 17 10:03:40 master lrmd[12477]: warning: PGSQL_monitor_15000:32372 - timed out after 10000ms Apr 17 10:03:40 master crmd[12481]: notice: Transition aborted by PGSQL_monitor_15000 'modify' on master: Old event (magic=2:1;8:7:8:319e4083-ccc0-440a-ae43-1bbd39275fe7, cib=0.93.14, source=process_graph_event:605, 0) Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32400-25) Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400] Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168] Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25) Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32400-25) state:2 Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9) Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-23321-32400-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-23321-32400-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-23321-32400-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32400-25) Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32400] Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168] Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32400-25) Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32400-25) state:2 Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9) Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-23321-32400-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-23321-32400-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-23321-32400-25-header Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: _get_controldata: found: { Apr 17 10:03:40 master pgsqlms(PGSQL)[32393]: DEBUG: pgsql_notify: environment variables: { Apr 17 10:03:40 master crmd[12481]: notice: Operation PGSQL_notify_0: ok (node=master, call=52, rc=0, cib-update=0, confirmed=true) Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result() timed out Apr 17 10:03:40 master pengine[12480]: notice: On loss of CCM Quorum: Ignore Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1) Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:0 on master: unknown error (1) Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1) Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1) Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1) Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from master after 1 failures (max=1) Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary away from master after 1 failures (max=1) Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away from secondary after 1 failures (max=1) Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1) Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from secondary after 1 failures (max=1) Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master) Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master -> Stopped master) Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary) Apr 17 10:03:40 master pengine[12480]: notice: Calculated Transition 3463: /var/lib/pacemaker/pengine/pe-input-60.bz2 Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32414-25) Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32414] Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168] Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32414-25) Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32414-25) state:2 Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9) Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-23321-32414-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-23321-32414-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-23321-32414-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32414-25) Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32414] Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168] Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32414-25) Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32414-25) state:2 Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9) Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-23321-32414-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-23321-32414-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-23321-32414-25-header Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _get_controldata: found: { Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: pgsql_notify: environment variables: { Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432" Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: pgsql_monitor: instance "PGSQL" is listening Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _query: SELECT pg_is_in_recovery() Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _query: psql return code: 0 Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _query: @res: [ Apr 17 10:03:40 master pgsqlms(PGSQL)[32407]: DEBUG: _confirm_role: instance PGSQL is a primary Apr 17 10:03:40 master crmd[12481]: notice: Operation PGSQL_notify_0: ok (node=master, call=53, rc=0, cib-update=0, confirmed=true) Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32438-25) Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32438] Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168] Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32438-25) Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32438-25) state:2 Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9) Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-response-23321-32438-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-event-23321-32438-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cpg-request-23321-32438-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] IPC credentials authenticated (23321-32438-25) Apr 17 10:03:40 master corosync[23321]: [QB ] connecting to client [32438] Apr 17 10:03:40 master corosync[23321]: [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168 Apr 17 10:03:40 master corosync[23321]: message repeated 2 times: [ [QB ] shm size:1048589; real_size:1052672; rb->word_size:263168] Apr 17 10:03:40 master corosync[23321]: [QB ] HUP conn (23321-32438-25) Apr 17 10:03:40 master corosync[23321]: [QB ] qb_ipcs_disconnect(23321-32438-25) state:2 Apr 17 10:03:40 master corosync[23321]: [QB ] epoll_ctl(del): Bad file descriptor (9) Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-23321-32438-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-23321-32438-25-header Apr 17 10:03:40 master corosync[23321]: [QB ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-23321-32438-25-header Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _get_controldata: found: { Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432" Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_monitor: instance "PGSQL" is listening Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _query: SELECT pg_is_in_recovery() Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _query: psql return code: 0 Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _query: @res: [ Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _confirm_role: instance PGSQL is a primary Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_demote: "PGSQL" currently running as a primary Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _get_action_timeout: known timeout: 120 Apr 17 10:03:40 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata /var/lib/postgresql/9.5/main --mode fast -w --timeout 180 stop" Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running. Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control process exited, code=exited status=2 Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit entered failed state. Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with result 'exit-code'. Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432" Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_monitor: instance "PGSQL" is not listening Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_isready -h /var/run/postgresql/ -p 5432" Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_monitor: instance "PGSQL" is not listening Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _get_controldata: found: { Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: pgsql_start: instance "PGSQL" is not running, starting it as a secondary Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _create_recovery_conf: get replication configuration from the template file "/etc/postgresql/9.5/main/recovery.conf.pcmk" Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _create_recovery_conf: write the replication configuration to "/var/lib/postgresql/9.5/main/recovery.conf" file Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _get_action_timeout: known timeout: 120 Apr 17 10:03:41 master pgsqlms(PGSQL)[32431]: DEBUG: _runas: launching as "postgres" command "/usr/lib/postgresql/9.5/bin/pg_ctl --pgdata /var/lib/postgresql/9.5/main -w --timeout 180 s

postgres.log

2019-04-17 10:03:40 UTC LOG: received fast shutdown request 2019-04-17 10:03:40 UTC LOG: aborting any active transactions 2019-04-17 10:03:40 UTC FATAL: terminating connection due to administrator command 2019-04-17 10:03:40 UTC LOG: autovacuum launcher shutting down 2019-04-17 10:03:40 UTC LOG: shutting down 2019-04-17 10:03:40 UTC LOG: database system is shut down syslog_1704.txt

Installed packages:

hi corosync 2.3.5-3ubuntu2.1 amd64 cluster engine daemon and utilities ii libcorosync-common4:amd64 2.3.5-3ubuntu2.1 amd64 cluster engine common library hi crmsh 2.2.0-1 amd64 CRM shell for the pacemaker cluster manager hi pacemaker 1.1.14-2ubuntu1.4 amd64 cluster resource manager ii pacemaker-cli-utils 1.1.14-2ubuntu1.4 amd64 cluster resource manager command line utilities ii pacemaker-common 1.1.14-2ubuntu1.4 all cluster resource manager common files ii pacemaker-resource-agents 1.1.14-2ubuntu1.4 all cluster resource manager general resource agents

OS version:

Distributor ID: Ubuntu Description: Ubuntu 16.04.5 LTS Release: 16.04 Codename: xenial

Cluster configuration:

node 1: secondary \ attributes master-PGSQL=1000 node 2: master \ attributes master-PGSQL=1001 primitive AWSVIP awsvip \ params secondary_private_ip=10.x.x.x api_delay=5 primitive PGSQL pgsqlms \ params pgdata="/var/lib/postgresql/9.5/main" bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \ op start timeout=60s interval=0 \ op stop timeout=60s interval=0 \ op promote timeout=15s interval=0 \ op demote timeout=120s interval=0 \ op monitor interval=15s timeout=10s role=Master \ op monitor interval=16s timeout=10s role=Slave \ op notify timeout=60 interval=0 primitive fencing-master stonith:external/ec2 \ params port=master \ op start interval=0s timeout=60s \ op monitor interval=360s timeout=60s \ op stop interval=0s timeout=60s primitive fencing-secondary stonith:external/ec2 \ params port=secondary \ op start interval=0s timeout=60s \ op monitor interval=360s timeout=60s \ op stop interval=0s timeout=60s ms PGSQL-HA PGSQL \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master order demote-then-stop-ip Mandatory: _rscset PGSQL-HA:demote AWSVIP:stop symmetrical=false location loc-fence-master fencing-master -inf: master location loc-fence-secondary fencing-secondary -inf: secondary order promote-then-ip Mandatory: _rscset PGSQL-HA:promote AWSVIP:start symmetrical=false property cib-bootstrap-options: \ have-watchdog=false \ dc-version=1.1.14-70404b0 \ cluster-infrastructure=corosync \ cluster-name=cgc-staging-psql \ stonith-enabled=true \ no-quorum-policy=ignore \ last-lrm-refresh=1555497766 \ maintenance-mode=false rsc_defaults rsc-options: \ resource-stickiness=10 \ migration-threshold=1

dancica commented 5 years ago

syslog_1704.txt