ChenHuajun / pha4pgsql

Pacemaker High Availability for PostgreSQL
GNU General Public License v3.0

How to configure writer-vip/reader-vip #16

Open scarletfrank opened 3 years ago

scarletfrank commented 3 years ago

Thanks for these scripts. I'm trying to set up a three-node PG HA cluster, and following the tutorial I hit an error at this step.

The error output:

[root@mdw pha4pgsql]# cls_start
resource msPostgresql is NOT running
resource msPostgresql is NOT running
resource msPostgresql is NOT running
Cleaned up pgsql:0 on sdw2
Cleaned up pgsql:0 on sdw1
Cleaned up pgsql:0 on mdw
Cleaned up pgsql:1 on sdw2
Cleaned up pgsql:1 on sdw1
Cleaned up pgsql:1 on mdw
Cleaned up pgsql:2 on sdw2
Cleaned up pgsql:2 on sdw1
Cleaned up pgsql:2 on mdw
Error: resource 'msPostgresql' is not running on any node
failed to execute "pcs resource enable msPostgresql --wait" rc=1

I tried to troubleshoot this myself, but I suspect I'm misunderstanding some of the parameters, so I'm opening an issue... My main questions are:

  1. When a client accesses writer-vip or reader-vip, how are they translated to a concrete IP, and is there anything to watch out for when setting the virtual IP parameters?

  2. My setup is three VMs (company-provided, probably carved out of one physical machine), each with its own IP (192.168.x.94/95/96); `ifconfig` shows all of them on ens160. Can a three-node PG HA cluster be built this way?

I suspect my main mistake is blindly setting writer_vip=192.168.x.100 and reader_vip=192.168.x.101 without understanding how they should be chosen. From the existing issues it looks like some gateway configuration may be needed? But judging by the failure, the resources never started at all, so I really need the pcs logs... I'll go look for more material and follow up.

What follows is my own understanding plus some extra details about my environment.

In the diagram in clusterlab's PG replicated-cluster guide, virtual IP 1 sits on eth0 and virtual IP 2 on eth2. In your tutorial, node1/2/3 and writer/reader_vip are all within 192.168.0.231-237, so I'm wondering whether, in your setup, accessing writer_vip/reader_vip actually ends up at one of node1/2/3.

I then ran the following command to check the cluster status.

# cls_status
Cluster name: pgcluster
Stack: corosync
Current DC: sdw1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Nov  4 11:55:19 2020
Last change: Wed Nov  4 11:44:41 2020 by root via cibadmin on mdw

3 nodes configured
5 resources configured

Online: [ mdw sdw1 sdw2 ]

Full list of resources:

 vip-master (ocf::heartbeat:IPaddr2):   Stopped
 vip-slave  (ocf::heartbeat:IPaddr2):   Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Stopped: [ mdw sdw1 sdw2 ]

Failed Resource Actions:
* pgsql_start_0 on mdw 'unknown error' (1): call=45, status=Timed Out, exitreason='',
    last-rc-change='Wed Nov  4 11:43:49 2020', queued=0ms, exec=60001ms
* pgsql_start_0 on sdw2 'unknown error' (1): call=45, status=Timed Out, exitreason='',
    last-rc-change='Wed Nov  4 11:43:49 2020', queued=0ms, exec=60001ms
* pgsql_start_0 on sdw1 'unknown error' (1): call=45, status=Timed Out, exitreason='',
    last-rc-change='Wed Nov  4 11:43:49 2020', queued=0ms, exec=60002ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
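
The failed-action lines above (`status=Timed Out`, `exec=60001ms`) only say that pgsql's start operation hit its 60s timeout on every node; the real cause is in each node's logs. A sketch of where one might look on these hosts (the log paths are assumptions based on the config.ini that follows, not confirmed for this environment):

```shell
# The failed-action lines report only that pgsql_start_0 timed out; the real
# error is in the node logs. Typical places to look (paths are assumptions):
#   journalctl -u pacemaker --since "2020-11-04 11:43"   # resource-agent output
#   tail -n 100 /pgsql/data/log/postgresql-*.log         # PostgreSQL's own log
#   pcs resource debug-start pgsql --full                # re-run start verbosely
#
# Pulling the operation runtime out of a pasted failed-action line:
line="pgsql_start_0 on mdw 'unknown error' (1): call=45, status=Timed Out, queued=0ms, exec=60001ms"
exec_ms=$(printf '%s\n' "$line" | sed -n 's/.*exec=\([0-9]*\)ms.*/\1/p')
echo "start op ran ${exec_ms} ms before Pacemaker gave up"
```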

The config.ini I'm using:

pcs_template=muti.pcs.template
OCF_ROOT=/usr/lib/ocf
RESOURCE_LIST="msPostgresql vip-master vip-slave"
pha4pgsql_dir=/opt/pha4pgsql
writer_vip=192.168.x.100 
reader_vip=192.168.x.101 
node1=mdw
node2=sdw1
node3=sdw2
othernodes=""
vip_nic=ens160
vip_cidr_netmask=24
pgsql_pgctl=/usr/pgsql-12/bin/pg_ctl
pgsql_psql=/usr/pgsql-12/bin/psql
pgsql_pgdata=/pgsql/data
pgsql_pgport=5432
pgsql_restore_command=""
pgsql_rep_mode=sync
pgsql_repuser=replication
pgsql_reppassord=replication
ChenHuajun commented 3 years ago

> When a client accesses writer-vip or reader-vip, how are they translated to a concrete IP, and is there anything to watch out for when setting the virtual IP parameters?

A VIP is not fundamentally different from a real IP; it is simply not fixed to one machine, but floats between the cluster's machines according to cluster state. To use VIPs it is enough that all nodes in the cluster are on the same subnet.
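
As an illustration of the same-subnet requirement, a quick shell check (the 192.168.199.x addresses are the ones given later in this thread, and the /24 assumption matches vip_cidr_netmask=24 in the config.ini above):

```shell
# Crude same-/24 check: with a 24-bit mask the network is the first three
# octets, so a VIP and a node IP share a subnet iff those octets match.
# Addresses below are the ones from this thread (vip_cidr_netmask=24).
same_net24() { [ "${1%.*}" = "${2%.*}" ]; }

if same_net24 192.168.199.100 192.168.199.94; then
    echo "VIP and node are on the same /24 subnet"
else
    echo "VIP and node are on different subnets"
fi
```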

As for the problem you ran into, the logs are needed to diagnose the cause, but it doesn't look like a VIP problem.

Also, Pacemaker-based PG HA became popular quite early on; nowadays a Patroni-based approach is recommended instead. See the link below for deployment steps. etcd should be deployed as a 3-node HA cluster and can be co-located with the three database nodes.

https://github.com/ChenHuajun/chenhuajun.github.io/blob/master/_posts/2020-09-07-%E5%9F%BA%E4%BA%8EPatroni%E7%9A%84PostgreSQL%E9%AB%98%E5%8F%AF%E7%94%A8%E7%8E%AF%E5%A2%83%E9%83%A8%E7%BD%B2.md

scarletfrank commented 3 years ago

Thanks, I'll try Patroni later as well.

What confused me when comparing our config.pcs against the PG replicated cluster's is that our two virtual IPs sit on the same NIC, while in the latter the first VIP is for read/write traffic (on eth0) and the second is for replication/backup (on eth2). What accounts for the difference?


# The config.pcs generated on my cluster
pcs cluster cib pgsql_cfg

pcs -f pgsql_cfg property set no-quorum-policy="stop"
pcs -f pgsql_cfg property set stonith-enabled="false"
pcs -f pgsql_cfg resource defaults resource-stickiness="1"
pcs -f pgsql_cfg resource defaults migration-threshold="10"

pcs -f pgsql_cfg resource create vip-master IPaddr2 \
   ip="192.168.x.100" \
   nic="ens160" \
   cidr_netmask="24" \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create vip-slave IPaddr2 \
   ip="192.168.x.101" \
   nic="ens160" \
   cidr_netmask="24" \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create pgsql expgsql \
   pgctl="/usr/pgsql-12/bin/pg_ctl" \
   psql="/usr/pgsql-12/bin/psql" \
   pgdata="/pgsql/data" \
   pgport="5432" \
   rep_mode="sync" \
   node_list="mdw sdw1 sdw2 " \
   restore_command="" \
   primary_conninfo_opt="user=replication password=replication keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
   master_ip="192.168.x.100" \
   restart_on_promote="false" \
   enable_distlock="" \
   distlock_lock_cmd="/opt/pha4pgsql/tools/distlock '' lock distlock: @owner 9 12" \
   distlock_unlock_cmd="/opt/pha4pgsql/tools/distlock '' unlock distlock: @owner" \
   distlock_lockservice_deadcheck_nodelist="mdw sdw1 sdw2 " \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="4s" on-fail="restart" \
   op monitor timeout="60s" interval="3s"  on-fail="restart" role="Master" \
   op promote timeout="60s" interval="0s"  on-fail="restart" \
   op demote  timeout="60s" interval="0s"  on-fail="stop" \
   op stop    timeout="60s" interval="0s"  on-fail="block" \
   op notify  timeout="60s" interval="0s"

pcs -f pgsql_cfg resource master msPostgresql pgsql \
   master-max=1 master-node-max=1 clone-node-max=1 notify=true \
   migration-threshold="3" target-role="Master"

pcs -f pgsql_cfg constraint colocation add vip-master with Master msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start vip-master symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote  msPostgresql then stop  vip-master symmetrical=false score=0

pcs -f pgsql_cfg constraint colocation add vip-slave with Slave msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote  msPostgresql then start vip-slave symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order stop msPostgresql then stop vip-slave symmetrical=false score=0

pcs -f pgsql_cfg constraint location  vip-slave rule id="loc-vip-slave-rule" score=1000 master-pgsql eq "HS:sync"

pcs cluster cib-push pgsql_cfg
# clusterlab's config.pcs
pcs cluster cib pgsql_cfg

pcs -f pgsql_cfg property set no-quorum-policy="ignore"
pcs -f pgsql_cfg property set stonith-enabled="false"
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
pcs -f pgsql_cfg resource defaults migration-threshold="1"

pcs -f pgsql_cfg resource create vip-master IPaddr2 \
   ip="192.168.0.3" \
   nic="eth0" \
   cidr_netmask="24" \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create vip-rep IPaddr2 \
   ip="192.168.2.3" \
   nic="eth2" \
   cidr_netmask="24" \
   meta migration-threshold="0" \
   op start   timeout="60s" interval="0s"  on-fail="stop" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="ignore"

pcs -f pgsql_cfg resource create pgsql pgsql \
   pgctl="/usr/bin/pg_ctl" \
   psql="/usr/bin/psql" \
   pgdata="/var/lib/pgsql/data/" \
   rep_mode="sync" \
   node_list="node1 node2" \
   restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
   primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
   master_ip="192.168.2.3" \
   restart_on_promote='true' \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="4s" on-fail="restart" \
   op monitor timeout="60s" interval="3s"  on-fail="restart" role="Master" \
   op promote timeout="60s" interval="0s"  on-fail="restart" \
   op demote  timeout="60s" interval="0s"  on-fail="stop" \
   op stop    timeout="60s" interval="0s"  on-fail="block" \
   op notify  timeout="60s" interval="0s"

pcs -f pgsql_cfg resource master msPostgresql pgsql \
   master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs -f pgsql_cfg resource group add master-group vip-master vip-rep

pcs -f pgsql_cfg constraint colocation add master-group with Master msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote  msPostgresql then stop  master-group symmetrical=false score=0

pcs cluster cib-push pgsql_cfg
ChenHuajun commented 3 years ago

"192.168.x.100" ip地址里怎么有一个x ?

> What confused me when comparing our config.pcs against the PG replicated cluster's is that our two virtual IPs sit on the same NIC, while in the latter the first VIP is for read/write traffic (on eth0) and the second is for replication/backup (on eth2). What accounts for the difference?

In PgSQL_Replicated_Cluster both VIPs are bound to the master. The machines in that environment have a separate network, and the two VIPs isolate the master's traffic streams (application traffic vs. backup/replication traffic):

pcs -f pgsql_cfg resource group add master-group vip-master vip-rep

In pha4pgsql, one of the two VIPs is the read/write VIP bound to the master, and the other is a read-only VIP bound to a standby, which lets you use the read-only VIP for read/write splitting. If the application has no need for read/write splitting, you can also leave the read-only VIP unbound.
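
For read/write splitting over the two VIPs, the client simply targets different hosts for reads and writes. A toy sketch (the .100/.101 addresses come from this thread; the `route_host` helper and its SELECT-prefix heuristic are illustrative assumptions, not part of pha4pgsql):

```shell
WRITER_VIP=192.168.199.100   # floats with the master (read/write)
READER_VIP=192.168.199.101   # floats with a sync standby (read-only)

# Toy router: statements starting with SELECT go to the reader VIP,
# everything else to the writer VIP. Real applications usually decide
# this per connection pool, not per statement.
route_host() {
    case "$1" in
        [Ss][Ee][Ll][Ee][Cc][Tt]*) echo "$READER_VIP" ;;
        *)                         echo "$WRITER_VIP" ;;
    esac
}

route_host "SELECT count(*) FROM t"    # -> 192.168.199.101
route_host "INSERT INTO t VALUES (1)"  # -> 192.168.199.100
# e.g. psql -h "$(route_host "$sql")" -p 5432 -U app -c "$sql"
```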

scarletfrank commented 3 years ago

"192.168.x.100" ip地址里怎么有一个x ?

我在和PG备份集群对比config.pcs时就很疑惑了,我们的两个虚拟IP,都放在同一个nic上;而后者,则是第一个vip用于读写(放在eth0),第二个vip用于备份(放在eth2)。是什么造成了二者的差异?

PgSQL_Replicated_Cluster里的2个vip都绑在master上,这个环境里机器有个独立的网络,通过2个vip隔离master的网络流量(业务流程 + 备份复制流程)。

pcs -f pgsql_cfg resource group add master-group vip-master vip-rep

pha4pgsql里的2个vip其中一个作为读写vip绑在master上,另一个作为只读vip绑在从上,用于可以利用只读vip做读写分离。如果应用没有读写分离需求,也可以不绑只读vip。

The x is just one fixed digit; thinking about it, there was no real need to redact it... it's the same everywhere. I used a single subnet throughout, i.e. 192.168.199.94-96 for the nodes and 192.168.199.100-101 for the VIPs.

Thanks, my application does need read/write splitting.