jayan800 commented 1 year ago

Hi All,

We are running a two-node Pacemaker cluster in AWS and use the "awsvip" resource type to configure the VIP. The configuration is below:

pcs resource show privip_node1

Resource: privip_node1 (class=ocf provider=heartbeat type=awsvip)
 Attributes: secondary_private_ip=10.x.x.x
 Operations: migrate_from interval=0s timeout=30s (privip_node1-migrate_from-interval-0s)
             migrate_to interval=0s timeout=30s (privip_node1-migrate_to-interval-0s)
             monitor interval=20s timeout=30s (privip_node1-monitor-interval-20s)
             start interval=0s timeout=30s (privip_node1-start-interval-0s)
             stop interval=0s timeout=30s (privip_node1-stop-interval-0s)
             validate interval=0s timeout=10s (privip_node1-validate-interval-0s)

pcs resource show node1_vip

Resource: node1_vip (class=ocf provider=heartbeat type=IPaddr2)
 Attributes: ip=10.x.x.x
 Operations: monitor interval=10s timeout=20s (node1_vip-monitor-interval-10s)
             start interval=0s timeout=20s (node1_vip-start-interval-0s)
             stop interval=0s timeout=20s (node1_vip-stop-interval-0s)

The EC2 instance is configured to use IMDSv2. The fence_aws agent and the resource-agents package have also been upgraded to the most recent versions, which support IMDSv2. Additionally, the resource is set up to use the IAM profile credentials.

fence-agents-aws-4.2.1-41.el7_9.3.x86_64
python-s3transfer-0.1.13-1.0.1.el7.noarch
resource-agents-4.1.1-61.el7_9.15.x86_64

pip list | grep -i boto
boto3 (1.10.0)
botocore (1.13.50)

aws --version
aws-cli/2.9.4 Python/3.9.11 Linux/3.10.0-1160.80.1.0.1.el7.x86_64 exe/x86_64.oracle.7 prompt/off

pip3 list | grep -i boto
boto3 1.23.10
botocore 1.26.10

The privip resource consistently fails, with different errors each time:

pengine: warning: unpack_rsc_op_failure: Processing failed monitor of privip_node2 on node2: unknown error | rc=1
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000 process (PID 109357) timed out
Apr 13 11:09:54 node2 lrmd[3773]: warning: privip_node2_monitor_20000:109357 - timed out after 30000ms

Jun 16 10:01:43 node2 lrmd[36967]: notice: privip_node2_monitor_20000:13042:stderr [ Unable to locate credentials. You can configure credentials by running "aws configure". ]
Jun 16 10:01:43 node2 crmd[36970]: notice: privip_node2_monitor_20000:91 [ % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r100 359 100 359 0 0 37513 0 --:--:-- --:--:-- --:--:-- 39888\n\nUnable to locate credentials. You can configure credentials by running "aws configure".\n ]

Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ #015 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to 169.254.169.254:80; Connection refused ]
Jun 22 10:10:10 node1 lrmd[12465]: notice: privip_node1_monitor_20000:105561:stderr [ An error occurred (MissingParameter) when calling the DescribeInstances operation: The request must contain the parameter InstanceId ]

Failed Resource Actions:

privip_node1_start_0 on node1 'not running' (7): call=250, status=complete, exitreason='instance_id not found. Is this a EC2 instance?', last-rc-change='Fri May 26 07:27:46 2023', queued=0ms, exec=6597ms

Any advice would be great.

oalbrigt commented 1 year ago

Try running pcs resource debug-start --full <resource>. That should show you all the commands it's running, and hopefully some pointers to what's wrong.
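Since two of the logged errors involve the metadata endpoint at 169.254.169.254 ("Connection refused" and "instance_id not found"), it may also be worth checking by hand that IMDSv2 answers from the node. The following is only a minimal sketch, not what the agent itself runs: `imds_get` and `IMDS_URL` are hypothetical names chosen for illustration, while the token endpoint and headers are the standard IMDSv2 ones.

```shell
#!/bin/sh
# Hypothetical helper for a manual IMDSv2 check (not part of the awsvip agent).
IMDS_URL="${IMDS_URL:-http://169.254.169.254}"

imds_get() {
    # $1 = metadata path, e.g. "instance-id" or "iam/security-credentials/"
    # Step 1: IMDSv2 requires a session token, obtained with a PUT request.
    imds_token=$(curl -sf -m 5 -X PUT "$IMDS_URL/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || {
        echo "IMDS token request failed" >&2
        return 1
    }
    # Step 2: send the token along with the actual metadata request.
    curl -sf -m 5 -H "X-aws-ec2-metadata-token: $imds_token" \
        "$IMDS_URL/latest/meta-data/$1" || {
        echo "IMDS request for $1 failed" >&2
        return 1
    }
}

# Example usage on the cluster node (uncomment to run):
# imds_get instance-id
# imds_get iam/security-credentials/
```

If `imds_get instance-id` fails intermittently when run in a loop, that would point at the metadata service or local firewall rules rather than at the agent; if `imds_get iam/security-credentials/` prints no role name, the "Unable to locate credentials" message would be the expected consequence.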

jayan800 commented 1 year ago

Thank you.

The debug command completed without any errors.

Is there anything else to check?

oalbrigt commented 1 year ago

You can run pcs resource update <resource> trace_ra=1 and then disable/enable or restart the resource.

The trace files will be available in /var/lib/heartbeat/trace_ra/.
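Once tracing is on, something like the following could help pick out the newest trace file for a given action. This is a rough sketch: `latest_trace` and `TRACE_DIR` are hypothetical names, and it assumes trace files are grouped under one subdirectory per agent type and named `<resource>.<action>.<timestamp>`, which may differ between versions or distributions.

```shell
#!/bin/sh
# Assumed default trace location; override TRACE_DIR if your system differs.
TRACE_DIR="${TRACE_DIR:-/var/lib/heartbeat/trace_ra}"

latest_trace() {
    # $1 = resource name, $2 = action (monitor, start, stop, ...)
    # Lists matching trace files newest first and prints only the first one.
    # Prints nothing if no trace file matches.
    ls -t "$TRACE_DIR"/*/"$1"."$2".* 2>/dev/null | head -n 1
}

# Example usage on the cluster node (uncomment to run):
# pcs resource update privip_node1 trace_ra=1
# pcs resource restart privip_node1
# less "$(latest_trace privip_node1 monitor)"
```

Because the failures reported here are intermittent monitor timeouts, leaving tracing on until the next failure and then reading the monitor trace from that timestamp is probably more useful than a single restart.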

jayan800 commented 1 year ago

Thank you. I will enable the trace. Fingers crossed.

ClusterLabs / resource-agents

AWS Pacemaker awsvip failing with different errors #1876
