nasjomach commented 3 years ago

Concerns: cluster-glue/lib/plugins/stonith/external/ec2

It seems to me that there is no retry mechanism in the EC2 OCF script. AWS EC2 API calls can be throttled if more than 10000 API requests per second are made. In that case the script does not report any status and considers the resource to be in a bad state, which ends with the STONITH device getting stopped.

Performing a "resource cleanup" operation brings the STONITH device back to an operational state after such failures.

/var/log/messages:

2021-09-16T16:02:04.751248+00:00 external/ec2(res_AWS_STONITH)[31700]: info: status check for is <-- Missing instance status report after "is" keyword

2021-09-16T16:02:04.760725+00:00 external/ec2(res_AWS_STONITH)[31694]: WARN: Already fenced (Instance status = ). Aborting fence attempt.

2021-09-16T16:02:13.742017+00:00 external/ec2(res_AWS_STONITH)[32004]: ERROR: Operation status failed: 1

Some kind of fault tolerance would be nice to have, I guess.
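To illustrate the kind of fault tolerance meant here, below is a minimal Python sketch of retrying a status query with exponential backoff and jitter. It is not the plugin's code: `query_status` is a hypothetical callable standing in for whatever EC2 status lookup the script performs, and the attempt count and delays are arbitrary assumptions. The point is that an empty status (as in the log lines above) is treated as a transient failure and retried instead of being reported straight away.

```python
import random
import time


def status_with_retry(query_status, attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call query_status() until it returns a non-empty status string.

    query_status is a hypothetical stand-in for the EC2 status lookup.
    It is assumed to return "" or raise when the API call is throttled.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            status = query_status()
            if status:
                return status
            # Empty output is treated as a transient (possibly throttled) reply.
            last_error = RuntimeError("empty instance status")
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter so
            # that several cluster nodes do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise last_error


if __name__ == "__main__":
    # Simulated lookup: two empty replies, then a real status.
    replies = iter(["", "", "running"])
    print(status_with_retry(lambda: next(replies), sleep=lambda s: None))
```

The retries have to fit inside the operation timeout configured for the STONITH resource, so the attempt count and maximum delay would need to be chosen with that timeout in mind.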

dmuhamedagic commented 3 years ago

IIRC, none of the stonith plugins does that, i.e. runs in a loop until the status is correct, so this would set a precedent. A question: how often do you check the status? If it's too often and the device (in this case AWS) is flaky, then you may try increasing the interval.

Thr3d commented 2 years ago

#35 addresses this.

The API bucket the agent uses is shared across the account's whole region and is fairly small, so simply extending the interval doesn't help much after a point.

ClusterLabs / cluster-glue

ec2 ocf resource retry #33
