Open nasjomach opened 3 years ago
IIRC, none of the stonith plugins does that, i.e. runs in a loop until the status is correct, so this would set a precedent. A question: how often do you check the status? If it's too often and the device (in this case AWS) is flaky, then you may try increasing the interval.
The API bucket the agent uses is shared across the account's whole region and is fairly small, so simply extending the interval doesn't help much after a point.
Concerns: cluster-glue/lib/plugins/stonith/external/ec2
It seems to me that there is no retry mechanism in the EC2 OCF script. AWS EC2 API calls can be throttled if more than 10000 API requests per second are made. In that case the script would not report any status and would consider the resource to be in a bad state, which ends with the STONITH device getting stopped.
Performing a "resource cleanup" operation brings the STONITH device back into an operational state after such failures.
/var/log/messages:
2021-09-16T16:02:04.751248+00:00 external/ec2(res_AWS_STONITH)[31700]: info: status check for is
<-- Missing instance status report after "is" keyword
2021-09-16T16:02:04.760725+00:00 external/ec2(res_AWS_STONITH)[31694]: WARN: Already fenced (Instance status = ). Aborting fence attempt.
2021-09-16T16:02:13.742017+00:00 external/ec2(res_AWS_STONITH)[32004]: ERROR: Operation status failed: 1
Maybe some kind of fault tolerance would be nice to have I guess.
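To make the idea concrete, here is a minimal sketch of what such fault tolerance could look like: retry the status query with exponential backoff and jitter, and treat an empty status as a transient failure rather than a definitive answer. It is written in Python purely for illustration and is not the plugin's actual code. `query_status`, `get_status_with_retry`, `TransientStatusError` and all parameter defaults are hypothetical names and values chosen for this example.

```python
import random
import time


class TransientStatusError(Exception):
    """Raised when every attempt returned an empty status."""


def get_status_with_retry(query_status, max_attempts=5, base_delay=1.0,
                          max_delay=30.0, sleep=time.sleep):
    """Call query_status() until it returns a non-empty status string.

    query_status is a hypothetical callable standing in for the agent's
    EC2 status lookup. It is assumed to return the instance state, or an
    empty string / None when the API call was throttled.
    """
    for attempt in range(1, max_attempts + 1):
        status = query_status()
        if status:
            return status
        if attempt == max_attempts:
            break
        # Exponential backoff capped at max_delay, with full jitter so that
        # several nodes sharing one API quota do not retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
    raise TransientStatusError(
        "no instance status after %d attempts" % max_attempts)
```

With something like this in place, the monitor operation would only fail after several consecutive empty responses. The randomized delay matters here because, as noted above, the API quota is shared across the region, so fixed-interval retries from several nodes could keep colliding.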