ClusterLabs / cluster-glue

Reusable Cluster Components ("glue")
http://clusterlabs.org/
GNU General Public License v2.0

ec2: handle API errors while running describe-instances #35

Closed Thr3d closed 2 years ago

Thr3d commented 2 years ago

The EC2 stonith resource uses the AWS API action DescribeInstances to check instance status. DescribeInstances has a max token-bucket size of 50 and refills at 10 tokens per second, shared across the whole account per Region. Because of this small bucket, the stonith resource can easily hit RequestLimitExceeded errors and fail.

This PR addresses that by allowing DescribeInstances calls that hit RequestLimitExceeded to retry until they succeed or the resource timeout is reached. It also removes the curl status spam from the logs while still logging any real errors.
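The retry behavior described above can be sketched roughly like this (a minimal illustration, not the actual cluster-glue code; `describe_instance`, `retry_describe`, and `RESOURCE_TIMEOUT` are hypothetical names):

```shell
#!/bin/sh
# Sketch: retry DescribeInstances while throttled, until a deadline passes.

# Placeholder for the real AWS CLI call; stderr is captured for inspection.
describe_instance() {
    aws ec2 describe-instances --instance-ids "$1" 2>"$ERRFILE"
}

retry_describe() {
    instance_id=$1
    # Give up once the (hypothetical) resource timeout is exceeded.
    deadline=$(( $(date +%s) + ${RESOURCE_TIMEOUT:-60} ))
    ERRFILE=$(mktemp)
    while :; do
        if out=$(describe_instance "$instance_id"); then
            rm -f "$ERRFILE"
            printf '%s\n' "$out"
            return 0
        fi
        if ! grep -q RequestLimitExceeded "$ERRFILE"; then
            cat "$ERRFILE" >&2   # a real error: log it and give up
            rm -f "$ERRFILE"
            return 1
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            rm -f "$ERRFILE"
            return 1             # still throttled at the timeout
        fi
        sleep 1                  # brief pause before the next attempt
    done
}
```

The key point is that only RequestLimitExceeded triggers a retry; any other error is surfaced immediately so genuine failures are not masked.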

Current failure example:

external/ec2(res_AWS_STONITH)[21300]: [21313]: info: status check for i-0661e2e11d6918617 is 
external/ec2(res_AWS_STONITH)[21294]: [21319]: WARN: Already fenced (Instance status = ). Aborting fence attempt.
stonith: external_status: 'ec2 status' failed with rc 1
stonith: external/ec2 device not accessible.
pacemaker-fenced[1974]:  warning: fence_legacy[21287] stderr: [   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current ]
pacemaker-fenced[1974]:  warning: fence_legacy[21287] stderr: [                                  Dload  Upload   Total   Spent    Left  Speed ]
pacemaker-fenced[1974]:  warning: fence_legacy[21287] stderr: [ #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100    19  100    19    0     0  19000      0 --:--:-- --:--:-- --:--:-- 19000 ]
pacemaker-fenced[1974]:  warning: fence_legacy[21287] stderr: [  ]
pacemaker-fenced[1974]:  warning: fence_legacy[21287] stderr: [ An error occurred (RequestLimitExceeded) when calling the DescribeInstances operation (reached max retries: 2): Request limit exceeded. ]
gguifelixamz commented 2 years ago

Hi @Thr3d,

Thank you for this submission and feature.

AWS CLI v1 and v2 have an embedded retry mechanism that covers the RequestLimitExceeded handling you've included.

Isn't the AWS CLI throttling handler enough for this case? If so, I would suggest adding a sleep/timer between calls to avoid being throttled in the first place, and then relying on the AWS CLI to handle the throttling and retries.
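For reference, the CLI's embedded retry behavior mentioned above can be tuned without code changes through its standard configuration environment variables (the instance ID here is just the one from the failure example):

```shell
# Tune the AWS CLI's built-in retry handling. "adaptive" mode adds
# client-side rate limiting on top of the exponential backoff that
# "standard" mode already performs.
export AWS_RETRY_MODE=adaptive
export AWS_MAX_ATTEMPTS=10   # standard/adaptive modes default to 3 attempts

aws ec2 describe-instances --instance-ids i-0661e2e11d6918617
```

The same settings can also live in `~/.aws/config` as `retry_mode` and `max_attempts` under the profile in use.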

Thank you!

Thr3d commented 2 years ago

Somehow I missed that when originally looking into retries for this... thanks for pointing it out. It sounds like that should accomplish the same goal; I'll test it out.

I would suggest to add a sleep/timer between calls to prevent being throttled in first place

The bucket is shared across the whole AWS account in that Region rather than per instance, which means throttling can't be fully avoided just by changing this cluster's behavior. The CLI's retry backoff should help handle that on its own, too.

On a side note, should I make a new PR with just the curl --silent --show-error change to remove the status spam?
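For context, the flag change in question quiets curl's progress meter (the "% Total  % Received ..." lines that pacemaker-fenced relays as warnings in the log above) while keeping genuine errors visible. A small illustrative wrapper (`fetch_quietly` is a hypothetical name, not from the agent):

```shell
# --silent suppresses curl's progress meter; --show-error re-enables the
# error messages that --silent alone would otherwise also hide, so real
# failures still reach stderr and the logs.
fetch_quietly() {
    curl --silent --show-error "$1"
}
```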