NHSDigital / terraform-aws-mesh-client

Reference implementation of a MESH Client in an AWS environment only using serverless technologies.
MIT License

ThrottlingException when calling the DescribeExecution operation #123

Open bob-walker-nhs opened 8 months ago

bob-walker-nhs commented 8 months ago

If multiple files are written to a MESH outbound bucket in quick succession, we see these errors in mesh_check_send_parameters, specifically where it calls sfn.describe_execution(executionArn=execution_arn):

{
  "errorMessage": "An error occurred (ThrottlingException) when calling the DescribeExecution operation (reached max retries: 4): Rate exceeded",
  "errorType": "ClientError",
  "requestId": "62ed8b47-518d-4db5-8b45-82e1b9670d9c",
  "stackTrace": [
    "  File \"/var/task/mesh_check_send_parameters_application.py\", line 116, in lambda_handler\n    return app.main(event, context)\n",
    "  File \"/opt/python/spine_aws_common/lambda_application.py\", line 74, in main\n    raise e\n",
    "  File \"/opt/python/spine_aws_common/lambda_application.py\", line 57, in main\n    self.start()\n",
    "  File \"/var/task/mesh_check_send_parameters_application.py\", line 58, in start\n    singleton_check(\n",
    "  File \"/var/task/shared/common.py\", line 67, in singleton_check\n    ex_response = sfn.describe_execution(executionArn=execution_arn)\n",
    "  File \"/opt/python/botocore/client.py\", line 553, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/opt/python/botocore/client.py\", line 1009, in _make_api_call\n    raise error_class(parsed_response, operation_name)\n"
  ]
} 

This is presumably because multiple EventBridge rules fired at the same time and triggered multiple concurrent Step Function invocations.

Suggestion from @matt-mercer that this should maybe be handled tactically with a backoff/retry, but strategically by replacing the API call with a lookup to a lock table in DynamoDB (apologies if I'm misrepresenting what you said, Matt :-D).
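
For illustration, the tactical backoff/retry might look roughly like the sketch below. It is not the repository's code: `call_with_backoff` and `is_throttling_error` are hypothetical helper names, and the attempt count and delays are arbitrary. It relies only on the fact that a botocore `ClientError` carries the error code in `exc.response["Error"]["Code"]`.

```python
import random
import time


def is_throttling_error(exc):
    """Return True if exc looks like a botocore ClientError caused by throttling."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return code in ("ThrottlingException", "TooManyRequestsException")


def call_with_backoff(func, max_attempts=6, base_delay=0.2, max_delay=5.0,
                      sleep=time.sleep):
    """Call func(), retrying throttling errors with exponential backoff and full jitter.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    Non-throttling errors, and the final failed attempt, are re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if not is_throttling_error(exc) or last_attempt:
                raise
            # Full jitter: sleep a random time up to the capped exponential delay,
            # so concurrent executions do not all retry at the same moment.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
```

In `singleton_check` this could wrap the failing call as `call_with_backoff(lambda: sfn.describe_execution(executionArn=execution_arn))`. The error message shows the client had already retried 4 times, so raising botocore's own retry settings (for example `Config(retries={"max_attempts": 10, "mode": "adaptive"})` passed when creating the client) may be a simpler alternative. Either way the Lambda timeout would need to allow for the extra waiting.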

JIRA Ref: https://nhsd-jira.digital.nhs.uk/browse/PCRM-8920
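
The strategic lock-table idea mentioned above could be sketched as a conditional write to DynamoDB, so that no Step Functions API call is needed to decide whether another execution is already running. Everything specific here is an assumption for illustration: the function name `try_acquire_lock`, the attribute names `LockName`, `LockOwner` and `ExpiresAt`, the table layout (partition key `LockName`), and the TTL. The `dynamodb` argument is expected to be a low-level boto3 DynamoDB client.

```python
import time


def try_acquire_lock(dynamodb, table_name, lock_name, owner_id,
                     ttl_seconds=300, now=None):
    """Try to take a named lock; return True if acquired, False if already held.

    Hypothetical sketch: the write succeeds only if no item exists for
    lock_name, or the existing item's ExpiresAt is in the past.
    """
    now = int(time.time()) if now is None else int(now)
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={
                "LockName": {"S": lock_name},
                "LockOwner": {"S": owner_id},
                "ExpiresAt": {"N": str(now + ttl_seconds)},
            },
            # Atomic check-and-write: fails if a live lock is already present.
            ConditionExpression="attribute_not_exists(LockName) OR ExpiresAt < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except Exception as exc:
        code = (getattr(exc, "response", None) or {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False  # someone else holds the lock
        raise
```

A caller might use the mailbox ID as `lock_name` and the execution ARN as `owner_id`, and release the lock at the end of the state machine with a `delete_item` conditioned on `LockOwner`. The expiry guards against a lock being left behind if an execution fails before releasing it.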

davidhallam4-nhs commented 7 months ago

Picking up this ticket