If multiple files are written to a MESH outbound bucket in quick succession, we see these errors in `mesh_check_send_parameters`, specifically where it calls `sfn.describe_execution(executionArn=execution_arn)`:
{
"errorMessage": "An error occurred (ThrottlingException) when calling the DescribeExecution operation (reached max retries: 4): Rate exceeded",
"errorType": "ClientError",
"requestId": "62ed8b47-518d-4db5-8b45-82e1b9670d9c",
"stackTrace": [
" File \"/var/task/mesh_check_send_parameters_application.py\", line 116, in lambda_handler\n return app.main(event, context)\n",
" File \"/opt/python/spine_aws_common/lambda_application.py\", line 74, in main\n raise e\n",
" File \"/opt/python/spine_aws_common/lambda_application.py\", line 57, in main\n self.start()\n",
" File \"/var/task/mesh_check_send_parameters_application.py\", line 58, in start\n singleton_check(\n",
" File \"/var/task/shared/common.py\", line 67, in singleton_check\n ex_response = sfn.describe_execution(executionArn=execution_arn)\n",
" File \"/opt/python/botocore/client.py\", line 553, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
" File \"/opt/python/botocore/client.py\", line 1009, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
]
}
This is presumably because multiple EventBridge rules fired at the same time and triggered multiple concurrent Step Functions executions, each of which calls DescribeExecution during the singleton check.
Suggestion from @matt-mercer that this should maybe be handled tactically with a backoff/retry but strategically by replacing the API call with a lookup to a lock table in DynamoDB (apologies if I'm misrepresenting what you said Matt :-D).
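For the tactical option, a rough sketch of what a backoff/retry wrapper might look like is below. This is not the existing code in `shared/common.py`; the helper names `is_throttling_error` and `call_with_backoff`, and the attempt and delay values, are assumptions for illustration. It relies only on the fact that a botocore `ClientError` carries the error code in `exc.response["Error"]["Code"]`.

```python
import random
import time


def is_throttling_error(exc):
    """Return True if exc looks like a botocore ClientError caused by throttling."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return code in ("ThrottlingException", "TooManyRequestsException")


def call_with_backoff(func, max_attempts=6, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep):
    """Call func(), retrying throttling errors with exponential backoff and full jitter.

    max_attempts, base_delay and max_delay are illustrative defaults, not tuned values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not throttling, or if we are out of attempts.
            if not is_throttling_error(exc) or attempt == max_attempts:
                raise
            # Full jitter: pick a random delay up to the capped exponential bound,
            # so concurrent executions do not all retry at the same moment.
            upper_bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper_bound))
```

In `singleton_check` this could be used as `ex_response = call_with_backoff(lambda: sfn.describe_execution(executionArn=execution_arn))`. The "reached max retries: 4" in the error suggests the client is on botocore's default retry settings, so another possibility is to create the client with `botocore.config.Config(retries={"max_attempts": 10, "mode": "adaptive"})`, though neither approach removes the underlying contention on DescribeExecution.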
JIRA Ref: https://nhsd-jira.digital.nhs.uk/browse/PCRM-8920