Closed dsalaza4 closed 3 months ago
I would prefer to keep this setting hidden, as it is an internal function. I will set the timeout to 90s as a first attempt to allow the function to complete.
Could you please add some details, e.g. where the function dies? Do you have some logs available? I checked my functions and didn't see any errors.
Hi @kayman-mk,
Some of our runners handle hundreds of workers, which is why the function times out before it can finish properly.
I think increasing it to 90 seconds would also work, but it might break in the future if we keep scaling the number of workers a given runner handles.
Here are some logs where you can see how the lambda is terminated due to the timeout before it completes. It basically runs out of time before finding all orphaned instances after a runner restart. The uncomfortable part is that the lambda never reaches the phase where it removes the orphaned workers, forcing us to delete them manually.
{
"Level": "info",
"InstanceId": "i-0a17a06248396aa56",
"Name": "runner-dwzkch7g-ci-worker-integrates-1721263364-81d47cf4",
"LaunchTime": "2024-07-18 00:42:45+00:00",
"Message": "i-0a17a06248396aa56 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
{
"Level": "info",
"InstanceId": "i-0533a7ea72e005f2f",
"Name": null,
"LaunchTime": "2024-07-18 00:42:23+00:00",
"Message": "i-0533a7ea72e005f2f appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
{
"Level": "info",
"InstanceId": "i-0a748b49121c4406f",
"Name": "runner-dwzkch7g-ci-worker-integrates-1721263415-481629be",
"LaunchTime": "2024-07-18 00:43:36+00:00",
"Message": "i-0a748b49121c4406f appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
{
"Level": "info",
"InstanceId": "i-0bbb228f5f099fa78",
"Name": null,
"LaunchTime": "2024-07-18 00:40:30+00:00",
"Message": "i-0bbb228f5f099fa78 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
{
"Level": "info",
"InstanceId": "i-0a59ed6dda58d0906",
"Name": "runner-dwzkch7g-ci-worker-integrates-1721263271-6d116ac2",
"LaunchTime": "2024-07-18 00:41:13+00:00",
"Message": "i-0a59ed6dda58d0906 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
{
"Level": "info",
"InstanceId": "i-03298502a1fff63d5",
"Name": "runner-dwzkch7g-ci-worker-integrates-1721263167-99123ab8",
"LaunchTime": "2024-07-18 00:39:29+00:00",
"Message": "i-03298502a1fff63d5 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
{
"Level": "info",
"InstanceId": "i-05b1f55473aebf29c",
"Name": "runner-mfldqmu8-ci-worker-common-1721263197-c3d32ad5",
"LaunchTime": "2024-07-18 00:39:59+00:00",
"Message": "i-05b1f55473aebf29c appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}
2024-07-18T01:01:04.710Z 7912163d-7283-4050-9136-f94fe43a1d1a Task timed out after 30.08 seconds
END RequestId: 7912163d-7283-4050-9136-f94fe43a1d1a
REPORT RequestId: 7912163d-7283-4050-9136-f94fe43a1d1a Duration: 30077.37 ms Billed Duration: 30000 ms Memory Size: 128 MB Max Memory Used: 90 MB
INIT_START Runtime Version: python:3.11.v37 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:7922becca3524b6503b11e59f33405b21aca377582c11bd54243092ba71b57d5
Urgh, that's not funny. Do you have the time to propose a PR?
Describe the solution you'd like
It looks like the `terminate-agent-hook` lambda timeout is hardcoded to 30 seconds: https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/321efff8365ae7ac3f604d56561f186958d0d608/modules/terminate-agent-hook/main.tf#L39
This makes the lambda stop abruptly when there are many workers turned on.
Suggest a solution
Allow passing `terminate_agent_hook_lambda_timeout` as an argument so users can specify whatever they think is reasonable.
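
A rough sketch of how this could look, not the module's actual code. Only the variable name `terminate_agent_hook_lambda_timeout` and the current 30-second value come from this issue; the submodule variable `lambda_timeout`, the module block name, and the resource name are assumptions for illustration, and all other arguments are left out.

```hcl
# Root module: new input variable (name as proposed in this issue).
variable "terminate_agent_hook_lambda_timeout" {
  description = "Timeout in seconds for the terminate-agent-hook lambda."
  type        = number
  default     = 30 # keeps the currently hardcoded value as the default

  validation {
    # AWS Lambda allows a timeout of at most 900 seconds.
    condition     = var.terminate_agent_hook_lambda_timeout >= 1 && var.terminate_agent_hook_lambda_timeout <= 900
    error_message = "The timeout must be between 1 and 900 seconds."
  }
}

# Root module: pass the value down (module block name is hypothetical).
module "terminate_agent_hook" {
  source = "./modules/terminate-agent-hook"

  # Existing arguments omitted in this fragment.
  lambda_timeout = var.terminate_agent_hook_lambda_timeout
}

# modules/terminate-agent-hook: hypothetical variable replacing the literal 30.
variable "lambda_timeout" {
  description = "Timeout in seconds for the lambda function."
  type        = number
  default     = 30
}

# modules/terminate-agent-hook/main.tf (resource name is an assumption).
resource "aws_lambda_function" "terminate_runner_instances" {
  # Existing arguments (function_name, role, handler, runtime, ...) omitted.
  timeout = var.lambda_timeout
}
```

Keeping 30 as the default would leave existing deployments unchanged, while runners with hundreds of workers could raise the value as needed.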