cattle-ops / terraform-aws-gitlab-runner

Terraform module for AWS GitLab runners on ec2 (spot) instances
https://registry.terraform.io/modules/cattle-ops/gitlab-runner/aws
MIT License

Allow setting timeout for terminate-agent-hook lambda #1149

Closed dsalaza4 closed 3 months ago

dsalaza4 commented 4 months ago

Describe the solution you'd like

It looks like the terminate-agent-hook lambda timeout is hardcoded to 30 seconds.

https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/321efff8365ae7ac3f604d56561f186958d0d608/modules/terminate-agent-hook/main.tf#L39

This causes the lambda to be terminated abruptly when many workers are running.

Suggest a solution

Allow passing terminate_agent_hook_lambda_timeout as an argument so users can specify whatever they think is reasonable.

kayman-mk commented 4 months ago

I prefer to hide this setting as it is an internal function. I will set it to 90s as a first attempt to allow the function to complete.

Could you please add some details, e.g. where the function dies? Do you have some logs available? I checked my functions and didn't see any errors.

dsalaza4 commented 4 months ago

Hi @kayman-mk,

Some of our runners handle hundreds of workers, which is why the function times out before it can finish properly.

I think increasing it to 90 seconds would also work, but it might break in the future if we keep scaling the number of workers a given runner handles.

Here are some logs showing the lambda being terminated by the timeout before it completes. It runs out of time before it has found all orphaned instances after a runner restart. The painful part is that the lambda never reaches the phase where it removes the orphaned workers, forcing us to delete them manually.

{
    "Level": "info",
    "InstanceId": "i-0a17a06248396aa56",
    "Name": "runner-dwzkch7g-ci-worker-integrates-1721263364-81d47cf4",
    "LaunchTime": "2024-07-18 00:42:45+00:00",
    "Message": "i-0a17a06248396aa56 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

{
    "Level": "info",
    "InstanceId": "i-0533a7ea72e005f2f",
    "Name": null,
    "LaunchTime": "2024-07-18 00:42:23+00:00",
    "Message": "i-0533a7ea72e005f2f appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

{
    "Level": "info",
    "InstanceId": "i-0a748b49121c4406f",
    "Name": "runner-dwzkch7g-ci-worker-integrates-1721263415-481629be",
    "LaunchTime": "2024-07-18 00:43:36+00:00",
    "Message": "i-0a748b49121c4406f appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

{
    "Level": "info",
    "InstanceId": "i-0bbb228f5f099fa78",
    "Name": null,
    "LaunchTime": "2024-07-18 00:40:30+00:00",
    "Message": "i-0bbb228f5f099fa78 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

{
    "Level": "info",
    "InstanceId": "i-0a59ed6dda58d0906",
    "Name": "runner-dwzkch7g-ci-worker-integrates-1721263271-6d116ac2",
    "LaunchTime": "2024-07-18 00:41:13+00:00",
    "Message": "i-0a59ed6dda58d0906 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

{
    "Level": "info",
    "InstanceId": "i-03298502a1fff63d5",
    "Name": "runner-dwzkch7g-ci-worker-integrates-1721263167-99123ab8",
    "LaunchTime": "2024-07-18 00:39:29+00:00",
    "Message": "i-03298502a1fff63d5 appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

{
    "Level": "info",
    "InstanceId": "i-05b1f55473aebf29c",
    "Name": "runner-mfldqmu8-ci-worker-common-1721263197-c3d32ad5",
    "LaunchTime": "2024-07-18 00:39:59+00:00",
    "Message": "i-05b1f55473aebf29c appears to be orphaned. Parent runner i-0755e2eb7782c2ff2 is terminated."
}

2024-07-18T01:01:04.710Z 7912163d-7283-4050-9136-f94fe43a1d1a Task timed out after 30.08 seconds

END RequestId: 7912163d-7283-4050-9136-f94fe43a1d1a
REPORT RequestId: 7912163d-7283-4050-9136-f94fe43a1d1a  Duration: 30077.37 ms   Billed Duration: 30000 ms   Memory Size: 128 MB Max Memory Used: 90 MB  
INIT_START Runtime Version: python:3.11.v37 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:7922becca3524b6503b11e59f33405b21aca377582c11bd54243092ba71b57d5
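
The failure mode in the logs above (discovery uses up the whole time budget, so the removal phase never runs) can be illustrated with a minimal sketch. This is not the module's actual handler code: `cleanup_orphans`, `is_orphaned`, `terminate`, `reserve_ms`, and `FakeContext` are hypothetical names introduced here for illustration. The only real API assumed is the `get_remaining_time_in_millis()` method that AWS Lambda provides on the Python context object. The idea is to stop searching while some time is still left, so that the instances found so far are at least terminated.

```python
import time
from typing import Callable, Iterable, List


class FakeContext:
    """Hypothetical stand-in for the Lambda context object.

    Only implements get_remaining_time_in_millis(), the one method used below.
    """

    def __init__(self, timeout_seconds: float) -> None:
        self._deadline = time.monotonic() + timeout_seconds

    def get_remaining_time_in_millis(self) -> int:
        return max(0, int((self._deadline - time.monotonic()) * 1000))


def cleanup_orphans(
    candidates: Iterable[str],
    is_orphaned: Callable[[str], bool],
    terminate: Callable[[List[str]], None],
    context,
    reserve_ms: int = 5000,
) -> List[str]:
    """Collect orphaned instance IDs, stopping early enough to terminate them.

    reserve_ms is an assumed safety margin kept back for the termination call.
    """
    orphans: List[str] = []
    for instance_id in candidates:
        # Stop the discovery phase once only the reserved time is left.
        if context.get_remaining_time_in_millis() <= reserve_ms:
            break
        if is_orphaned(instance_id):
            orphans.append(instance_id)

    # Removal phase: runs even if discovery was cut short.
    if orphans:
        terminate(orphans)
    return orphans


if __name__ == "__main__":
    # Instance IDs taken from the log excerpt; the callbacks are placeholders.
    found = cleanup_orphans(
        ["i-0a17a06248396aa56", "i-0533a7ea72e005f2f"],
        is_orphaned=lambda instance_id: True,
        terminate=lambda ids: print("would terminate:", ids),
        context=FakeContext(timeout_seconds=30),
    )
    print(len(found), "orphaned instances handled")
```

A guard like this only limits the damage of a timeout. With hundreds of workers, a longer (or configurable) timeout as requested in this issue would still be needed for the lambda to find every orphaned instance in one run.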

kayman-mk commented 3 months ago

Urgh, that's not funny. Do you have the time to propose a PR?