buildkite / buildkite-agent-scaler

📈A lambda for scaling an AutoScalingGroup based on Buildkite metrics
MIT License

AccessDenied calling autoscaling:DescribeScalingActivities #68

Closed kwong-chong closed 2 years ago

kwong-chong commented 2 years ago

Attempts by the autoscaling function to call autoscaling:DescribeScalingActivities result in an AccessDenied error.

CloudWatch Logs of the autoscaling function where this problem is seen:

START RequestId: 8f114af1-b397-4e2f-a551-48a80c2f17d1 Version: $LATEST
2022/07/28 06:01:52 buildkite-agent-scaler version 1.3.1 dev
2022/07/28 06:01:52 Failed to retrieve last scaling activity events due to error (AccessDenied: User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/buildkite-5-11-0-test-Autoscaling-CV-ExecutionRole-X4NRVJKLN5LX/buildkite-5-11-0-test-Autoscal-AutoscalingFunction-eAfNzknpQWKz is not authorized to perform: autoscaling:DescribeScalingActivities because no identity-based policy allows the autoscaling:DescribeScalingActivities action
status code: 403, request id: b14c2b04-8814-4844-99bc-bed20fc4f94b)
2022/07/28 06:01:52 Publishing cloudwatch metrics
2022/07/28 06:01:52 Disabling scale-in 🙅🏼‍
2022/07/28 06:01:52 Collecting Buildkite metrics for queue "buildkite-5-11-0-test"
2022/07/28 06:01:52 ↳ Agents: idle=0, busy=0, total=0
2022/07/28 06:01:52 ↳ Jobs: scheduled=0, running=0, waiting=0 (took 227.200934ms)
2022/07/28 06:01:52 Publishing metric RunningJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:01:52 Publishing metric WaitingJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:01:52 Publishing metric ScheduledJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:01:52 Collecting AutoScaling details for ASG "buildkite-5-11-0-test-AgentAutoScaleGroup-1JP1337S0PGK5"
2022/07/28 06:01:53 ↳ Got pending=0, desired=0, min=0, max=10 (took 97.906699ms)
2022/07/28 06:01:53 Calculating desired instance count for Buildkite Jobs
2022/07/28 06:01:53 ↳ 🧮 Agents required 0, Instances required 0
2022/07/28 06:01:53 No scaling required, currently 0
2022/07/28 06:01:53 Waiting for 10s
2022/07/28 06:02:03 Publishing cloudwatch metrics
2022/07/28 06:02:03 Disabling scale-in 🙅🏼‍
2022/07/28 06:02:03 Collecting Buildkite metrics for queue "buildkite-5-11-0-test"
2022/07/28 06:02:03 ↳ Agents: idle=0, busy=0, total=0
2022/07/28 06:02:03 ↳ Jobs: scheduled=0, running=0, waiting=0 (took 266.151543ms)
2022/07/28 06:02:03 Publishing metric ScheduledJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:03 Publishing metric RunningJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:03 Publishing metric WaitingJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:03 Collecting AutoScaling details for ASG "buildkite-5-11-0-test-AgentAutoScaleGroup-1JP1337S0PGK5"
2022/07/28 06:02:03 ↳ Got pending=0, desired=0, min=0, max=10 (took 126.569585ms)
2022/07/28 06:02:03 Calculating desired instance count for Buildkite Jobs
2022/07/28 06:02:03 ↳ 🧮 Agents required 0, Instances required 0
2022/07/28 06:02:03 No scaling required, currently 0
2022/07/28 06:02:03 Waiting for 10s
2022/07/28 06:02:13 Publishing cloudwatch metrics
2022/07/28 06:02:13 Disabling scale-in 🙅🏼‍
2022/07/28 06:02:13 Collecting Buildkite metrics for queue "buildkite-5-11-0-test"
2022/07/28 06:02:13 ↳ Agents: idle=0, busy=0, total=0
2022/07/28 06:02:13 ↳ Jobs: scheduled=0, running=0, waiting=0 (took 231.416245ms)
2022/07/28 06:02:13 Publishing metric ScheduledJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:13 Publishing metric RunningJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:13 Publishing metric WaitingJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:13 Collecting AutoScaling details for ASG "buildkite-5-11-0-test-AgentAutoScaleGroup-1JP1337S0PGK5"
2022/07/28 06:02:13 ↳ Got pending=0, desired=0, min=0, max=10 (took 115.393314ms)
2022/07/28 06:02:13 Calculating desired instance count for Buildkite Jobs
2022/07/28 06:02:13 ↳ 🧮 Agents required 0, Instances required 0
2022/07/28 06:02:13 No scaling required, currently 0
2022/07/28 06:02:13 Waiting for 10s
2022/07/28 06:02:24 Publishing cloudwatch metrics
2022/07/28 06:02:24 Disabling scale-in 🙅🏼‍
2022/07/28 06:02:24 Collecting Buildkite metrics for queue "buildkite-5-11-0-test"
2022/07/28 06:02:24 ↳ Agents: idle=0, busy=0, total=0
2022/07/28 06:02:24 ↳ Jobs: scheduled=0, running=0, waiting=0 (took 219.99187ms)
2022/07/28 06:02:24 Publishing metric ScheduledJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:24 Publishing metric RunningJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:24 Publishing metric WaitingJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:24 Collecting AutoScaling details for ASG "buildkite-5-11-0-test-AgentAutoScaleGroup-1JP1337S0PGK5"
2022/07/28 06:02:24 ↳ Got pending=0, desired=0, min=0, max=10 (took 91.938495ms)
2022/07/28 06:02:24 Calculating desired instance count for Buildkite Jobs
2022/07/28 06:02:24 ↳ 🧮 Agents required 0, Instances required 0
2022/07/28 06:02:24 No scaling required, currently 0
2022/07/28 06:02:24 Waiting for 10s
2022/07/28 06:02:34 Publishing cloudwatch metrics
2022/07/28 06:02:34 Disabling scale-in 🙅🏼‍
2022/07/28 06:02:34 Collecting Buildkite metrics for queue "buildkite-5-11-0-test"
2022/07/28 06:02:34 ↳ Agents: idle=0, busy=0, total=0
2022/07/28 06:02:34 ↳ Jobs: scheduled=0, running=0, waiting=0 (took 220.285862ms)
2022/07/28 06:02:34 Publishing metric WaitingJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:34 Publishing metric ScheduledJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:34 Publishing metric RunningJobsCount=0 [org=our-org,queue=buildkite-5-11-0-test]
2022/07/28 06:02:34 Collecting AutoScaling details for ASG "buildkite-5-11-0-test-AgentAutoScaleGroup-1JP1337S0PGK5"
2022/07/28 06:02:34 ↳ Got pending=0, desired=0, min=0, max=10 (took 109.298474ms)
2022/07/28 06:02:34 Calculating desired instance count for Buildkite Jobs
2022/07/28 06:02:34 ↳ 🧮 Agents required 0, Instances required 0
2022/07/28 06:02:34 No scaling required, currently 0
2022/07/28 06:02:34 Waiting for 10s
END RequestId: 8f114af1-b397-4e2f-a551-48a80c2f17d1
REPORT RequestId: 8f114af1-b397-4e2f-a551-48a80c2f17d1  Duration: 52242.15 ms   Billed Duration: 52243 ms   Memory Size: 128 MB Max Memory Used: 49 MB  

The same error also shows up in CloudTrail:

    "errorCode": "AccessDenied",
    "errorMessage": "User: arn:aws:sts::XXXXXXXXXXXX:assumed-role/buildkite-5-11-0-test-Autoscaling-CV-ExecutionRole-X4NRVJKLN5LX/buildkite-5-11-0-test-Autoscal-AutoscalingFunction-eAfNzknpQWKz is not authorized to perform: autoscaling:DescribeScalingActivities because no identity-based policy allows the autoscaling:DescribeScalingActivities action",

autoscaling:DescribeScalingActivities appears to be where this is called.

Perhaps this section is missing permissions.
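For illustration only, here is a minimal sketch of the kind of identity-based policy statement the error message says is missing. The helper name `missing_permission_policy` is hypothetical, and the `"Resource": "*"` value rests on the assumption that this Describe call does not support resource-level permissions. This is not the project's actual template.

```python
import json


def missing_permission_policy() -> dict:
    """Build a policy document granting only the action named in the error.

    Hypothetical helper for illustration; not taken from the scaler's template.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["autoscaling:DescribeScalingActivities"],
                # Assumption: this Describe* action has no resource-level
                # permissions, so the statement uses a wildcard resource.
                "Resource": "*",
            }
        ],
    }


policy = missing_permission_policy()
# Print the document so it can be compared against the role's existing policy.
print(json.dumps(policy, indent=2))
```

Comparing this output against the execution role's attached policies should show whether the action is actually absent.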

Not sure how it affects the functionality of the agent; it doesn't seem to complain about anything else.

The problem for us, though, is that the frequency of AccessDenied errors in CloudTrail is triggering alerts. Unless we can suppress these (we're looking into it), we will not be able to upgrade our agents to versions that use agent scaler releases containing this bug.
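As a stopgap, one way to suppress only this known noise is to match on the CloudTrail record fields before alerting. The sketch below assumes the alerting pipeline hands over CloudTrail records as dictionaries. The function name `is_known_scaler_denial` and the `role_fragment` default are hypothetical and would need adapting to the real role naming.

```python
from typing import Any, Mapping


def is_known_scaler_denial(
    record: Mapping[str, Any],
    role_fragment: str = "AutoscalingFunction",  # assumption: part of the Lambda's session name
) -> bool:
    """Return True only for the specific AccessDenied event described above."""
    arn = record.get("userIdentity", {}).get("arn", "")
    return (
        record.get("errorCode") == "AccessDenied"
        and record.get("eventSource") == "autoscaling.amazonaws.com"
        and record.get("eventName") == "DescribeScalingActivities"
        and role_fragment in arn
    )


# Example record shaped like the CloudTrail entry quoted earlier (account ID redacted).
sample = {
    "eventSource": "autoscaling.amazonaws.com",
    "eventName": "DescribeScalingActivities",
    "errorCode": "AccessDenied",
    "userIdentity": {
        "arn": "arn:aws:sts::XXXXXXXXXXXX:assumed-role/"
        "buildkite-5-11-0-test-Autoscaling-CV-ExecutionRole-X4NRVJKLN5LX/"
        "buildkite-5-11-0-test-Autoscal-AutoscalingFunction-eAfNzknpQWKz"
    },
}

# A denial for a different action must still raise an alert.
other = dict(sample, eventName="SetDesiredCapacity")

print(is_known_scaler_denial(sample))
print(is_known_scaler_denial(other))
```

Matching on the event source, action, error code, and caller together keeps the filter narrow, so any other AccessDenied from the same role still alerts.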

Parameters used to create the stack:

Parameters (56)

Key                                      Value                               Resolved value
AgentEnvFileUrl                          -                                   -
AgentsPerInstance                        1                                   -
ArtifactsBucket                          -                                   -
AssociatePublicIpAddress                 false                               -
AuthorizedUsersUrl                       -                                   -
AvailabilityZones                        -                                   -
BootstrapScriptUrl                       -                                   -
BuildkiteAdditionalSudoPermissions       -                                   -
BuildkiteAgentExperiments                -                                   -
BuildkiteAgentRelease                    stable                              -
BuildkiteAgentTags                       -                                   -
BuildkiteAgentTimestampLines             false                               -
BuildkiteAgentToken                      ****                                -
BuildkiteAgentTokenParameterStoreKMSKey  alias/aws/ssm                       -
BuildkiteAgentTokenParameterStorePath    /buildkite/token                    -
BuildkiteAgentTracingBackend             -                                   -
BuildkiteQueue                           buildkite-5-11-0-test               -
BuildkiteTerminateInstanceAfterJob       false                               -
BuildkiteWindowsAdministrator            true                                -
CostAllocationTagName                    CreatedBy                           -
CostAllocationTagValue                   buildkite-elastic-ci-stack-for-aws  -
ECRAccessPolicy                          none                                -
EnableAgentGitMirrorsExperiment          false                               -
EnableCostAllocationTags                 false                               -
EnableDetailedMonitoring                 false                               -
EnableDockerExperimental                 false                               -
EnableDockerLoginPlugin                  true                                -
EnableDockerUserNamespaceRemap           true                                -
EnableECRPlugin                          true                                -
EnableInstanceStorage                    false                               -
EnableSecretsPlugin                      true                                -
IMDSv2Tokens                             optional                            -
ImageId                                  -                                   -
ImageIdParameter                         -                                   -
InstanceCreationTimeout                  -                                   -
InstanceOperatingSystem                  linux                               -
InstanceRoleName                         -                                   -
InstanceRolePermissionsBoundaryARN       -                                   -
InstanceType                             t3.large                            -
KeyName                                  -                                   -
ManagedPolicyARN                         -                                   -
MaxSize                                  10                                  -
MinSize                                  0                                   -
OnDemandPercentage                       0                                   -
RootVolumeName                           -                                   -
RootVolumeSize                           250                                 -
RootVolumeType                           gp3                                 -
ScaleInIdlePeriod                        600                                 -
ScaleOutFactor                           1.0                                 -
ScaleOutForWaitingJobs                   false                               -
SecretsBucket                            -                                   -
SecretsBucketRegion                      -                                   -
SecurityGroupId                          -                                   -
SpotPrice                                0                                   -
Subnets                                  subnet-0afc0752fdd7f5059,subnet-0d71649fa185a56e3,subnet-0a51450b4eea6938b  -
VpcId                                    vpc-01c4e97f8a4899c85               -

Thanks, Kwong.

moskyb commented 2 years ago

Related: https://github.com/buildkite/buildkite-agent-scaler/pull/61

I'll merge that and do a deploy, which should solve the issue.