buildkite / buildkite-agent-scaler

📈 A lambda for scaling an AutoScalingGroup based on Buildkite metrics
MIT License

ScalingCalculator should use live agents rather than configured AgentsPerInstance #84

Open HongboDu-at opened 1 year ago

HongboDu-at commented 1 year ago

Situation

There is one host running 10 agents. When we gracefully terminate the host, idle agents self-terminate immediately, but busy agents finish their jobs first and then self-terminate. This process can take as long as the running jobs take to finish. ScalingCalculator behaves incorrectly in this situation: https://github.com/buildkite/buildkite-agent-scaler/blob/6fdf0bc30b25050b25b8096e22ff72aeb790df29/scaler/scaler.go#L66

Current behavior

ScalingCalculator thinks there are 10 agents available. In reality, only the busy agents are still alive, so no agent will pick up new jobs.

Expected behavior

ScalingCalculator should use the number of live agents rather than the configured AgentsPerInstance.

triarius commented 1 year ago

Hi @HongboDu-at, thanks for using the buildkite-agent-scaler.

This is an issue that quite a few customers are experiencing. Unfortunately, the metrics endpoint in the API does not expose information about which agents are on which hosts. It may be possible to reconstruct this information by equipping the scaler with a GraphQL token and extracting it from the GraphQL API.

Unfortunately, we don't have any work planned to do this at the moment, but we would welcome any PRs. I suggest making the GraphQL token (and therefore this functionality) optional, as it is not needed by an Elastic Stack at the moment and the token has a very broad scope.

ilyakruchinin commented 5 months ago

We had to fork the repo and update the scaling calculator to achieve this. We will look into submitting a PR, but we are not sure it will be accepted, because the change we made requires support from the Autoscaling Group: a Lifecycle Hook needs to be created to ensure newly launched EC2 instances stay in the "Pending" state for the Lambda until the agents are ready. The Autoscaling Group also lives in a different CloudFormation Stack, so the change requires updates to both repos at once, with a "flag" to enable this feature.

Once we finish all the testing and implementation we'll consider making the PR.

moskyb commented 5 months ago

awesome! looking forward to seeing a PR, if one comes. very happy to consult as necessary.