Open HongboDu-at opened 1 year ago
Hi @HongboDu-at, thanks for using the buildkite-agent-scaler.
This is an issue that quite a few customers are experiencing. Unfortunately, the metrics endpoint in the API does not expose information about which agents are on which hosts. It may be possible to reconstruct this information by equipping the scaler with a GraphQL token and extracting it from the GraphQL API.
Unfortunately, we don't have any work planned to do this at the moment, but we would welcome any PRs. I suggest making the GraphQL token (and therefore this functionality) optional, as it is not needed by an Elastic Stack at the moment and has a very broad scope.
We had to fork the repo and update the scaling calculator to achieve this. We will look into submitting a PR, but we are not sure it will be accepted, because the change requires support from the Autoscaling Group: a Lifecycle Hook needs to be created so that newly launched EC2 instances stay in the "Pending" state, as far as the Lambda is concerned, until their agents are ready. The Autoscaling Group also lives in a different CloudFormation Stack, so the change has to land in both repos at once, behind a "flag" that enables the feature.
Once we finish all the testing and implementation we'll consider making the PR.
awesome! looking forward to seeing a PR, if one comes. very happy to consult as necessary.
Situation
There is one host, which has 10 agents. When we gracefully terminate the host, idle agents self-terminate immediately, but busy agents finish their jobs first and then self-terminate. This process can take as long as the longest running job takes to finish. ScalingCalculator behaves incorrectly in this situation. https://github.com/buildkite/buildkite-agent-scaler/blob/6fdf0bc30b25050b25b8096e22ff72aeb790df29/scaler/scaler.go#L66
Current behavior
ScalingCalculator thinks there are 10 agents available, but in reality only the busy agents are still alive. New jobs will not be picked up by any agent.
Expected behavior
ScalingCalculator should use the number of live agents rather than the configured AgentsPerInstance.
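
To make the difference concrete, here is a minimal Go sketch of the two approaches. It is not the scaler's actual code: `desiredFromConfig`, `desiredFromLiveAgents`, and their parameters are hypothetical names made up for illustration, the config-based formula is a simplified stand-in for the behavior described above, and it assumes a live idle-agent count is available from some source. It only covers scaling out.

```go
package main

import "fmt"

// ceilDiv returns ceil(a / b) for a >= 0 and b > 0.
func ceilDiv(a, b int64) int64 {
	return (a + b - 1) / b
}

// desiredFromConfig (hypothetical) sizes the group on the assumption that
// every instance contributes agentsPerInstance usable agents.
func desiredFromConfig(scheduledJobs, runningJobs, agentsPerInstance int64) int64 {
	return ceilDiv(scheduledJobs+runningJobs, agentsPerInstance)
}

// desiredFromLiveAgents (hypothetical) keeps the current instances and adds
// capacity only for scheduled jobs that no live idle agent can pick up.
func desiredFromLiveAgents(scheduledJobs, idleAgents, currentInstances, agentsPerInstance int64) int64 {
	uncovered := scheduledJobs - idleAgents
	if uncovered < 0 {
		uncovered = 0
	}
	return currentInstances + ceilDiv(uncovered, agentsPerInstance)
}

func main() {
	// Scenario from this issue, with made-up numbers: one draining host
	// configured for 10 agents, 3 agents still busy, 7 idle agents already
	// exited, and 5 newly scheduled jobs.
	fmt.Println(desiredFromConfig(5, 3, 10))        // 1: no scale-out, new jobs wait
	fmt.Println(desiredFromLiveAgents(5, 0, 1, 10)) // 2: one extra instance is requested
}
```

With the config-based count, the 8 jobs appear to fit on the single existing instance, so nothing is launched even though no agent is free. Counting live idle agents instead shows 5 uncovered jobs and asks for one more instance.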