cloudspannerecosystem / autoscaler

Automatically scale the capacity of your Spanner instances based on their utilization.
Apache License 2.0

Best practice poller schedule? #344

Closed: alexlo03 closed this issue 2 months ago

alexlo03 commented 2 months ago

Hello,

Out of the box the poller function runs every 2m, but then queries metrics for only the last 60s, so the poller could miss a spike in the 60s that is not examined. Is this recommended or should the poller schedule be updated to once per minute?

davidcueva commented 2 months ago

Hi Alex

The period key/value pair in the metrics does not limit the query to the last 60s. It is actually the alignment period, which is used to divide the time series into buckets, and then calculate a single value for each of those periods/buckets using the selected aligner. Here's more information.

https://github.com/cloudspannerecosystem/autoscaler/blob/4248b1a1400baa06cd99b0d8ca25a37dc52e706f/src/poller/poller-core/index.js#L98-L99
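To illustrate how an alignment period buckets a time series, here is a minimal sketch of a max aligner. It is a simplified model, not the Cloud Monitoring implementation: the function name `alignMax`, the `{timeSec, value}` point shape, and the sample values are all hypothetical, and the window is assumed to span five 60s periods.

```javascript
// Simplified model of a max aligner: split [startSec, endSec) into
// fixed-size buckets and keep the largest value seen in each bucket.
// (Hypothetical helper, not part of the autoscaler or the Monitoring API.)
function alignMax(points, startSec, endSec, periodSec) {
  const bucketCount = Math.ceil((endSec - startSec) / periodSec);
  const buckets = new Array(bucketCount).fill(null);
  for (const {timeSec, value} of points) {
    if (timeSec < startSec || timeSec >= endSec) continue; // outside the window
    const i = Math.floor((timeSec - startSec) / periodSec);
    buckets[i] = buckets[i] === null ? value : Math.max(buckets[i], value);
  }
  return buckets;
}

// Made-up samples over a 300s window, with a spike at t=130s.
const samples = [
  {timeSec: 10, value: 0.30},
  {timeSec: 70, value: 0.35},
  {timeSec: 130, value: 0.92},
  {timeSec: 170, value: 0.40},
  {timeSec: 200, value: 0.33},
  {timeSec: 250, value: 0.31},
];
const aligned = alignMax(samples, 0, 300, 60);
console.log(aligned); // one max value per 60s bucket; the spike lands in the third
```

In this model the 60s value only controls the bucket size. Which samples are examined at all is decided by the window boundaries.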

David

alexlo03 commented 2 months ago

@davidcueva Thanks for the answer. Sorry for the dumb questions - does this describe the behavior:

Every 2m the poller runs, and the metrics query covers:

A 5-minute interval (aka window), assuming the default 60s period,

so there are 5 periods of 60s, and it gets the max value for each period. The reducer is a no-op because there is only one data series. The group-by on location means that one series is returned per location in the case of a multi-regional instance.

The code gets back the data points and iterates through them identifying the max value it saw and which location/region it came from.
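The iteration described above can be sketched roughly as follows. This is an illustration, not the autoscaler's code: the function name `maxAcrossLocations` is hypothetical, the input shape is assumed to mirror the `resource.labels` and `points[].value.doubleValue` fields of a Cloud Monitoring time series, and the numbers are made up.

```javascript
// Walk every point of every series and remember the largest value
// together with the location label of the series it came from.
function maxAcrossLocations(timeSeries) {
  let best = {value: -Infinity, location: null};
  for (const series of timeSeries) {
    const location = series.resource.labels.location;
    for (const point of series.points) {
      const v = point.value.doubleValue;
      if (v > best.value) best = {value: v, location};
    }
  }
  return best;
}

// Hypothetical response for a multi-regional instance: one series per location.
const response = [
  {
    resource: {labels: {location: 'us-central1'}},
    points: [{value: {doubleValue: 0.41}}, {value: {doubleValue: 0.88}}],
  },
  {
    resource: {labels: {location: 'us-east1'}},
    points: [{value: {doubleValue: 0.52}}, {value: {doubleValue: 0.47}}],
  },
];
const peak = maxAcrossLocations(response);
console.log(peak);
```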

Is this accurate?

==

This means that if you have a high utilization value, it will be in two consecutive runs - example log:

[Image: Screenshot 2024-07-09 at 11 00 43]

and the scaleOutCoolingMinutes is the only thing that stops a double scaling
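The overlap between consecutive runs can be checked with a little arithmetic. The sketch below is a simplified model: it assumes runs fire at exact multiples of the schedule and each run looks back over a fixed window ending at its own start time. The function name `runsThatSee` and the timings are hypothetical.

```javascript
// List the run times (in seconds) whose look-back window contains the spike.
// Runs fire at 0, scheduleSec, 2*scheduleSec, ... up to horizonSec.
function runsThatSee(spikeSec, scheduleSec, windowSec, horizonSec) {
  const runs = [];
  for (let t = 0; t <= horizonSec; t += scheduleSec) {
    // A run at time t examines the interval (t - windowSec, t].
    if (spikeSec > t - windowSec && spikeSec <= t) runs.push(t);
  }
  return runs;
}

// 2-minute schedule, 5-minute window: the spike at t=130s is seen by
// the runs at 240s and 360s.
const withFiveMinuteWindow = runsThatSee(130, 120, 300, 600);
// 2-minute schedule, 60s window: the same spike would be missed.
const withOneMinuteWindow = runsThatSee(130, 120, 60, 600);
console.log(withFiveMinuteWindow, withOneMinuteWindow);
```

Under these assumptions, a window longer than the schedule interval is what lets the same data point appear in more than one run.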

== side note

If anyone is interested you can do the query: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.timeSeries/list

metric.type = "spanner.googleapis.com/instance/cpu/utilization_by_priority" AND resource.labels.instance_id="instance_name" AND metric.label.priority="high"
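For anyone who wants to script that query, here is a minimal sketch that only builds the request URL for `projects.timeSeries.list`. The helper name `buildTimeSeriesListUrl`, the project and instance names, and the choice of `ALIGN_MAX` with a 60s alignment period over a 5-minute window are assumptions for illustration. Sending the request and authenticating are left out.

```javascript
// Build a projects.timeSeries.list URL for high-priority Spanner CPU utilization.
// (Hypothetical helper; it does not send the request or handle auth.)
function buildTimeSeriesListUrl(projectId, instanceId, endTime, windowSec, periodSec) {
  const startTime = new Date(endTime.getTime() - windowSec * 1000);
  const filter = [
    'metric.type = "spanner.googleapis.com/instance/cpu/utilization_by_priority"',
    `resource.labels.instance_id = "${instanceId}"`,
    'metric.label.priority = "high"',
  ].join(' AND ');
  const params = new URLSearchParams({
    filter,
    'interval.startTime': startTime.toISOString(),
    'interval.endTime': endTime.toISOString(),
    'aggregation.alignmentPeriod': `${periodSec}s`,
    'aggregation.perSeriesAligner': 'ALIGN_MAX',
  });
  return `https://monitoring.googleapis.com/v3/projects/${projectId}/timeSeries?${params}`;
}

// Placeholder project and instance names.
const url = buildTimeSeriesListUrl(
  'my-project', 'instance_name', new Date('2024-07-09T10:00:00Z'), 300, 60);
console.log(url);
```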

[Image: Screenshot 2024-07-09 at 10 15 06]

davidcueva commented 2 months ago

There are no dumb questions.

Yes, that describes the current behaviour, but one important detail is that scaleOutCoolingMinutes is not the only thing that stops a double scaling. We register the Long Running Operation (LRO) returned by the scaling request, and we block new scaling requests while an LRO is ongoing. Even without this check, if you send multiple scaling requests to Spanner while a scaling process is already ongoing, Spanner will reject the new request.
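The two guards can be pictured with a small sketch. This is not the autoscaler's actual code: the function name `canScaleOut` and the state fields `ongoingOperationId` and `lastScalingMs` are hypothetical stand-ins for whatever the scaler persists between runs.

```javascript
// Decide whether a new scale-out request may be sent.
// Guard 1: an LRO from a previous scaling request is still running.
// Guard 2: the last scaling happened less than scaleOutCoolingMinutes ago.
function canScaleOut(state, nowMs, scaleOutCoolingMinutes) {
  if (state.ongoingOperationId) {
    return {allowed: false, reason: 'ongoing LRO'};
  }
  const coolingMs = scaleOutCoolingMinutes * 60 * 1000;
  if (nowMs - state.lastScalingMs < coolingMs) {
    return {allowed: false, reason: 'within cooldown'};
  }
  return {allowed: true, reason: 'ok'};
}

// Hypothetical states, with times in milliseconds.
const duringLro = canScaleOut({ongoingOperationId: 'op-123', lastScalingMs: 0}, 600000, 5);
const coolingDown = canScaleOut({ongoingOperationId: null, lastScalingMs: 480000}, 600000, 5);
const clear = canScaleOut({ongoingOperationId: null, lastScalingMs: 0}, 600000, 5);
console.log(duringLro, coolingDown, clear);
```

Checking the LRO first means a request is blocked while an operation is in flight even when the cooldown has already elapsed.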