cloudspannerecosystem / autoscaler

Automatically scale the capacity of your Spanner instances based on their utilization.
Apache License 2.0

Feature Proposal: Asymmetric Scaling Policy #357

Open · alexlo03 opened this issue 1 month ago

alexlo03 commented 1 month ago

Background

The current autoscaler logic is roughly as follows (a sketch of the sizing step is included after the list):

  1. Collect metrics over a time interval (default: last 5m)
    1. Over that interval, take the max value of each metric (High Priority CPU, 24H CPU, Storage)
  2. Determine the appropriate Spanner size
    1. maxSuggestedSize = min_size
    2. Count Spanner databases: maxSuggestedSize = size implied by the number of databases
    3. For each metric, determine a proposed size (subject to scaleInLimit in the LINEAR method), with rounding
    4. maxSuggestedSize = max of all metric proposals
    5. Final proposal = min(maxSuggestedSize, max_size)
  3. Check the final proposal and scale
    1. If a scaling operation is already in progress: exit
    2. If within the cooldown period: exit
    3. Perform the scaling event
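
In case it helps frame the discussion, here is a minimal TypeScript sketch of what step 2 amounts to. Names like `suggestSize` and `MetricProposal` are mine for illustration, not the autoscaler's actual identifiers:

```typescript
// Hypothetical sketch of step 2 above; names are illustrative only.
interface MetricProposal {
  name: string;            // e.g. "high_priority_cpu", "rolling_24_hr", "storage"
  suggestedSize: number;   // size implied by this metric (rounded, scaleInLimit already applied)
}

function suggestSize(
  minSize: number,
  maxSize: number,
  sizeImpliedByDatabaseCount: number,
  proposals: MetricProposal[],
): number {
  // Start from the configured minimum, then the floor implied by the number of databases.
  let maxSuggestedSize = Math.max(minSize, sizeImpliedByDatabaseCount);

  // Greedy: every metric proposes a size; take the largest.
  for (const p of proposals) {
    maxSuggestedSize = Math.max(maxSuggestedSize, p.suggestedSize);
  }

  // Never exceed the configured maximum.
  return Math.min(maxSuggestedSize, maxSize);
}
```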

Problems

  1. The autoscaler logic is greedy, which can lead to scaling "bouncing": if you have a "calmer" 5m period, it will scale you in. Spanner scaling events are not smooth experiences, and we'd like to avoid them when possible. Here is an important production database (this example is from today): [Screenshot 2024-07-16 at 09:50:40]

We don't want that scale-in behavior, but we do want the immediate scale-out behavior. We'd prefer an "Asymmetric Policy", for example:

Scale-in policy: "only scale in when it is clearly a good idea." If every evaluation over the last 1h/2h/4h wanted to scale in, without exception, then scale in.

Scale-out policy: "scale out whenever things get hot." If the last 5m shows that more capacity is needed, go ahead and scale out.

This is not easily expressible right now.
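
For illustration only, a sketch of how that asymmetry could be expressed. The window lengths, the `sizeOverWindow` helper, and the function names are all made up for the example:

```typescript
// Illustrative only: an asymmetric decision based on two lookback windows.
// sizeOverWindow is a hypothetical helper that returns the suggested size at
// each evaluation within the given window.
const SCALE_OUT_WINDOW_MINUTES = 5;    // react quickly when hot
const SCALE_IN_WINDOW_MINUTES = 120;   // e.g. 1h/2h/4h, configurable

function decideScaling(
  currentSize: number,
  sizeOverWindow: (windowMinutes: number) => number[],
): number {
  // Scale out: the most recent short window is enough.
  const recent = sizeOverWindow(SCALE_OUT_WINDOW_MINUTES);
  if (recent.length > 0) {
    const recentMax = Math.max(...recent);
    if (recentMax > currentSize) {
      return recentMax;  // scale out immediately
    }
  }

  // Scale in: only if *every* evaluation over the long window wanted a smaller size.
  const long = sizeOverWindow(SCALE_IN_WINDOW_MINUTES);
  if (long.length > 0 && long.every((s) => s < currentSize)) {
    return Math.max(...long);  // conservative: largest size suggested in the window
  }

  return currentSize;  // otherwise, hold
}
```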

  2. Asymmetric Metrics Error Handling: in the case where metrics are bad (see https://github.com/cloudspannerecosystem/autoscaler/issues/355), we'd want different behavior for scale-out vs. scale-in. If you get an incomplete metric set (say a CPU metric reads zero), the possibility of scaling in should be discarded entirely (fail static). If a single metric does return a signal that scaling out is warranted, then scale-out should happen; an incomplete signal can be enough to scale out. Again, this could be treated specially in a Custom Scaling Method. A sketch of this asymmetry follows.
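
A rough sketch of the fail-static asymmetry described above (again, all names here are hypothetical, not the project's API):

```typescript
// Illustrative fail-static handling of an incomplete metric set.
interface MetricSample {
  name: string;
  value: number | null;  // null / 0 standing in for a missing or bad reading
}

function metricsCompleteEnoughToScaleIn(samples: MetricSample[]): boolean {
  // Any missing or zeroed metric discards the possibility of scaling in (fail static).
  return samples.every((s) => s.value !== null && s.value > 0);
}

function anyMetricDemandsScaleOut(
  samples: MetricSample[],
  thresholds: Record<string, number>,
): boolean {
  // A single metric over its threshold is enough to justify scaling out,
  // even if other metrics are missing.
  return samples.some(
    (s) => s.value !== null && s.value > (thresholds[s.name] ?? Infinity),
  );
}
```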

Summary

I think I can make this work using a Custom Scaling Method, and maybe I will do that, but users of this project probably want the same things I do, so addressing this in the core project would be a good idea. Thanks.

==

Side note: how the "Want Scale To" log metric works:

Examples:
            storage=0.5885764076510477%, BELOW the range [70%-80%] => however, cannot scale to 100 because it is lower than MIN 6000 PROCESSING_UNITS
            high_priority_cpu=7.566610239183218%, BELOW the range [60%-70%] => however, cannot scale to 700 because it is lower than MIN 6000 PROCESSING_UNITS

"want_scale_to %%{data:ignore}cannot scale to %%{number:want_scale_to}%%{data:ignore}"