cloudspannerecosystem / autoscaler

Automatically scale the capacity of your Spanner instances based on their utilization.
Apache License 2.0

Feature Proposal: Asymmetric Scaling Policy #357

Open · alexlo03 opened this issue 1 month ago

alexlo03 commented 1 month ago

Background

The current autoscaler logic is roughly as follows (a sketch of the sizing step is included after the list):

  1. Collect metrics over a time interval (default: last 5m)
    1. Over that interval, take the max value of each metric (High Priority CPU, 24H CPU, Storage)
  2. Determine the appropriate Spanner size
    1. maxSuggestedSize = min_size
    2. Count Spanner databases: maxSuggestedSize = size implied by the number of databases
    3. For each metric, determine a proposed size (subject to scaleInLimit in the LINEAR method), with rounding
    4. maxSuggestedSize = max of all metric proposals
    5. Final proposal = min(maxSuggestedSize, max_size)
  3. Check the final proposal and scale
    1. If a scaling operation is already in progress: exit
    2. If within the cooldown period: exit
    3. Perform the scaling event
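
In case it helps frame the discussion, here is a minimal TypeScript sketch of what step 2 amounts to. Names like `suggestSize` and `MetricProposal` are mine for illustration, not the autoscaler's actual identifiers:

```typescript
// Hypothetical sketch of step 2 above; names are illustrative only.
interface MetricProposal {
  name: string;            // e.g. "high_priority_cpu", "rolling_24_hr", "storage"
  suggestedSize: number;   // size implied by this metric (rounded, scaleInLimit already applied)
}

function suggestSize(
  minSize: number,
  maxSize: number,
  sizeImpliedByDatabaseCount: number,
  proposals: MetricProposal[],
): number {
  // Start from the configured minimum, then the floor implied by the number of databases.
  let maxSuggestedSize = Math.max(minSize, sizeImpliedByDatabaseCount);

  // Greedy: every metric proposes a size; take the largest.
  for (const p of proposals) {
    maxSuggestedSize = Math.max(maxSuggestedSize, p.suggestedSize);
  }

  // Never exceed the configured maximum.
  return Math.min(maxSuggestedSize, maxSize);
}
```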

Problems

  1. The autoscaler logic is greedy, which can lead to scaling "bouncing": if you have a "calmer" 5m period, it will scale you in. Spanner scaling events are not smooth experiences, and we'd like to avoid them when possible. Here is an important production database (this example is from today): [Screenshot 2024-07-16 at 09:50:40]

We don't want that scale-in behavior, but we do want the immediate scale-out behavior. We'd prefer an "Asymmetric Policy", for example:

Scale-in policy: "only scale in when it is clearly a good idea." If every evaluation over the last 1h/2h/4h wanted to scale in, without exception, then scale in.

Scale-out policy: "scale out whenever things get hot." If the last 5m shows that more capacity is needed, go ahead and scale out.

This is not easily expressible right now.
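
For illustration only, a sketch of how that asymmetry could be expressed. The window lengths, the `sizeOverWindow` helper, and the function names are all made up for the example:

```typescript
// Illustrative only: an asymmetric decision based on two lookback windows.
// sizeOverWindow is a hypothetical helper that returns the suggested size at
// each evaluation within the given window.
const SCALE_OUT_WINDOW_MINUTES = 5;    // react quickly when hot
const SCALE_IN_WINDOW_MINUTES = 120;   // e.g. 1h/2h/4h, configurable

function decideScaling(
  currentSize: number,
  sizeOverWindow: (windowMinutes: number) => number[],
): number {
  // Scale out: the most recent short window is enough.
  const recent = sizeOverWindow(SCALE_OUT_WINDOW_MINUTES);
  if (recent.length > 0) {
    const recentMax = Math.max(...recent);
    if (recentMax > currentSize) {
      return recentMax;  // scale out immediately
    }
  }

  // Scale in: only if *every* evaluation over the long window wanted a smaller size.
  const long = sizeOverWindow(SCALE_IN_WINDOW_MINUTES);
  if (long.length > 0 && long.every((s) => s < currentSize)) {
    return Math.max(...long);  // conservative: largest size suggested in the window
  }

  return currentSize;  // otherwise, hold
}
```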

  2. Asymmetric Metrics Error Handling: in the case where metrics are bad (see https://github.com/cloudspannerecosystem/autoscaler/issues/355), we'd want different behavior for scale-out vs. scale-in. If you get an incomplete metric set (say a CPU metric reads zero), the possibility of scaling in should be discarded entirely (fail static). If a single metric does return a signal that scaling out is warranted, then scale-out should happen; an incomplete signal can be enough to scale out. Again, this could be treated specially in a Custom Scaling Method. A sketch of this asymmetry follows.
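
A rough sketch of the fail-static asymmetry described above (again, all names here are hypothetical, not the project's API):

```typescript
// Illustrative fail-static handling of an incomplete metric set.
interface MetricSample {
  name: string;
  value: number | null;  // null / 0 standing in for a missing or bad reading
}

function metricsCompleteEnoughToScaleIn(samples: MetricSample[]): boolean {
  // Any missing or zeroed metric discards the possibility of scaling in (fail static).
  return samples.every((s) => s.value !== null && s.value > 0);
}

function anyMetricDemandsScaleOut(
  samples: MetricSample[],
  thresholds: Record<string, number>,
): boolean {
  // A single metric over its threshold is enough to justify scaling out,
  // even if other metrics are missing.
  return samples.some(
    (s) => s.value !== null && s.value > (thresholds[s.name] ?? Infinity),
  );
}
```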

Summary

I think I can make this work using a Custom Scaling Method, and maybe I will do that, but users of this project probably want the same things I do, so addressing this in the core project would be a good idea. Thanks.

==

Side note: how the "Want Scale To" log metric works:

Examples:
            storage=0.5885764076510477%, BELOW the range [70%-80%] => however, cannot scale to 100 because it is lower than MIN 6000 PROCESSING_UNITS
            high_priority_cpu=7.566610239183218%, BELOW the range [60%-70%] => however, cannot scale to 700 because it is lower than MIN 6000 PROCESSING_UNITS

"want_scale_to %%{data:ignore}cannot scale to %%{number:want_scale_to}%%{data:ignore}"