What does this PR do?
This PR changes the autoscaling behavior and adds tests that validate the scaling mechanisms.
TestReplicaCountConvergingDownward and TestReplicaCountConvergingDownwardBlocked are added. They check the behavior when the replica count converges downward and when that convergence is blocked, respectively.
stableRegime is introduced to track whether the metrics are within the watermarks.
canScaleAfterDelay is modified. It still allows a scaling event once the metrics have been out of bounds for longer than the duration specified in the spec, and it now also takes isStable into account, so it does not block scaling when all metrics are within the watermarks.
Test conditions are added to verify convergence towards a watermark in a stable regime, including the case where it is blocked by the forbidden window. Other scaling-related conditions and attributes are also tested.
Motivation
These changes should make autoscaling more robust and accurate, so that it adapts better to changes in system load and resource configuration.
Describe your test plan
Verify that the delay, if used, is skipped and the forbidden window is respected.
Old behavior:
{"level":"info","ts":1695245260174.7908,"logger":"controllers.WatermarkPodAutoscaler","msg":"Will not scale: value has not been out of bounds for long enough","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","time_left":-180416}
New behavior:
{"level":"info","ts":1695308929742.9177,"logger":"controllers.WatermarkPodAutoscaler","msg":"Trying to scale down to converge to High Watermark","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","optimisedReplicaCount":25,"currentReadyReplicas":26,"adjustedUsageAfterDownscale":1016053761,"highMark":1900000000}
{"level":"info","ts":1695308929742.9695,"logger":"controllers.WatermarkPodAutoscaler","msg":"Within bounds of the watermarks","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","usage":"976974770m","replicaCount":25,"currentReadyReplicas":26,"tolerance (%)":1,"adjustedLM":495000000,"adjustedHM":1919000000,"adjustedUsage":976974770}
{"level":"info","ts":1695308929743.0073,"logger":"controllers.WatermarkPodAutoscaler","msg":"External Metric replica calculation","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","metricName":"container.memory.usage","replicaCount":25,"utilizationQuantity":976974770,"timestamp":1695308850000,"currentReadyReplicas":26}
{"level":"info","ts":1695308929743.0874,"logger":"controllers.WatermarkPodAutoscaler","msg":"Proposing replicas","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","proposedReplicas":25,"metricName":"container.memory.usage{map[cluster-name:datadogoperatorqa kube_deployment:alpine]}","reference":"Deployment/default/alpine","metric timestamp":"Thu, 21 Sep 2023 15:07:30 UTC"}
{"level":"info","ts":1695308929744.3916,"logger":"controllers.WatermarkPodAutoscaler","msg":"Normalized Desired replicas","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","desiredReplicas":25}
{"level":"info","ts":1695308929744.4622,"logger":"controllers.WatermarkPodAutoscaler","msg":"Cooldown status","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","backoffUp":false,"backoffDown":false,"desiredReplicas":25,"currentReplicas":26}
{"level":"info","ts":1695308929776.0188,"logger":"controllers.WatermarkPodAutoscaler","msg":"Successful rescale","watermarkpodautoscaler":"default/example-watermarkpodautoscaler","wpa_name":"example-watermarkpodautoscaler","wpa_namespace":"default","currentReplicas":26,"desiredReplicas":25,"rescaleReason":"Metric within watermarks, attempting to scale to converge towards watermark"}