amazon-archives / k8s-cloudwatch-adapter

An implementation of Kubernetes Custom Metrics API for Amazon CloudWatch
Apache License 2.0

ExternalMetric reports incorrect value #52

Open rbrigden opened 4 years ago

rbrigden commented 4 years ago

We have been noticing inconsistencies between the metric value reported by the HPA and the metric value reported by CloudWatch (CW). We are struggling to scale our system to keep up with a work queue and would appreciate some clarity.

I have the following setup for a custom metric that is published to CW every 15 minutes. The metric is named QUEUE_SIZE, lives in the OUR/NAMESPACE namespace, and has a single dimension, QUEUE.

ExternalMetric

apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: <replace>-queue-length
spec:
  name: <replace>-queue-length
  resource:
    resource: "deployment"
  queries:
    - id: <replace>
      metricStat:
        metric:
          namespace: "OUR/NAMESPACE"
          metricName: "QUEUE_SIZE"
          dimensions:
            - name: QUEUE
              value: "<replace>"
        period: 1800
        stat: Average
        unit: Count
      returnData: true

HPA

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: <replace>-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: our-deployment
  minReplicas: 1
  maxReplicas: 200
  metrics:
    - type: External
      external:
        metricName: <replace>-queue-length
        targetAverageValue: 10

We run the CW query directly, as suggested in another issue:

aws cloudwatch get-metric-statistics --metric-name QUEUE_SIZE --start-time 2020-09-13T07:30:00z --end-time 2020-09-13T08:20:00z --period=1800 --namespace OUR/NAMESPACE --statistics Average --dimensions Name=QUEUE,Value=<replace> --unit Count
{
    "Label": "QUEUE_SIZE",
    "Datapoints": [
        {
            "Timestamp": "2020-09-13T07:30:00Z",
            "Average": 381.8333333333333,
            "Unit": "Count"
        }
    ]
}

We inspect the HPA and see the following:

Name:                                                 <replace>-scaler
Namespace:                                            web
Labels:                                               <none>
Annotations:                                          <none>
CreationTimestamp:                                    Sun, 13 Sep 2020 00:49:13 -0700
Reference:                                            Deployment/our-deployment
Metrics:                                              ( current / target )
  "<replace>-queue-length" (target average value):  10778m / 10
Min replicas:                                         1
Max replicas:                                         200
Deployment pods:                                      36 current / 36 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from external metric <replace>-queue-length(nil)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range

The value reported by the HPA (10778m, i.e. 10.778) appears to be nowhere close to the value in CW (an Average of about 381.83). We follow the logs of the metrics adapter, and they indicate that it captures and reports the external metric successfully.

We would appreciate any tips on getting the correct metric value supplied to the HPA. Thanks!
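
One possible reading of the numbers above, offered as a sketch rather than a confirmed diagnosis: when an external metric is configured with targetAverageValue, the HPA divides the metric value by the current number of pods before comparing it to the target, and it prints the result as a Kubernetes quantity in which the "m" suffix means thousandths. The snippet below redoes that arithmetic with the values from the HPA output. The helper parse_quantity is hypothetical and only handles plain numbers and the "m" suffix, and the 0.1 tolerance is assumed to be the controller default (--horizontal-pod-autoscaler-tolerance), which this thread does not confirm for the cluster in question.

```python
import math


def parse_quantity(q: str) -> float:
    """Hypothetical helper: parse a Kubernetes-style quantity.

    Only plain numbers and the 'm' (milli) suffix are handled here.
    """
    if q.endswith("m"):
        return int(q[:-1]) / 1000
    return float(q)


# Values copied from the `kubectl describe hpa` output in this issue.
per_pod = parse_quantity("10778m")  # "current" column shown by the HPA
target = parse_quantity("10")       # targetAverageValue
pods = 36                           # "Deployment pods: 36 current"

# Assumption: default controller tolerance; not confirmed for this cluster.
TOLERANCE = 0.1

# With targetAverageValue the displayed value is (metric / pods),
# so the raw metric implied by the HPA output is per_pod * pods.
implied_total = per_pod * pods

# The HPA only rescales when the usage ratio leaves the tolerance band.
usage_ratio = per_pod / target
within_tolerance = abs(usage_ratio - 1.0) <= TOLERANCE

# Replica count the per-pod formula would yield if scaling were triggered.
replicas_if_scaled = math.ceil(implied_total / target)

print(f"per-pod value       : {per_pod}")
print(f"implied raw metric  : {implied_total:.3f}")
print(f"usage ratio         : {usage_ratio:.4f}")
print(f"within tolerance    : {within_tolerance}")
print(f"replicas if rescaled: {replicas_if_scaled}")
```

Under these assumptions the implied raw metric is about 388, which is in the same range as the CW Average of about 381.83 (the two readings need not come from the same time window). The usage ratio of roughly 1.08 also sits inside a 10% tolerance band, which would be consistent with the "recommended size matches current size" condition at 36 pods.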

chankh commented 4 years ago

Hi, can you also provide the log output from the adapter?