DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Precision of metric returned to HPA from datadog #4086

Open oodiete opened 4 years ago

oodiete commented 4 years ago

Output of the info page (if this is a bug)

(Paste the output of the info page here)

Describe what happened:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: assortment-event-listener
  namespace: xxxxxx
spec:
  minReplicas: 1
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: assortment-event-listener
  metrics:
  - type: External
    external:
      metricName: rabbitmq.queue.messages
      metricSelector:
          matchLabels:
            rabbitmq_queue: "event_bus.case_management"
      targetValue: 50
 ➜  kubernetes git:(master) ✗ kubectl describe hpa
Name:                                        assortment-event-listener
Namespace:                                   xxxxxx
Labels:                                      <none>
Annotations:                                 kubectl.kubernetes.io/last-applied-configuration:
                                               {"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"assortment-event-listener","name...
CreationTimestamp:                           Tue, 27 Aug 2019 19:14:51 -0400
Reference:                                   Deployment/assortment-event-listener
Metrics:                                     ( current / target )
  "rabbitmq.queue.messages" (target value):  329500m / 50
Min replicas:                                1
Max replicas:                                8
Deployment pods:                             8 current / 8 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from external metric rabbitmq.queue.messages(&LabelSelector{MatchLabels:map[string]string{rabbitmq_queue: event_bus.case_management,},MatchExpressions:[],})
  ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count
Events:
  Type    Reason             Age                   From                       Message
  ----    ------             ----                  ----                       -------
  Normal  SuccessfulRescale  60m (x10 over 12h)    horizontal-pod-autoscaler  New size: 2; reason: external metric rabbitmq.queue.messages(&LabelSelector{MatchLabels:map[string]string{rabbitmq_queue: event_bus.case_management,},MatchExpressions:[],}) above target
  Normal  SuccessfulRescale  60m (x3 over 12h)     horizontal-pod-autoscaler  New size: 7; reason: external metric rabbitmq.queue.messages(&LabelSelector{MatchLabels:map[string]string{rabbitmq_queue: event_bus.case_management,},MatchExpressions:[],}) above target
  Normal  SuccessfulRescale  47m (x24 over 14h)    horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
  Normal  SuccessfulRescale  4m3s (x17 over 14h)   horizontal-pod-autoscaler  New size: 4; reason: external metric rabbitmq.queue.messages(&LabelSelector{MatchLabels:map[string]string{rabbitmq_queue: event_bus.case_management,},MatchExpressions:[],}) above target
  Normal  SuccessfulRescale  3m48s (x24 over 14h)  horizontal-pod-autoscaler  New size: 8; reason: external metric rabbitmq.queue.messages(&LabelSelector{MatchLabels:map[string]string{rabbitmq_queue: event_bus.case_management,},MatchExpressions:[],}) above target
 ➜  kubernetes git:(master) ✗ kubectl get hpa     
NAME                        REFERENCE                              TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
assortment-event-listener   Deployment/assortment-event-listener   329500m/50      1         8         8          14h
root@datadog-cluster-agent-559df79864-9k4st:/# datadog-cluster-agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.3.2)
==============================

  Status date: 2019-08-28 14:04:52.261167 UTC
  Agent start: 
  Pid: 1
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2019-08-28 14:04:52.261167 UTC

  Hostnames
  =========
    ec2-hostname: ip-xxx-xx-xx-xx
    hostname: i-0b1360486e5c7ac93
    instance-id: i-0b1360486e5c7ac93
    socket-fqdn: datadog-cluster-agent-559df79864-9k4st
    socket-hostname: datadog-cluster-agent-559df79864-9k4st
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Leader Election
  ===============
    Leader Election Status:  Running
    Leader Name is: datadog-cluster-agent-559df79864-9k4st
    Last Acquisition of the lease: Mon, 26 Aug 2019 19:07:40 UTC
    Renewed leadership: Wed, 28 Aug 2019 14:04:45 UTC
    Number of leader transitions: 15 transitions

  Custom Metrics Server
  =====================
    ConfigMap name: default/datadog-custom-metrics

    External Metrics
    ----------------
      Total: 1
      Valid: 1
      hpa:
      - name: assortment-event-listener
      - namespace: xxxxxx
      - uid: 78b82781-c920-11e9-ba51-0ec466f8e528
      labels:
      - rabbitmq_queue: event_bus.case_management
      metricName: rabbitmq.queue.messages
      ts: 1.56700101e+09
      valid: true
      value: 329.5

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Total Runs: 10,314
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 1,000
      Service Checks: Last Run: 3, Total: 30,927
      Average Execution Time : 170ms

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 10,313
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 241
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 20,867
    TimeseriesV1: 10,313

  API Keys status
  ===============
    API key ending with 3df00: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 3df00

Notice the precision of the metric in the result of datadog-cluster-agent status run on the cluster-agent, and compare it with the precision shown by kubectl get hpa and kubectl describe hpa. The worst part is that sometimes the precision matches and we get the same value on both sides (when it is an integer), so imagine the autoscaler seeing 329500m at some point, then 300 at another, and scaling up and down accordingly.

Describe what you expected:

I expected 329.5.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

DylanLovesCoffee commented 4 years ago

Hey @oodiete, thanks for raising the issue. I'd like to investigate this further but will require you to open up a support ticket so you can send a flare from the cluster-agent. This should let us inspect your logs and configurations more closely.

In your ticket, please reference this GitHub issue; attaching the output of kubectl get --raw /apis/external.metrics.k8s.io/v1beta1/namespaces/<your-namespace>/rabbitmq.queue.messages | jq would also be helpful. If you're using any label selectors for this metric, just append ?labelSelector=<label> to the command, after the metric name. Thanks! Let me know if you have any questions.

oodiete commented 4 years ago

@DylanLovesCoffee I updated the ticket with more info. I will also email the issue number to support.

oodiete commented 4 years ago
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/external.metrics.k8s.io/v1beta1/namespaces/xxxxxx/rabbitmq.queue.messages"
  },
  "items": [
    {
      "metricName": "rabbitmq.queue.messages",
      "metricLabels": {
        "rabbitmq_queue": "event_bus.case_management"
      },
      "timestamp": "2019-08-28T14:42:56Z",
      "value": "33500m"
    }
  ]
}
DylanLovesCoffee commented 4 years ago

Hey @oodiete, thanks for updating the issue with your information! The values reported by the autoscaler suffixed with m represent milli (1/1000). The cluster-agent was built to support this in https://github.com/DataDog/datadog-agent/pull/3090 (>= cluster-agent v1.3.0) and will carry the conversion over into the agent status as a float type when deemed necessary, so we should not expect any difference between the two values when scaling.
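For reference, the equivalence can be checked with the standard Kubernetes resource.Quantity type; this is a standalone sketch using k8s.io/apimachinery, not cluster-agent code, and the values are simply the ones from this issue:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// "m" is the milli suffix: 329500m means 329500 / 1000 = 329.5.
	q := resource.MustParse("329500m")
	fmt.Println(float64(q.MilliValue()) / 1000.0) // 329.5

	// Going the other way, 329.5 serializes back to "329500m" because
	// Quantity's canonical form never emits fractional digits.
	fmt.Println(resource.MustParse("329.5").String()) // 329500m
}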

oodiete commented 4 years ago

@DylanLovesCoffee thanks for the response, although I don't think that solves my problem. The HPA does not seem to understand the m and appears to treat, for example, 329500m as a very large value, so the scaling behaviour is that it scales way up when it gets 329500m and back down when it gets something like 329.500, even though they should be the same.

NAME                        REFERENCE                              TARGETS   MINPODS   MAXPODS   REPLICAS AGE
assortment-event-listener   Deployment/assortment-event-listener   329500m/50      1         8         8          14h 
avivkarbolk commented 1 year ago

@oodiete, I'm facing the same issue, were you able to solve it?

CharlyF commented 1 year ago

Hi all, let me know if I am misunderstanding the issue, but I believe this is just a data representation quirk and does not impact the actual lifecycle of this feature. The m is one of the visual representations used by the Quantity type; this type is used throughout the Kubernetes codebase and by anything implementing the API interfaces.

For instance, the Cluster Agent, which implements the External Metrics API interface and is registered as the server here, returns the ExternalMetricValueList type from the function GetExternalMetric. As you can see in the Kubernetes library, the Value is a Quantity and is used as such in the code of the Horizontal Pod Autoscaler controller. Now, when describing HPAs, as you can see here, the HPA status reports the values with the Quantity type (hence the m) for AverageValue (which is used for the External Metrics types) and an integer for AverageUtilization.
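To illustrate the types involved, here is a minimal sketch of building such a response; this is not the Cluster Agent's actual GetExternalMetric code, and the metric name, label, and value are simply the ones from this issue:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	external_metrics "k8s.io/metrics/pkg/apis/external_metrics/v1beta1"
)

func main() {
	// The float stored by the Cluster Agent (see the status output above).
	raw := 329.5

	// Quantity has no float constructor, so one way to build it is to go
	// through milli-units; 329500 milli is the canonical form of 329.5.
	value := *resource.NewMilliQuantity(int64(raw*1000), resource.DecimalSI)

	list := external_metrics.ExternalMetricValueList{
		Items: []external_metrics.ExternalMetricValue{{
			MetricName:   "rabbitmq.queue.messages",
			MetricLabels: map[string]string{"rabbitmq_queue": "event_bus.case_management"},
			Timestamp:    metav1.Now(),
			Value:        value,
		}},
	}

	// Serialized through the API, the value shows up as "329500m".
	fmt.Println(list.Items[0].Value.String()) // 329500m
}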

Per the doc on quantities:

// Before serializing, Quantity will be put in "canonical form".
// This means that Exponent/suffix will be adjusted up or down (with a
// corresponding increase or decrease in Mantissa) such that:
// - No precision is lost
// - No fractional digits will be emitted
// - The exponent (or suffix) is as large as possible.
//
// The sign will be omitted unless the number is negative.
// Examples:
// - 1.5 will be serialized as "1500m"
// - 1.5Gi will be serialized as "1536Mi"

The above is really geared towards addressing the following statement:

because the hpa does not understand the m and seems to see for example 329500m as a very very large value

For the HPA controller, 329500m = 329.5. From the status, the reason there are scaling events is that the threshold is at 50 (though I might be missing context).
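As a sanity check (a standalone sketch with k8s.io/apimachinery, not the HPA controller's actual code), the two spellings compare as equal, and the ratio against the targetValue of 50 from the manifest above is what keeps pushing the replica count toward the maximum:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	reported := resource.MustParse("329500m") // what kubectl describe hpa prints
	status := resource.MustParse("329.5")     // what the Cluster Agent status prints

	// Cmp returns 0 when two quantities are numerically equal,
	// regardless of how they were written.
	fmt.Println(reported.Cmp(status) == 0) // true

	target := resource.MustParse("50") // targetValue from the HPA spec
	ratio := float64(reported.MilliValue()) / float64(target.MilliValue())
	// ~6.59, well above 1, so the controller keeps recommending more
	// replicas (capped by maxReplicas: 8). The real controller also factors
	// in the current replica count and a tolerance band.
	fmt.Printf("%.2f\n", ratio)
}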

Lastly, to address one other potential misunderstanding: I see that the value from the Cluster Agent status is being used to compare the two representations. That is just how we chose to display the value (as Dylan explained), per this template and the humanize function here.