Dabz / ccloudexporter

Prometheus exporter for the Confluent Cloud Metrics API
https://docs.confluent.io/current/cloud/metrics-api.html

Feature request: add new config, pause duration between retries #87

Closed: tringuyen-yw closed this issue 3 years ago

tringuyen-yw commented 3 years ago

Description

When running the ccloud-exporter container within Kubernetes, the CCloud Metrics API rate limit of 50 requests per second is easily hit, even when the pod has livenessProbe and readinessProbe disabled. Example of such an error:

  "error": "Received status code 429 instead of 200 for POST on https://api.telemetry.confluent.cloud//v2/metrics/cloud/query ()",
  "level": "error",
  "msg": "Query did not succeed",
  "optimizedQuery": {
    "aggregations": [
      {
        "agg": "SUM",
        "metric": "io.confluent.kafka.server/partition_count"
      }
    ],
    "filter": {
      "op": "AND",
      "filters": [
        {
          "op": "OR",
          "filters": [
            {
              "field": "resource.kafka.id",
              "op": "EQ",
              "value": "<redacted>"
            }
          ]
        }
      ]
    },
    "granularity": "PT1M",
    "group_by": [],
    "intervals": [
      "2021-09-06T17:09:00Z/PT1M"
    ],
    "limit": 1000
  },
  "response": {
    "data": null
  },
  "time": "2021-09-06T17:11:30Z"
}

Once the API rate limit is triggered, the errors tend to persist indefinitely, probably because ccloud-exporter retries in quick succession without waiting long enough between metric collections.

Proposed solution

Add a new config option, secondsBetweenRetry, as the waiting time between retries when a CCloud Metrics API request has failed. Ideally this pause should apply to each individual API request, not to the whole batch of requests (e.g. 9 at a time, as in config.simple.yaml).

Even better, this pause duration could follow an exponential backoff: for example, start at 5 seconds, double on each new retry, and cap at, say, 5 minutes. A sketch of the idea follows below.
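For illustration, a minimal sketch of what such a retry loop could look like in Go (the exporter's language); queryMetrics, queryWithBackoff, and the chosen delays are hypothetical and not part of ccloudexporter:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimited stands in for the 429 error seen in the logs above.
var errRateLimited = errors.New("received status code 429")

// queryMetrics is a stand-in for a single Metrics API request.
// It fails for the first few attempts to simulate hitting the rate limit.
func queryMetrics(attempt int) error {
	if attempt < 3 {
		return errRateLimited
	}
	return nil
}

// queryWithBackoff retries a single request with exponential backoff:
// it waits initialDelay after the first failure, doubles the delay on
// each subsequent failure, and never waits longer than maxDelay.
func queryWithBackoff(initialDelay, maxDelay time.Duration, maxRetries int) error {
	delay := initialDelay
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err := queryMetrics(attempt)
		if err == nil {
			return nil
		}
		if attempt == maxRetries {
			return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
		}
		fmt.Printf("attempt %d failed (%v), retrying in %s\n", attempt, err, delay)
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return nil
}

func main() {
	// e.g. secondsBetweenRetry=5 with a 5 minute cap, as proposed above
	if err := queryWithBackoff(5*time.Second, 5*time.Minute, 6); err != nil {
		fmt.Println(err)
	}
}
```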

Dabz commented 3 years ago

Good point, in the next versions I will try to add some protection against those limitations. I am not sure what the best approach would be, though. So far I was leaning more toward implementing a cache and, in case of fast consecutive calls, returning the cached result instead of querying the API.

We could also implement a retry or a wait in the exporter, but there is a risk of hitting the Prometheus scrape timeout.
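For illustration, a rough sketch of that caching idea in Go, assuming a configurable maximum age; the cachedQuery type and its fields are hypothetical and not the actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedQuery wraps an expensive Metrics API call and serves a cached
// response when it is called again within maxAge, so fast consecutive
// scrapes do not turn into fast consecutive API requests.
type cachedQuery struct {
	mu       sync.Mutex
	maxAge   time.Duration
	fetched  time.Time
	response string
	fetch    func() string // stand-in for the real Metrics API call
}

func (c *cachedQuery) Get() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.response != "" && time.Since(c.fetched) < c.maxAge {
		return c.response // serve the cached result instead of querying the API
	}
	c.response = c.fetch()
	c.fetched = time.Now()
	return c.response
}

func main() {
	q := &cachedQuery{
		maxAge: 30 * time.Second,
		fetch: func() string {
			fmt.Println("querying Metrics API...")
			return "partition_count=42"
		},
	}
	fmt.Println(q.Get()) // queries the API
	fmt.Println(q.Get()) // served from the cache, no API call
}
```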

Dabz commented 3 years ago

I added a cache mechanism with a configurable delay in #93, which should protect the exporter from getting 429 responses with the default configuration.

By the way, there is now a dedicated liveness/readiness endpoint, so probes no longer need to trigger queries :)

tringuyen-yw commented 3 years ago

@Dabz does the docker image dabz/ccloudexporter:latest have this fix as well as the liveness/readiness endpoints?

Dabz commented 3 years ago

Yes it does, @tringuyen-yw :) It is a contribution from @raytung and it has been merged as part of #91. The endpoint is http://xx:2112/health by default.
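For reference, probes pointing at that endpoint could look roughly like this in a pod spec; only the /health path and port 2112 come from the default mentioned above, the timings are illustrative:

```yaml
# illustrative container spec fragment for the ccloud-exporter pod
livenessProbe:
  httpGet:
    path: /health
    port: 2112
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 2112
  periodSeconds: 30
```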