Dabz / ccloudexporter

Prometheus exporter for the Confluent Cloud Metrics API
https://docs.confluent.io/current/cloud/metrics-api.html

Timeout issue on EKS #85

Closed · danksim closed this issue 3 years ago

danksim commented 3 years ago

I converted the ccloudexporter Kubernetes manifests into a Helm chart and am running into a timeout issue.

The deployment has these env vars set:

env:
- name: CCLOUD_API_KEY
  value: "vault:secret/grafana/kafka/ccloud#CCLOUD_API_KEY"
- name: CCLOUD_API_SECRET
  value: "vault:secret/grafana/kafka/ccloud#CCLOUD_API_SECRET"
- name: CCLOUD_CLUSTER
  value: {{ .Values.cluster }}
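For context, the chart's values file only carries the cluster ID; the key name below is inferred from the {{ .Values.cluster }} reference above:

# values.yaml (sketch)
cluster: lkc-xxxxx  # Confluent Cloud cluster ID, rendered into CCLOUD_CLUSTER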

Seeing:

kubectl logs -n grafana ccloud-exporter-deployment-cdcbbbb67-wq9hr -f
{
  "error": "Get \"https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\": dial tcp 52.38.184.52:443: i/o timeout",
  "level": "fatal",
  "msg": "HTTP query for the descriptor endpoint failed",
  "time": "2021-08-31T18:18:57Z"
}

which tells me DNS resolution works (the hostname resolved to 52.38.184.52) and this isn't an auth problem with the API key/secret; the TCP connection itself is timing out.

Did a little test with a test pod:

› cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-name
  namespace: grafana
spec:
  containers:
  - name: test-pod-name
    image: nicolaka/netshoot # image omitted in the original paste; anything with bash and curl works
    env:
    - name: CCLOUD_API_KEY
      value: vault:secret/grafana/kafka/ccloud#CCLOUD_API_KEY
    - name: CCLOUD_API_SECRET
      value: vault:secret/grafana/kafka/ccloud#CCLOUD_API_SECRET
    command: ["/bin/bash", "-c"]
    args:
    - curl -u $CCLOUD_API_KEY:$CCLOUD_API_SECRET https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources\?resource_type\=kafka
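
Applied it the usual way (the namespace comes from the manifest):

kubectl apply -f test.yaml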

and I see:

› kubectl logs -n grafana test-pod-name -f
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   590  100   590    0     0      2      0  0:04:55  0:03:35  0:01:20   131
{"data":[{"type":"kafka","description":"A Kafka cluster","labels":[{"description":"ID of the Kafka cluster","key":"kafka.id"}]},{"type":"connector","description":"A Kafka Connector","labels":[{"description":"ID of the connector","key":"connector.id"}]},{"type":"ksql","description":"A ksqlDB application","labels":[{"description":"ID of the ksqlDB application","key":"ksql.id"}]},{"type":"schema_registry","description":"A schema registry","labels":[{"description":"ID of the schema registry","key":"schema_registry.id"}]}],"meta":{"pagination":{"page_size":100,"total_size":4}},"links":{}}%

and sometimes:

› kubectl logs -n grafana test-pod-name -f
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:04:17 --:--:--     0
curl: (28) Failed to connect to api.telemetry.confluent.cloud port 443: Connection timed out

Looks like outbound connections to this endpoint are barely working: even the successful run took about three and a half minutes to fetch 590 bytes, and other runs never establish the TCP connection at all and time out.
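
To make the flaky connects easier to see, the same request can be run with explicit timeouts and retries (standard curl flags, same endpoint as above):

# fail fast on a stuck TCP connect and retry a few times
curl --connect-timeout 10 --max-time 60 --retry 5 \
  -u $CCLOUD_API_KEY:$CCLOUD_API_SECRET \
  "https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources?resource_type=kafka"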

Any idea what could cause this in EKS?