DataDog / dd-agent

Datadog Agent Version 5
https://docs.datadoghq.com/

Service Discovery based checks are breaking on every deploy (Kubernetes) #3570

Open · felipejfc opened this issue 7 years ago

felipejfc commented 7 years ago

Hi, I have a check that extends PrometheusCheck; the pods it should scrape are discovered via service discovery.

On every deploy I have to delete all the pods from my dd-agent DaemonSet for the checks to report consistent metrics again. The error is:

    my_check (5.18.1)
    --------------------------
      - instance #0 [ERROR]: "HTTPConnectionPool(host='100.124.142.246', port=9090): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f715b861ed0>: Failed to establish a new connection: [Errno 110] Operation timed out',))"
      - Collected 0 metrics, 0 events & 0 service checks

The host "100.124.142.246" is not the IP of any of the new pods I'm deploying (I'm updating a deployment).

It's as if the agent is not refreshing the pods it should be checking against.
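
For context, here is a minimal sketch of what such a custom check might look like on Agent 5, assuming the PrometheusCheck base class ships as checks/prometheus_check.py and exposes a process() helper; the class name, namespace, and metric mapping are illustrative placeholders, not the reporter's actual matchmaking_check.py:

    # Illustrative sketch only -- not the actual matchmaking_check.py.
    # Assumes the Agent 5 base class in checks/prometheus_check.py.
    from checks.prometheus_check import PrometheusCheck

    class MatchmakingCheck(PrometheusCheck):
        def __init__(self, name, init_config, agentConfig, instances=None):
            super(MatchmakingCheck, self).__init__(name, init_config, agentConfig, instances)
            self.NAMESPACE = 'matchmaking'  # metric prefix (hypothetical)
            self.metrics_mapper = {
                # Prometheus metric name -> Datadog metric name (hypothetical)
                'matchmaking_requests_total': 'requests.total',
            }

        def check(self, instance):
            # The endpoint comes from the service-discovery-rendered instance, e.g.
            # {"prometheus_endpoint": "http://<pod ip>:9090/metrics", "tags": []}
            endpoint = instance.get('prometheus_endpoint')
            self.process(endpoint, instance=instance)

The failure mode described in this issue is that the endpoint passed in that instance keeps pointing at an IP that no longer belongs to a live pod.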

felipejfc commented 7 years ago

@mikekap

felipejfc commented 6 years ago

ping

hkaj commented 6 years ago

Hi @felipejfc, sorry for the delay. Could you give us more details about how you deploy the agent? If you could send us the manifest of the DaemonSet you use (minus the API key, secrets, etc.), that would be great. Also, please send us a flare from the agent: https://help.datadoghq.com/hc/en-us/articles/204991415-Send-logs-and-configs-to-Datadog-via-flare-command

Finally, is 100.124.142.246 in the address space of your pods, your nodes, or something else in the cluster? I'm trying to understand how this address was resolved.

Thanks

felipejfc commented 6 years ago

Hi @hkaj, thanks for your response... let's go then:

This is the DaemonSet YAML:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: dd-agent
  name: dd-agent
  namespace: datadog
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: dd-agent
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: dd-agent
      name: dd-agent
    spec:
      containers:
      - env:
        - name: API_KEY
          value: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        - name: KUBERNETES
          value: "yes"
        - name: SD_BACKEND
          value: docker
        - name: NON_LOCAL_TRAFFIC
          value: "true"
        - name: DD_HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: datadog/docker-dd-agent:latest-alpine
        imagePullPolicy: Always
        name: dd-agent
        ports:
        - containerPort: 8125
          name: dogstatsd
          protocol: UDP
        resources:
          limits:
            memory: 400Mi
          requests:
            memory: 200Mi
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: dockersocket
        - mountPath: /host/proc
          name: procdir
          readOnly: true
        - mountPath: /host/sys/fs/cgroup
          name: cgroups
          readOnly: true
        - mountPath: /opt/datadog-agent/agent/checks.d/matchmaking_check.py
          name: matchmaker-check
          readOnly: true
          subPath: matchmaking_check.py
        - mountPath: /opt/datadog-agent/agent/checks.d/tre-check.py
          name: tre-check
          readOnly: true
          subPath: tre-check.py
        - mountPath: /opt/datadog-agent/agent/checks.d/sidecar-check.py
          name: sidecar-check
          readOnly: true
          subPath: sidecar-check.py
        - mountPath: /opt/datadog-agent/agent/conf.d/auto_conf/tre-check.yaml
          name: datadog-config-volume
          readOnly: true
          subPath: tre-check.yaml
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/run/docker.sock
        name: dockersocket
      - hostPath:
          path: /proc
        name: procdir
      - hostPath:
          path: /sys/fs/cgroup
        name: cgroups
      - configMap:
          defaultMode: 420
          name: matchmaker-check
        name: matchmaker-check
      - configMap:
          defaultMode: 420
          name: tre-check-cm
        name: tre-check
      - configMap:
          defaultMode: 420
          name: sidecar-check
        name: sidecar-check
      - configMap:
          defaultMode: 420
          name: datadog-config
        name: datadog-config-volume

The matchmaker deployment has the following annotations (an illustrative rendering of the templated instance follows the excerpt):

spec:
  replicas: 3
  selector:
    matchLabels:
      app: matchmaker-api
  template:
    metadata:
      annotations:
        service-discovery.datadoghq.com/matchmaker-api.check_names: '["matchmaking_check"]'
        service-discovery.datadoghq.com/matchmaker-api.init_configs: '[{}]'
        service-discovery.datadoghq.com/matchmaker-api.instances: '[[{"prometheus_endpoint":
          "http://%%host%%:9090/metrics", "tags":[]}]]'
...
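
To make the expected behaviour concrete: on every (re)discovery, the agent should substitute %%host%% with the pod's current IP, so the check receives an instance like the sketch below. This is only an illustration; the IP is taken from the kubectl output further down as an example of a live pod address.

    # Illustrative only: the instance the agent should hand to the check after
    # substituting %%host%% with the pod's current IP at discovery time.
    pod_ip = "100.126.130.221"  # example: a live pod IP reported by kubectl
    rendered_instance = {
        "prometheus_endpoint": "http://%s:9090/metrics" % pod_ip,
        "tags": [],
    }
    print(rendered_instance["prometheus_endpoint"])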

100.124.142.246 is in the pods' address space. It's as if, when I update the deployment and new pods are created, dd-agent does not re-run the discovery and keeps pointing to the old IPs...

A bit more information: [screenshot of a timeseries graph split by pod_name] As you can see, first there were 3 pods reporting; then I deleted all 3 pods and Kubernetes brought up another 3. The pod names stay the same because it's a StatefulSet. After the new pods are up, dd-agent only collects metrics from 2 of them; the one with no metrics is "matchmaker-worker-2".

It's running on node ip-172-20-65-131.ec2.internal:

matchmaker-worker-2               1/1       Running   0          9m        100.126.130.221   ip-172-20-65-131.ec2.internal

Then I entered the dd-agent container running on the same node, and this is what I see with bin/agent info:

    matchmaking_check (custom)
    --------------------------
      - instance #0 [ERROR]: "HTTPConnectionPool(host='100.126.130.245', port=9090): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fb181f0a5d0>: Failed to establish a new connection: [Errno 110] Operation timed out',))"
      - Collected 0 metrics, 0 events & 0 service checks

It's trying to get metrics from the wrong IP address. If I were to guess, I'd say it's the IP address of one of the old pods I've just deleted.
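
A quick way to confirm that from inside the dd-agent container on this node is to hit the /metrics endpoint on both the live pod IP and the IP the agent is still scraping. A small sketch (the IPs are the ones quoted above; the traceback shows the check already uses the requests library, but any HTTP client would do):

    # Sanity-check sketch: compare the pod IP kubectl reports with the IP the
    # agent is actually scraping. Run from the dd-agent container on that node.
    import requests

    candidates = {
        "current pod IP (kubectl)": "100.126.130.221",
        "IP the agent scrapes": "100.126.130.245",
    }
    for label, ip in candidates.items():
        url = "http://%s:9090/metrics" % ip
        try:
            resp = requests.get(url, timeout=5)
            print("%s -> %s responded with HTTP %s" % (label, url, resp.status_code))
        except requests.exceptions.RequestException as exc:
            print("%s -> %s unreachable: %s" % (label, url, exc))

If the live IP answers and the stale one times out, that confirms the agent is still using instances rendered from the old pods.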