DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Does DD agent collect container TCP/UDP connection state from cadvisor? #3242

Open dndungu opened 5 years ago

dndungu commented 5 years ago

I have read all the docs and searched all of the open source DD code, but I could not find any way to enable collection of TCP/UDP state metrics. We are interested in getting the number of incoming connections to a pod. I can see the metrics in cadvisor.

xvello commented 5 years ago

Hi @dndungu

The kubelet check already collects several network metrics for each container. This is where one would add connection state metric collection.

Looking at the openmetrics payload from the kubelet, I'm seeing two gauges matching your description:

# HELP container_network_tcp_usage_total tcp connection usage statistic for container
# TYPE container_network_tcp_usage_total gauge
# HELP container_network_udp_usage_total udp connection usage statistic for container
# TYPE container_network_udp_usage_total gauge

Are these the ones you had in mind? If so, I'll be adding these to our roadmap.
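
For reference, here is a sketch of what individual samples for these gauges could look like once populated. The tcp_state / udp_state label names, state values, and the other labels are assumptions based on cadvisor's exposition format, not samples taken from this cluster:

container_network_tcp_usage_total{container_name="web",pod_name="web-0",tcp_state="established"} 12
container_network_tcp_usage_total{container_name="web",pod_name="web-0",tcp_state="time_wait"} 3
container_network_udp_usage_total{container_name="web",pod_name="web-0",udp_state="listen"} 1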

dndungu commented 5 years ago

Hi @xvello,

Yes, these metrics are what I have in mind. We want to monitor the open network connections of our containers.

# HELP container_network_tcp_usage_total tcp connection usage statistic for container
# TYPE container_network_tcp_usage_total gauge
# HELP container_network_udp_usage_total udp connection usage statistic for container
# TYPE container_network_udp_usage_total gauge

Please update us when you have an estimate on when we can get this in the DD agent.

Thanks.

xvello commented 5 years ago

Hello,

As 6.11 is already in freeze, this has been prioritized for 6.12, due out the week of May 20th.

Regards

dndungu commented 5 years ago

Thanks @xvello

CharlyF commented 5 years ago

Hey @dndungu, I just wanted to follow up on this. The metrics you mentioned come from cadvisor and are disabled by default (see https://github.com/google/cadvisor/blob/master/docs/runtime_options.md#metrics: the default disable_metrics list includes tcp and udp).

The kubelet embeds cadvisor, and that embedded instance cannot be configured, so its disable_metrics setting cannot be changed.

As a result, the only solution is to run cadvisor as a DaemonSet and enable those metrics yourself (see the DaemonSet below as an example).

We thought the metrics would be easy to collect because, up until cadvisor 0.31.0 (which was embedded in the kubelet until 1.12), disabled metrics would still show up, just with a value of 0 (see https://github.com/kubernetes/kubernetes/issues/60279). So we were seeing the metric names, but did not realize that the values were wrong.

Please find below an example manifest that runs cadvisor as a DaemonSet, configured with annotations so that the Agent autodiscovers the pods and runs a generic OpenMetrics check to retrieve those metrics:

apiVersion: apps/v1 # for Kubernetes versions before 1.9.0 use apps/v1beta2
kind: DaemonSet
metadata:
  name: cadvisor
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      annotations:
        ad.datadoghq.com/cadvisor.check_names: '["prometheus"]'
        ad.datadoghq.com/cadvisor.init_configs: '[{}]'
        ad.datadoghq.com/cadvisor.instances: '[{"prometheus_url": "http://%%host%%:8080/metrics",
          "namespace": "cadvisor", "metrics": ["container_network_tcp_usage_total",
          "container_network_udp_usage_total"]}]'
      labels:
        name: cadvisor
    spec:
      containers:
      - name: cadvisor
        args:
          - --housekeeping_interval=10s                    # kubernetes default args
          - --max_housekeeping_interval=15s
          - --event_storage_event_limit=default=0
          - --event_storage_age_limit=default=0
          - --disable_metrics=percpu # enable only diskIO, cpu, memory, network, disk, tcp, udp, process
          - --docker_only
        image: k8s.gcr.io/cadvisor:v0.33.0
        resources:
          requests:
            memory: 200Mi
            cpu: 150m
          limits:
            cpu: 300m
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
          readOnly: true
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: disk
          mountPath: /dev/disk
          readOnly: true
        ports:
          - name: http
            containerPort: 8080
            hostPort: 8080
            protocol: TCP
      terminationGracePeriodSeconds: 30
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
      - name: disk
        hostPath:
          path: /dev/disk

Once deployed, the Agent status shows the resulting prometheus check running against the cadvisor pods:

    prometheus (3.2.0)
    ------------------
      Instance ID: prometheus:cadvisor:42ceda833367bd8d [OK]
      Total Runs: 31
      Metric Samples: Last Run: 180, Total: 5,580
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 31
      Average Execution Time: 1.008s

Disclaimer: these will count as custom metrics. Also, some labels appear empty and might end up generating tags that are not really usable; I did not spend too much time digging into that.
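
If the empty labels become a problem, one option worth exploring is the prometheus check's exclude_labels instance option, which drops the listed labels before they are turned into tags. A sketch of the annotation, with hypothetical label names to exclude (check your actual payload for which labels come back empty):

        ad.datadoghq.com/cadvisor.instances: '[{"prometheus_url": "http://%%host%%:8080/metrics",
          "namespace": "cadvisor", "metrics": ["container_network_tcp_usage_total",
          "container_network_udp_usage_total"], "exclude_labels": ["image", "id"]}]'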

Finally, this is just an example of how to configure cadvisor; more details can be found in their official documentation.

As the Kubernetes community eventually wants to remove cadvisor from the kubelet, we are also going to suggest adding those metrics directly to the kubelet.

Best, .C

adammw commented 5 years ago

We (on the same team as @dndungu who's out on PTO this week) are running cadvisor as a daemonset already, but we don't have those annotations since we were using the kubelet check to point to it instead. If we migrate to using these annotations instead, is there a way of keeping the same metrics format that the kubelet check provides to not have to update any monitors?