Hey,

I was trying to get a better sense of how this all gets down to the disk collector, so here's the "trace" (might be useful for us in the future).

First, by configuring `generate_disk_config` to `false` in the ops file and passing `disk_yaml_config`, the job takes care of placing all of the data under `dd.disk_yaml_config` into the file that the disk collector from `dd-agent` looks at:
```yaml
dd.generate_disk_config:
  default: yes
  description: Generate disk configuration, disk.yaml

dd.disk_yaml_config:
  default: ""
  description: Disk integration YAML configuration
```

(see job spec)
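For context, a minimal sketch of what that combination might look like in an ops file (the instance-group/job names and the `?`-optional path segments are assumptions; the YAML payload is the config this PR ends up rendering):

```yaml
# Hypothetical ops file: turn off the generated disk.yaml and
# supply a hand-written disk integration config instead.
- type: replace
  path: /instance_groups/name=worker/jobs/name=dd-agent/properties/dd?/generate_disk_config
  value: false

- type: replace
  path: /instance_groups/name=worker/jobs/name=dd-agent/properties/dd?/disk_yaml_config
  value: |
    init_config:

    instances:
      - use_mount: yes
        tag_by_filesystem: true
        all_partitions: true
        device_blacklist:
          - /var/vcap/data/worker/work/volumes/*/**
        excluded_filesystems:
          - tracefs
```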
```sh
<% if p('dd.generate_disk_config') == true || p('dd.generate_disk_config') =~ (/(true|t|yes|y|1)$/i) %>
mkdir -p ${CONFD_DIR}/disk.d

cat <<EOF > "${CONFD_DIR}/disk.d/disk.yaml"
---
init_config:

instances:
  - use_mount: yes
    tag_by_filesystem: <%= p("dd.generate_disk_config_tag_by_filesystem", "yes") %>
    all_partitions: <%= p("dd.generate_disk_config_all_partitions", "yes") %>
    excluded_filesystems:
      - tracefs
EOF
<% elsif p('dd.disk_yaml_config') != "" %>
mkdir -p ${CONFD_DIR}/disk.d

cat <<EOF > "${CONFD_DIR}/disk.d/disk.yaml"
<%= p('dd.disk_yaml_config') %>
EOF
<% end %>
```

(see template)
Which, in the case of this PR, makes it write the following to the disk config (`/var/vcap/data/jobs/dd-agent/config/conf.d/disk.d/disk.yaml`):

```yaml
init_config:

instances:
  - use_mount: yes
    tag_by_filesystem: true
    all_partitions: true
    device_blacklist:
      - /var/vcap/data/worker/work/volumes/*/**
    excluded_filesystems:
      - tracefs
```
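To double-check what actually landed on a worker VM, something like this should do (a sketch; the `concourse`/`worker` deployment and instance names are assumptions):

```sh
# Assumed deployment/instance names; adjust to your environment.
bosh -d concourse ssh worker/0 \
  -c 'sudo cat /var/vcap/data/jobs/dd-agent/config/conf.d/disk.d/disk.yaml'
```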
The configuration we pass here has some interesting properties:

`use_mount: true`

```yaml
## @param use_mount - boolean - required
## Instruct the check to collect using mount points instead of volumes.
#
use_mount: false
```

`tag_by_filesystem: true`

```yaml
## @param tag_by_filesystem - boolean - optional
## Instruct the check to tag all disks with their file system e.g. filesystem:ntfs.
#
# tag_by_filesystem: false
```

`device_blacklist` pointing at `/var/vcap/data/worker/work/*/**`, the place where the concourse worker mounts each of the volumes that it creates. This is currently not right (that's not a valid regexp; see the sketch after this list), but the idea remains.

```yaml
## @param device_blacklist - list of regex - optional
## Instruct the check to not collect from matching devices.
##
## Character casing is ignored on Windows. For convenience, the regular
## expressions start matching from the beginning and therefore to match
## anywhere you must prepend `.*`. For exact matches append `$`.
##
## When conflicts arise, this will override `device_whitelist`.
#
# device_blacklist:
#   - /dev/sde
#   - [FJ]:
```
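Just to make the regex point concrete, here's a minimal sketch in plain Python mirroring the match-from-the-beginning semantics the comment above describes (the device paths are made up; the check's actual matching code may differ):

```python
import re

# Hypothetical device paths, mimicking what a concourse worker mounts.
devices = [
    "/var/vcap/data/worker/work/volumes/live/abc123/volume",
    "/dev/sda1",
]

# The entry as passed in this PR is a shell-style glob, not a regex:
# the bare `*` quantifiers make it invalid, so compiling it fails.
try:
    re.compile("/var/vcap/data/worker/work/*/**")
except re.error as err:
    print(f"invalid regex: {err}")

# A regex with the same intent: matching starts from the beginning of
# the device string, so no leading `.*` is needed here.
blacklist = re.compile("/var/vcap/data/worker/work/.*")
for device in devices:
    print(device, "-> blacklisted" if blacklist.match(device) else "-> kept")
```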
To test this in practice, I went with the docker approach as documented in their guides (https://docs.datadoghq.com/agent/docker/?tab=standard) with the following args:

```sh
DOCKER_CONTENT_TRUST=1 docker run -d --name dd-agent \
  -v $(realpath ./datadog.yaml):/etc/datadog-agent/datadog.yaml \
  -v $(realpath ./disk.d/config.yaml):/etc/datadog-agent/conf.d/disk.d/config.yaml \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -e DD_API_KEY=$API_KEY \
  datadog/agent:7
```
where `datadog.yaml` is

```yaml
tags:
  - environment:ciro-test
```

and `disk.d/config.yaml` is whatever we want (e.g., the config above).
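With the container running, a one-off run of the check is a quick way to eyeball which `device` tags survive the blacklist (`agent check` is the agent's built-in single-check runner):

```sh
# Run the disk check once and print the metrics (and tags) it would
# submit, without waiting for the regular collection cycle.
docker exec -it dd-agent agent check disk
```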
With regards to the implementation of disk checking, here's where it all goes on (despite the "agent" being written in Go, the "checker" itself is in Python):
Hey,

It seems like we'll need to do similar work for network ifaces too (even though the count is an order of magnitude smaller than for disk mounts, it's still pretty big: at least one network iface per container):

^ for a single host w/out many containers, we can see that we have ~45 values for `device`

ps.: this set of values also rotates at least every hour, assuming each of those containers is a check container (thus generating even more timeseries inside Datadog, assuming they're doing timeseries in a standard way)

I'll set up an issue for that too (might lower the load on some folks' Datadog accounts a lot).
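For reference, the analogous knob on the network integration side might look like the sketch below; `excluded_interface_re` is the network check's documented regex filter, while the actual per-container interface naming here is a guess:

```yaml
# Hypothetical conf.d/network.d/conf.yaml: drop per-container
# virtual interfaces, keep the host's real NICs.
init_config:

instances:
  - collect_connection_state: false
    # Regex of interfaces to ignore; "veth.*" is a guess at how the
    # per-container devices are named on these hosts.
    excluded_interface_re: "veth.*"
```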
hey @deniseyu , I just updated the configuration (see last two commits).

wdyt?

it got us down to a good (fixed-size) number of distinct vals:

(^ that drop is from the moment I applied the config)

thanks!
By default, the Datadog agent's disk reporter wants to export metrics for every volume. Concourse workers have a LOT of volumes and we're not interested in this degree of cardinality.
Signed-off-by: Denise Yu dyu@pivotal.io
I know we normally try to be additive with changes to ops files, but I seriously don't believe that anyone wants this level of cardinality for disk usage. We're not concerned about disk usage on these volumes, only on the hosts.