Hey,

I was trying to get a better sense of how this all gets down to the disk collector, so here's the "trace" (might be useful for us in the future).

First, by configuring `generate_disk_config` to `false` in the ops file and passing `disk_yaml_config`, the job takes care of placing all of the data under `dd.disk_yaml_config` into the file that the disk collector from `dd-agent` looks at:
```yaml
dd.generate_disk_config:
  default: yes
  description: Generate disk configuration, disk.yaml

dd.disk_yaml_config:
  default: ""
  description: Disk integration YAML configuration
```

(see job spec)
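For context, a minimal sketch of what that combination might look like in an ops file (the instance-group/job names and the `?`-optional path segments are assumptions; the YAML payload is the config this PR ends up rendering):

```yaml
# Hypothetical ops file: turn off the generated disk.yaml and
# supply a hand-written disk integration config instead.
- type: replace
  path: /instance_groups/name=worker/jobs/name=dd-agent/properties/dd?/generate_disk_config
  value: false

- type: replace
  path: /instance_groups/name=worker/jobs/name=dd-agent/properties/dd?/disk_yaml_config
  value: |
    init_config:

    instances:
      - use_mount: yes
        tag_by_filesystem: true
        all_partitions: true
        device_blacklist:
          - /var/vcap/data/worker/work/volumes/*/**
        excluded_filesystems:
          - tracefs
```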
```sh
<% if p('dd.generate_disk_config') == true || p('dd.generate_disk_config') =~ (/(true|t|yes|y|1)$/i) %>
mkdir -p ${CONFD_DIR}/disk.d

cat <<EOF > "${CONFD_DIR}/disk.d/disk.yaml"
---
init_config:

instances:
  - use_mount: yes
    tag_by_filesystem: <%= p("dd.generate_disk_config_tag_by_filesystem", "yes") %>
    all_partitions: <%= p("dd.generate_disk_config_all_partitions", "yes") %>
    excluded_filesystems:
      - tracefs
EOF
<% elsif p('dd.disk_yaml_config') != "" %>
mkdir -p ${CONFD_DIR}/disk.d

cat <<EOF > "${CONFD_DIR}/disk.d/disk.yaml"
<%= p('dd.disk_yaml_config') %>
EOF
<% end %>
```

(see template)
Which, in the case of this PR, makes it write the following to the disk config (`/var/vcap/data/jobs/dd-agent/config/conf.d/disk.d/disk.yaml`):

```yaml
init_config:

instances:
  - use_mount: yes
    tag_by_filesystem: true
    all_partitions: true
    device_blacklist:
      - /var/vcap/data/worker/work/volumes/*/**
    excluded_filesystems:
      - tracefs
```
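To double-check what actually landed on a worker VM, something like this should do (a sketch; the `concourse`/`worker` deployment and instance names are assumptions):

```sh
# Assumed deployment/instance names; adjust to your environment.
bosh -d concourse ssh worker/0 \
  -c 'sudo cat /var/vcap/data/jobs/dd-agent/config/conf.d/disk.d/disk.yaml'
```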
The configuration we pass here has some interesting properties:

`use_mount: true`

```yaml
## @param use_mount - boolean - required
## Instruct the check to collect using mount points instead of volumes.
#
use_mount: false
```

`tag_by_filesystem: true`

```yaml
## @param tag_by_filesystem - boolean - optional
## Instruct the check to tag all disks with their file system e.g. filesystem:ntfs.
#
# tag_by_filesystem: false
```

`device_blacklist` pointing at `/var/vcap/data/worker/work/*/**`, the place where the concourse worker mounts each of the volumes that it creates. This is currently not right (that's not a valid regexp; see the sketch after this list), but the idea remains.

```yaml
## @param device_blacklist - list of regex - optional
## Instruct the check to not collect from matching devices.
##
## Character casing is ignored on Windows. For convenience, the regular
## expressions start matching from the beginning and therefore to match
## anywhere you must prepend `.*`. For exact matches append `$`.
##
## When conflicts arise, this will override `device_whitelist`.
#
# device_blacklist:
#   - /dev/sde
#   - [FJ]:
```
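Just to make the regex point concrete, here's a minimal sketch in plain Python mirroring the match-from-the-beginning semantics the comment above describes (the device paths are made up; the check's actual matching code may differ):

```python
import re

# Hypothetical device paths, mimicking what a concourse worker mounts.
devices = [
    "/var/vcap/data/worker/work/volumes/live/abc123/volume",
    "/dev/sda1",
]

# The entry as passed in this PR is a shell-style glob, not a regex:
# the bare `*` quantifiers make it invalid, so compiling it fails.
try:
    re.compile("/var/vcap/data/worker/work/*/**")
except re.error as err:
    print(f"invalid regex: {err}")

# A regex with the same intent: matching starts from the beginning of
# the device string, so no leading `.*` is needed here.
blacklist = re.compile("/var/vcap/data/worker/work/.*")
for device in devices:
    print(device, "-> blacklisted" if blacklist.match(device) else "-> kept")
```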
To test this in practice, I went with the docker approach as documented in their guides (https://docs.datadoghq.com/agent/docker/?tab=standard) with the following args:

```sh
DOCKER_CONTENT_TRUST=1 docker run -d --name dd-agent \
  -v $(realpath ./datadog.yaml):/etc/datadog-agent/datadog.yaml \
  -v $(realpath ./disk.d/config.yaml):/etc/datadog-agent/conf.d/disk.d/config.yaml \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -e DD_API_KEY=$API_KEY \
  datadog/agent:7
```
where `datadog.yaml` is

```yaml
tags:
  - environment:ciro-test
```

and `disk.d/config.yaml` is whatever we want (e.g., the config above).
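With the container running, a one-off run of the check is a quick way to eyeball which `device` tags survive the blacklist (`agent check` is the agent's built-in single-check runner):

```sh
# Run the disk check once and print the metrics (and tags) it would
# submit, without waiting for the regular collection cycle.
docker exec -it dd-agent agent check disk
```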
With regards to the implementation of disk checking, here's where it all goes on (despite the "agent" being written in Go, the "checker" itself is in Python):
Hey,

It seems like we'll need to do similar work for network ifaces too (even though the count is an order of magnitude smaller than for disk mounts, it's still pretty big: at least one network iface per container):

^ for a single host w/out many containers, we can see that we have ~45 values for `device`

ps.: this set of values also rotates at least every hour, assuming each of those containers is a check container (thus generating even more timeseries inside Datadog, assuming they're doing timeseries in a standard way)

I'll set up an issue for that too (might lower the load on some folks' Datadog accounts a lot).
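For reference, the analogous knob on the network integration side might look like the sketch below; `excluded_interface_re` is the network check's documented regex filter, while the actual per-container interface naming here is a guess:

```yaml
# Hypothetical conf.d/network.d/conf.yaml: drop per-container
# virtual interfaces, keep the host's real NICs.
init_config:

instances:
  - collect_connection_state: false
    # Regex of interfaces to ignore; "veth.*" is a guess at how the
    # per-container devices are named on these hosts.
    excluded_interface_re: "veth.*"
```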
hey @deniseyu , I just updated the configuration (see last two commits).

wdyt?

it got us down to a good (fixed-size) number of distinct vals:

(^ that drop is from the moment I applied the config)

thanks!
By default, the Datadog agent's disk reporter wants to export metrics for every volume. Concourse workers have a LOT of volumes and we're not interested in this degree of cardinality.
Signed-off-by: Denise Yu dyu@pivotal.io
I know we normally try to be additive with changes to ops files, but I seriously don't believe that anyone wants this level of cardinality for disk usage. We're not concerned about disk usage on these volumes, only on the hosts.