jelmd opened this issue 3 years ago
> The dcgm-exporter Helm chart by default gathers DCP metrics (`DCGM_FI_PROF_*`).

Not sure what the dcgm-exporter Helm chart is. I compiled dcgm-exporter from source and use it in pure LXC containers.
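For reference, this is roughly how I start it inside the container. Just a sketch - the `-f` (counters file) and `-a` (listen address) flags are what I understand from `dcgm-exporter --help`, and the path is illustrative, not my real setup:

```sh
# Illustrative invocation of the source-built exporter (path is an example):
# -f points at the counters csv shown further below, -a sets the address of the /metrics endpoint.
dcgm-exporter -f /etc/dcgm-exporter/custom-counters.csv -a ":9400"
```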
> You can override the `arguments` field to provide your own metrics - so to answer your questions:

Not sure what you really mean by "override the `arguments` field" and "provide your own metrics". I do not intend to create/provide any metrics other than those already provided by dcgm-exporter (at least in theory).
> * this is already supported by providing your own custom csv file to gather the metrics you need

Yes, my config file looks like this:

```
DCGM_FI_DEV_BAR1_TOTAL, gauge, Total BAR1 of the GPU in MB.
DCGM_FI_DEV_BAR1_USED, gauge, Used BAR1 of the GPU in MB.
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_VIDEO_CLOCK, gauge, Video encoder/decoder clock for the device.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS, gauge, Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*).
DCGM_FI_DEV_MAX_SM_CLOCK, gauge, Maximum supported SM clock for the device.
DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, Maximum supported Memory clock for the device.
DCGM_FI_DEV_MAX_VIDEO_CLOCK, gauge, Maximum supported Video encoder/decoder clock for the device.
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
#DCGM_FI_DEV_MEM_MAX_OP_TEMP, gauge, Maximum operating temperature for the memory of this GPU.
#DCGM_FI_DEV_GPU_MAX_OP_TEMP, gauge, Maximum operating temperature for this GPU.
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_SLOWDOWN_TEMP, gauge, Slowdown temperature for the device.
DCGM_FI_DEV_SHUTDOWN_TEMP, gauge, Shutdown temperature for the device.
DCGM_FI_DEV_POWER_MGMT_LIMIT, gauge, Current Power limit for the device.
DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN, gauge, Minimum power management limit for the device.
DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, Maximum power management limit for the device.
DCGM_FI_DEV_POWER_MGMT_LIMIT_DEF, gauge, Default power management limit for the device.
DCGM_FI_DEV_ENFORCED_POWER_LIMIT, gauge, Effective power limit that the driver enforces after taking into account all limiters.
DCGM_FI_DEV_PSTATE, gauge, Performance state (P-State) 0-15. 0=highest.
DCGM_FI_DEV_FAN_SPEED, gauge, Fan speed for the device in percent 0-100.
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
DCGM_FI_DEV_GRAPHICS_PIDS, gauge, Graphics processes running on the GPU.
DCGM_FI_DEV_COMPUTE_PIDS, gauge, Compute processes running on the GPU.
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
DCGM_FI_DEV_FB_TOTAL, gauge, Total Frame Buffer of the GPU (in MB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
#DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, gauge, Number of remapped rows for uncorrectable errors.
#DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, gauge, Number of remapped rows for correctable errors.
#DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed.
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
#
# NOTE: Although the metrics above that start with a '#' are listed in the documentation (bindings/go/dcgm/dcgm_fields.h), dcgm-exporter does not know anything about them, which is why they are commented out.
```
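For completeness, a rough sketch of what the exporter then exposes on its /metrics endpoint with this file - the metric names and HELP texts come straight from the csv; the label set is abbreviated and the value is made up:

```sh
# Query the exporter's metrics endpoint (port from the -a flag above):
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP
# Prints something like:
#   # HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
#   # TYPE DCGM_FI_DEV_GPU_TEMP gauge
#   DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-..."} 43
```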
> * DCP metrics are enabled by default. However, since they are only supported on datacenter GPUs, starting with 2.1.2, we added some error handling so the pod doesn't crash on unsupported GPUs

Hmmm, ok, most of our GPUs are RTX 2080 Ti, but we have some Tesla V100-SXM2-32GB as well, so I'll check.
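My plan for the check is roughly the following - assuming `dcgmi profile --list` behaves as documented, it should list the supported DCP metric groups on the V100s and report an error on the GeForce cards:

```sh
nvidia-smi -L        # enumerate the GPUs (RTX 2080 Ti vs. Tesla V100-SXM2-32GB)
dcgmi profile -l     # list supported DCP metric groups; expected to fail on non-datacenter GPUs
```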
> * re: already running nv-hostengine, let me get back to you on this request. What is the use-case for this scenario?

Not really a use case. The documentation just says that the API supports at least three modes: a) embedded, b) client-server, and c) mixed, i.e. client-server where the client forks and controls the server (the client obviously being dcgm-exporter and the server being nv-hostengine).

So by asking about the use case you are indirectly asking about the purpose of nv-hostengine. Hmmm, I've no idea, because most of this can be done via nvidia-smi. Perhaps it is intended to make nvidia-smi more lightweight and to let a central instance de-multiplex configuration requests? But as said, I've no idea. Anyway, since nv-hostengine can control/manage GPUs - and dcgmi can apparently be used to enable DCP - it makes sense to run it as a single server process on the host and use dcgm-exporter just as a lightweight collector that exposes the metrics to the intended audience. As it stands, however, dcgm-exporter appears to be anything but lightweight and has a notable influence on the metrics it collects, which is of course a really bad thing.
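To illustrate what I mean by mode (b): one host engine per host, and arbitrary clients attaching to it. A sketch, assuming the documented defaults (TCP port 5555 on the local host) apply:

```sh
# Start the host engine once on the host; by default it listens on TCP port 5555.
nv-hostengine

# Any DCGM client can then talk to that single instance instead of embedding its own:
dcgmi discovery -l              # list the GPUs the host engine sees
dcgmi dmon -e 1001,1002 -c 5    # sample two DCP fields (GR_ENGINE_ACTIVE, SM_ACTIVE) five times
```

This is the kind of setup I would like dcgm-exporter to attach to as just another client, instead of spawning its own embedded host engine.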
Right now it seems that dcgm-exporter always starts in embedded mode, i.e. there is no way to use an already running nv-hostengine, and therefore there seems to be no way to enable the profiling metrics (`DCGM_FI_PROF_*`). So it would be nice to: