facebookincubator / dynolog

Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
MIT License
260 stars 38 forks source link

Support DCGM version 3 #63

Closed briancoutinho closed 1 year ago

briancoutinho commented 1 year ago

Currently dcgm headers dynolog uses are version 2. This leads to problems if the system has dcgm version 3 To fix this we need to

References https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-admin.html#auxilary-information-about-dcgm-engine