Open guoliangmiao opened 1 month ago
@guoliangmiao, according to the log, dcgm-exporter started.
I also see that performance (DCP) metrics aren't supported on your hardware: "Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded".
I found that adding the `runtime: nvidia` field to the deployment file solves the problem, but I don't quite understand this behavior, because default_container_runtime is already set to nvidia in the containerd configuration on each node.
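For anyone hitting the same symptom, here is a minimal sketch of where such a field could sit in the DaemonSet. It assumes the field referred to above corresponds to the pod-level `runtimeClassName`, and that a RuntimeClass named `nvidia` with handler `nvidia` exists in the cluster. Neither assumption is confirmed by the manifest posted in this issue. The container name and image are placeholders.

```yaml
# Assumption: a RuntimeClass named "nvidia" maps to the containerd
# runtime handler "nvidia" configured on each node.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
    spec:
      # Explicitly request the NVIDIA runtime for this pod instead of
      # relying on the node's default runtime.
      runtimeClassName: nvidia
      containers:
        - name: dcgm-exporter                                # placeholder name
          image: example.invalid/dcgm-exporter:placeholder   # hypothetical image
```

If the pod only works with the explicit runtime class, it may be worth checking which runtime containerd actually reports as the default on the affected nodes.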
Ask your question
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/component: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter
        app.kubernetes.io/name: dcgm-exporter
      namespace: monitoring
    spec:
      containers:
=================================================================================================
2024/09/26 04:12:34 maxprocs: Leaving GOMAXPROCS=192: CPU quota undefined
2024-09-26T12:12:34.555993494+08:00 time="2024-09-26T04:12:34Z" level=info msg="Starting dcgm-exporter"
2024-09-26T12:12:34.556005237+08:00 time="2024-09-26T04:12:34Z" level=debug msg="Debug output is enabled"
2024-09-26T12:12:34.556632224+08:00 time="2024-09-26T04:12:34Z" level=debug msg="Command line: /usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-09-26T04:12:34Z" level=debug msg="Loaded configuration" dump="&{CollectorsFile:/etc/dcgm-exporter/dcp-metrics-included.csv Address::9400 CollectInterval:30000 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:false RemoteHEInfo:localhost:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:300000 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:300000 EnableDCGMLog:false DCGMLogLevel:NONE PodResourcesKubeletSocket:/var/lib/kubelet/pod-resources/kubelet.sock HPCJobMappingDir: NvidiaResourceNames:[]}"
time="2024-09-26T04:12:34Z" level=info msg="DCGM successfully initialized!"
2024-09-26T12:12:34.948180836+08:00 time="2024-09-26T04:12:34Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
2024-09-26T12:12:34.948188060+08:00 time="2024-09-26T04:12:34Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
2024-09-26T12:12:34.948330476+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
2024-09-26T12:12:34.948341428+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
2024-09-26T12:12:34.948345155+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
2024-09-26T12:12:34.948359042+08:00 time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-09-26T04:12:34Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
2024-09-26T12:12:34.948489134+08:00 time="2024-09-26T04:12:34Z" level=info msg="Initializing system entities of type: GPU"
time="2024-09-26T04:12:35Z" level=debug msg="System entities of type GPU initialized"
2024-09-26T12:12:35.285853835+08:00 time="2024-09-26T04:12:35Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-09-26T04:12:35Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
2024-09-26T12:12:35.285870137+08:00 time="2024-09-26T04:12:35Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-09-26T04:12:35Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-09-26T04:12:35Z" level=debug msg="Counters are initialized" dump="[{FieldID:100 FieldName:DCGM_FI_DEV_SM_CLOCK PromType:gauge Help:SM clock frequency (in MHz).} {FieldID:101 FieldName:DCGM_FI_DEV_MEM_CLOCK PromType:gauge Help:Memory clock frequency (in MHz).} {FieldID:140 FieldName:DCGM_FI_DEV_MEMORY_TEMP PromType:gauge Help:Memory temperature (in C).} {FieldID:150 FieldName:DCGM_FI_DEV_GPU_TEMP PromType:gauge Help:GPU temperature (in C).} {FieldID:155 FieldName:DCGM_FI_DEV_POWER_USAGE PromType:gauge Help:Power draw (in W).} {FieldID:156 FieldName:DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION PromType:counter Help:Total energy consumption since boot (in mJ).} {FieldID:202 FieldName:DCGM_FI_DEV_PCIE_REPLAY_COUNTER PromType:counter Help:Total number of PCIe retries.} {FieldID:203 FieldName:DCGM_FI_DEV_GPU_UTIL PromType:gauge Help:GPU utilization (in %).} {FieldID:204 FieldName:DCGM_FI_DEV_MEM_COPY_UTIL PromType:gauge Help:Memory utilization (in %).} {FieldID:206 FieldName:DCGM_FI_DEV_ENC_UTIL PromType:gauge Help:Encoder utilization (in %).} {FieldID:207 FieldName:DCGM_FI_DEV_DEC_UTIL PromType:gauge Help:Decoder utilization (in %).} {FieldID:230 FieldName:DCGM_FI_DEV_XID_ERRORS PromType:gauge Help:Value of the last XID error encountered.} {FieldID:251 FieldName:DCGM_FI_DEV_FB_FREE PromType:gauge Help:Framebuffer memory free (in MiB).} {FieldID:252 FieldName:DCGM_FI_DEV_FB_USED PromType:gauge Help:Framebuffer memory used (in MiB).} {FieldID:449 FieldName:DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL PromType:counter Help:Total number of NVLink bandwidth counters for all lanes.} {FieldID:526 FieldName:DCGM_FI_DEV_VGPU_LICENSE_STATUS PromType:gauge Help:vGPU License status} {FieldID:393 FieldName:DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for uncorrectable errors} {FieldID:394 FieldName:DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS PromType:counter Help:Number of remapped rows for correctable errors} {FieldID:395 FieldName:DCGM_FI_DEV_ROW_REMAP_FAILURE PromType:gauge Help:Whether remapping of rows has failed} {FieldID:1 FieldName:DCGM_FI_DRIVER_VERSION PromType:label Help:Driver Version}]"
time="2024-09-26T04:12:35Z" level=info msg="Kubernetes metrics collection enabled!"
2024-09-26T12:12:35.321315631+08:00 time="2024-09-26T04:12:35Z" level=info msg="Pipeline starting"
2024-09-26T12:12:35.321320811+08:00 time="2024-09-26T04:12:35Z" level=info msg="Starting webserver"
2024-09-26T12:12:35.321682302+08:00 time="2024-09-26T04:12:35Z" level=info msg="Listening on" address="[::]:9400"
2024-09-26T12:12:35.321690668+08:00 time="2024-09-26T04:12:35Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
=======================================================================================================
Shown above are the key parts of the deployment file and the logs, captured with debug mode enabled. I have not yet been able to determine the root cause of the issue. Please help me with this.