Closed seungyongshim closed 4 years ago
@seungyongshim
Thank you for using nvidiagpubeat and reporting the issue. I have fixed it and verified the fix both locally and on an NVIDIA GPU machine. The fix is available on both the master (Beats 6.x) and withBeats7.3 (Beats 7.x) branches. Please verify.
The config used for the test below is:
query: "name,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
env: "test"
Snippet of the result with Beats 6.x, running locally on macOS:
export PATH=$PATH:.
./nvidiagpubeat -c nvidiagpubeat.yml -e -d "*" -E seccomp.enabled=false
2020-05-30T21:56:19.664-0700 INFO instance/beat.go:278 Setup Beat: nvidiagpubeat; Version: 6.5.5
2020-05-30T21:56:19.664-0700 DEBUG [beat] instance/beat.go:299 Initializing output plugins
2020-05-30T21:56:19.664-0700 DEBUG [processors] processors/processor.go:66 Processors:
2020-05-30T21:56:19.664-0700 INFO elasticsearch/client.go:163 Elasticsearch url: http://localhost:9200
2020-05-30T21:56:19.665-0700 DEBUG [publish] pipeline/consumer.go:137 start pipeline event consumer
2020-05-30T21:56:19.666-0700 INFO [publisher] pipeline/module.go:110 Beat name: LOCAL_HOST_NAME
2020-05-30T21:56:19.666-0700 INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
2020-05-30T21:56:19.666-0700 INFO instance/beat.go:400 nvidiagpubeat start running.
2020-05-30T21:56:19.666-0700 INFO beater/nvidiagpubeat.go:57 nvidiagpubeat is running for ** test ** environment. ! Hit CTRL-C to stop it.
2020-05-30T21:56:20.667-0700 DEBUG [nvidiagpubeat] nvidia/metrics.go:43 Determine number of gpu cards.
2020-05-30T21:56:20.667-0700 DEBUG [nvidiagpubeat] nvidia/metrics.go:49 Number of gpu cards 4.
2020-05-30T21:56:20.667-0700 INFO nvidia/gpu.go:57 Running query: name,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate with gpuCount 4
2020-05-30T21:56:20.676-0700 DEBUG [nvidiagpubeat] beater/nvidiagpubeat.go:77 Event generated, Attempting to publish to configured output.
2020-05-30T21:56:20.676-0700 DEBUG [publish] pipeline/processor.go:308 Publish event: {
"@timestamp": "2020-05-31T04:56:20.676Z",
"@metadata": {
"beat": "nvidiagpubeat",
"type": "doc",
"version": "6.5.5"
},
"pstate": 0,
"type": "nvidiagpubeat",
"fan": {
"speed": "[NotSupported]"
},
"name": "Tesla100-PCIE-16GB",
"memory": {
"total": 16280,
"used": 1628
},
"power": {
"draw": 24.8,
"limit": 250
},
"driver_version": "418.87.01",
"utilization": {
"gpu": 10,
"memory": 10
},
"gpuIndex": 0,
"host": {
"name": "LOCAL_HOST_NAME"
},
"count": 4,
"temperature": {
"gpu": 25
},
"clocks": {
"sm": 405,
"mem": 715,
"gr": 405
},
"index": 0,
"beat": {
"name": "LOCAL_HOST_NAME",
"hostname": "LOCAL_HOST_NAME",
"version": "6.5.5"
}
}
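For readers following along, here is a minimal sketch of how one comma-separated result row could be turned into a nested event like the one above. It is not the code in nvidia/gpu.go. The names `parseValue`, `putNested`, and `buildEvent` are hypothetical, and the sample row is made up from values in the event shown above. The idea is that a value that does not parse as a number (a GPU name, a driver version, `[NotSupported]`) is kept as a string rather than reported as 0.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseValue (hypothetical helper) converts one CSV cell into an int64 or a
// float64, and falls back to the trimmed string when neither conversion works.
func parseValue(raw string) interface{} {
	s := strings.TrimSpace(raw)
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	return s
}

// putNested (hypothetical helper) stores value under a dotted key, so that
// "fan.speed" becomes {"fan": {"speed": value}}. A non-map value already
// stored at an intermediate key is replaced by a map.
func putNested(event map[string]interface{}, key string, value interface{}) {
	parts := strings.Split(key, ".")
	node := event
	for _, p := range parts[:len(parts)-1] {
		child, ok := node[p].(map[string]interface{})
		if !ok {
			child = map[string]interface{}{}
			node[p] = child
		}
		node = child
	}
	node[parts[len(parts)-1]] = value
}

// buildEvent (hypothetical helper) pairs the query fields with the cells of
// one CSV row and returns the nested event.
func buildEvent(query, row string) (map[string]interface{}, error) {
	fields := strings.Split(query, ",")
	cells := strings.Split(row, ",")
	if len(fields) != len(cells) {
		return nil, fmt.Errorf("query has %d fields but row has %d values", len(fields), len(cells))
	}
	event := map[string]interface{}{}
	for i, f := range fields {
		putNested(event, strings.TrimSpace(f), parseValue(cells[i]))
	}
	return event, nil
}

func main() {
	// Sample data only: a shortened query and a made-up row.
	query := "name,driver_version,fan.speed,memory.total,power.draw,power.limit"
	row := "Tesla100-PCIE-16GB, 418.87.01, [NotSupported], 16280, 24.8, 250"

	event, err := buildEvent(query, row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := json.MarshalIndent(event, "", "  ")
	fmt.Println(string(out))
}
```

The sketch tries the integer conversion first so that whole numbers such as `power.limit` stay integers, and only then falls back to a float or a string.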
@seungyongshim
Thanks.
My role and responsibilities (R&R) have changed, but my ex-boss will be happy with your fix.
Excuse me.
My config is here.
My Result
gpu_name is 0, power.limit is 0, and other fields also have an incorrect value of 0.
https://github.com/eBay/nvidiagpubeat/blob/a013ac43282923c49179b996fce9c6263b4a0acf/nvidia/gpu.go#L90
Please fix it. My boss is going to be happy.
Best regards.