Open aLeX1443 opened 3 years ago
A temporary workaround seems to be to remove the driver_version
from the configuration file, i.e.,
nvidiagpubeat:
  period: 1s
  query: "--query-gpu=name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  env: "dev"
output.elasticsearch:
  hosts: "${ELASTICSEARCH_HOSTS}"
Hello Alex, thank you for raising the issue along with the workaround.
Please share the output of the query below, which includes driver_version, on your NVIDIA GPU.
nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate
This way, I can re-create the issue with nvidiagpubeat/nvidiasmilocal/localnvidiasmi.go and provide a fix.
Cheers Deepak
Hello Alex, @aLeX1443
I am unable to re-create the issue.
nvidiagpubeat.yml has driver_version and looks like this:
nvidiagpubeat:
  # Defines how often an event is sent to the output
  period: 1s
  # By default the query of type query-gpu is executed to support backward compatibility
  # query: "name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  # A generic version of query is supported by nvidiagpubeat for query options like --query-gpu,--query-compute-apps and others.
  # --query-gpu will provide information about GPU.
  query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  # --query-compute-apps will list currently active compute processes.
  # query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"
  env: "test"
  # env can be test or production. test is for test purposes to evaluate functionality of this beat. Switch to production
The output of the above query on my real GPU is:
nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv
name, pci.bus_id, serial, uuid, driver_version, count, index, fan.speed [%], memory.total [MiB], memory.used [MiB], utilization.gpu [%], utilization.memory [%], temperature.gpu, power.draw [W], power.limit [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.memory [MHz], pstate
Tesla P100-PCIE-16GB, 00000000:08:00.0, 1234567890123, GPU-xxx75xxx-xxxx-xxx-xxxx-1234567890ab, 418.87.00, 1, 0, [Not Supported], 16280 MiB, 0 MiB, 0 %, 0 %, 28, 26.02 W, 250.00 W, 405 MHz, 405 MHz, 715 MHz, P0
Here the driver_version has two dots: 418.87.00
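To illustrate why a value with two dots needs care, here is a minimal Go sketch. It is not taken from the nvidiagpubeat source: `parseValue` is a hypothetical helper that tries an integer parse, then a float parse, and otherwise keeps the raw string. Under that assumption, `418.87.00` and `460.32.03` are not valid floats, so they stay strings.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseValue is a hypothetical helper (not nvidiagpubeat's actual code).
// It returns an int64 or float64 when the text is numeric, and the
// unchanged string otherwise.
func parseValue(raw string) interface{} {
	if i, err := strconv.ParseInt(raw, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(raw, 64); err == nil {
		return f
	}
	// "418.87.00" lands here: two dots make it invalid as a float.
	return raw
}

func main() {
	for _, raw := range []string{"28", "26.02", "418.87.00", "460.32.03"} {
		fmt.Printf("%q -> %T\n", raw, parseValue(raw))
	}
}
```

Running it reports `int64` for `28`, `float64` for `26.02`, and `string` for both driver versions.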
The resulting output from nvidiagpubeat is:
2021-01-16T21:02:27.389-0800 DEBUG [publish] pipeline/processor.go:308 Publish event: {
  "@timestamp": "2021-01-17T05:02:27.388Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "driver_version": "418.87.00",
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "temperature": {
    "gpu": 28
  },
  "pstate": 0,
  "power": {
    "draw": 26.02,
    "limit": 250
  },
  "gpu_serial": 1234567890123,
  "name": "Tesla100-PCIE-16GB",
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "index": 0,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpu_uuid": "GPU-xxx75xxx-xxxx-xxx-xxxx-1234567890ab",
  "host": {
    "name": "AA-ABC-11111111"
  },
  "gpu_bus_id": "00000000:08:00.0",
  "gpuIndex": 0,
  "memory": {
    "total": 16280,
    "used": 0
  },
  "count": 1,
  "clocks": {
    "sm": 405,
    "mem": 715,
    "gr": 405
  },
  "type": "nvidiagpubeat"
}
Each field correctly maps to the CSV output of the nvidia-smi command.
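The mapping from query fields to event keys can be sketched roughly as follows. This is an illustration inferred from the logs in this thread, not the beat's real implementation: `buildEvent`, `setNested` and `cleanValue` are hypothetical names, and the sketch assumes each query field is paired with the CSV column in the same position, dotted names become nested objects, and a numeric value loses its unit suffix. It does not reproduce every transformation visible in the logs (for example, `P0` becoming `0` or spaces being removed from the GPU name).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// cleanValue (hypothetical) drops a unit suffix such as " MiB" or " W"
// when the first token is numeric; otherwise it keeps the trimmed text.
func cleanValue(raw string) interface{} {
	raw = strings.TrimSpace(raw)
	token := raw
	if i := strings.IndexByte(raw, ' '); i >= 0 {
		token = raw[:i]
	}
	if n, err := strconv.ParseInt(token, 10, 64); err == nil {
		return n
	}
	if f, err := strconv.ParseFloat(token, 64); err == nil {
		return f
	}
	return raw // e.g. "418.87.00" or "[Not Supported]"
}

// setNested (hypothetical) stores v under a dotted path,
// so "power.draw" becomes {"power": {"draw": v}}.
func setNested(m map[string]interface{}, path []string, v interface{}) {
	for _, key := range path[:len(path)-1] {
		child, ok := m[key].(map[string]interface{})
		if !ok {
			child = map[string]interface{}{}
			m[key] = child
		}
		m = child
	}
	m[path[len(path)-1]] = v
}

// buildEvent (hypothetical) pairs query fields with CSV columns by position.
func buildEvent(query, row string) map[string]interface{} {
	fields := strings.Split(strings.TrimPrefix(query, "--query-gpu="), ",")
	values := strings.Split(row, ",")
	event := map[string]interface{}{}
	for i, field := range fields {
		if i >= len(values) {
			break // fewer columns than fields: stop rather than guess
		}
		setNested(event, strings.Split(field, "."), cleanValue(values[i]))
	}
	return event
}

func main() {
	// A shortened query and row, using values quoted in this thread.
	query := "--query-gpu=name,driver_version,memory.total,power.draw,power.limit"
	row := "Tesla P100-PCIE-16GB, 418.87.00, 16280 MiB, 26.02 W, 250.00 W"
	out, err := json.MarshalIndent(buildEvent(query, row), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

In this sketch `driver_version` stays a string because neither numeric parse accepts a value with two dots, while `memory.total` and `power.draw` become numbers nested under `memory` and `power`.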
Cheers Deepak
Hi @deepujain, here is the output of: nvidia-smi --query-gpu=driver_version,name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv
460.32.03, GeForce RTX 3090, 2, 0, 30 %, 24265 MiB, 10275 MiB, 0 %, 10 %, 28, 10.15 W, 350.00 W, 210 MHz, 210 MHz, 405 MHz, P8
460.32.03, GeForce RTX 3090, 2, 1, 30 %, 24268 MiB, 2846 MiB, 0 %, 0 %, 25, 8.39 W, 350.00 W, 0 MHz, 0 MHz, 405 MHz, P8
The driver version is the one installed by the Ubuntu Additional Drivers application. Would it be possible to test it with the same driver version, i.e., 460.32.03?
@aLeX1443 I took the output that you shared and fed it into nvidiagpubeat/nvidiasmilocal/localnvidiasmi.go. I ran nvidiagpubeat (master branch) in local mode and I am able to get the events published correctly.
Publish event: {
  "@timestamp": "2021-01-17T15:34:57.213Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "index": 0,
  "utilization": {
    "gpu": 0,
    "memory": 10
  },
  "temperature": {
    "gpu": 28
  },
  "host": {
    "name": "AA-ABC-11111111"
  },
  "gpuIndex": 0,
  "power": {
    "draw": 10.15,
    "limit": 350
  },
  "pstate": 8,
  "clocks": {
    "gr": 210,
    "sm": 210,
    "mem": 405
  },
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "driver_version": "460.32.03",
  "type": "nvidiagpubeat",
  "name": "GeForceRTX3090",
  "count": 2,
  "memory": {
    "total": 24265,
    "used": 10275
  },
  "fan": {
    "speed": 30
  }
}
2021-01-17T07:34:57.213-0800 DEBUG [publish] pipeline/processor.go:308 Publish event: {
  "@timestamp": "2021-01-17T15:34:57.213Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "power": {
    "draw": 8.39,
    "limit": 350
  },
  "gpuIndex": 1,
  "driver_version": "460.32.03",
  "name": "GeForceRTX3090",
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "type": "nvidiagpubeat",
  "count": 2,
  "pstate": 8,
  "fan": {
    "speed": 30
  },
  "index": 1,
  "temperature": {
    "gpu": 25
  },
  "host": {
    "name": "AA-ABC-11111111"
  },
  "clocks": {
    "gr": 0,
    "sm": 0,
    "mem": 405
  },
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "memory": {
    "total": 24268,
    "used": 2846
  }
}
What error do you see with nvidiagpubeat? Which branch of nvidiagpubeat are you using (master or withBeats7.3)?
I do not have the flexibility to modify the driver version of GPUs on the cluster.
Regarding the output that you shared in the issue description (https://github.com/eBay/nvidiagpubeat/issues/32#issue-785417368): I think nvidiagpubeat is able to parse a driver_version with multiple dots and create the event for ES to consume. However, it appears that ES was not able to ingest it.
I see this in the event that you shared.
"driver_version":"460.32.03"
I was using branch withBeats7.3. I'll test it out with master once I get the chance.
I found it to work with branch withBeats7.3
2021-01-17T09:07:00.947-0800 INFO nvidia/gpu.go:68 Running command localnvidiasmi for query: --query-gpu=driver_version,name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate with gpuCount 4
2021-01-17T09:07:01.099-0800 DEBUG [nvidiagpubeat] beater/nvidiagpubeat.go:77 Event generated, Attempting to publish to configured output.
2021-01-17T09:07:01.099-0800 DEBUG [processors] processing/processors.go:183 Publish event: {
  "@timestamp": "2021-01-17T17:07:01.100Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "_doc",
    "version": "7.3.3"
  },
  "pstate": 8,
  "host": {
    "name": "AA-ABC-11111111"
  },
  "count": 2,
  "memory": {
    "used": 2846,
    "total": 24268
  },
  "power": {
    "draw": 8.39,
    "limit": 350
  },
  "clocks": {
    "mem": 405,
    "gr": 0,
    "sm": 0
  },
  "driver_version": "460.32.03",
  "agent": {
    "ephemeral_id": "071268ff-b8b3-44c4-bbbd-b378a2d26707",
    "hostname": "AA-ABC-11111111",
    "id": "9ebd65ba-4f83-4772-b361-98415432dee4",
    "version": "7.3.3",
    "type": "nvidiagpubeat"
  },
  "name": "GeForceRTX3090",
  "index": 1,
  "fan": {
    "speed": 30
  },
  "ecs": {
    "version": "1.0.1"
  },
  "temperature": {
    "gpu": 25
  },
  "gpuIndex": 1,
  "type": "nvidiagpubeat",
  "utilization": {
    "memory": 0,
    "gpu": 0
  }
}
^C2021-01-17T09:07:01.601-0800 DEBUG [service] service/service.go:53 Received sigterm/sigint, stopping
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:149 client: closing acker
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":212,"time":{"ms":212}},"total":{"ticks":242,"time":{"ms":242},"value":242},"user":{"ticks":30,"time":{"ms":30}}},"info":{"ephemeral_id":"071268ff-b8b3-44c4-bbbd-b378a2d26707","uptime":{"ms":1685}},"memstats":{"gc_next":4194304,"memory_alloc":2049392,"memory_total":3658104,"rss":15740928},"runtime":{"goroutines":8}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":0,"events":{"active":2,"published":2,"total":2}}},"system":{"cpu":{"cores":8},"load":{"1":7.4155,"15":3.1431,"5":4.5801,"norm":{"1":0.9269,"15":0.3929,"5":0.5725}}}}}}
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:154 Uptime: 1.689151734s
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:131 Stopping metrics logging.
2021-01-17T09:07:01.606-0800 INFO instance/beat.go:432 nvidiagpubeat stopped.
I will wait for results from your testing with branch `withBeats7.3`
@aLeX1443 Did you get a chance to look into it?
I believe the current version is not compatible with NVIDIA driver version 460.32.03, because the version string contains two dots.
Please see the end of the line below: