eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and can ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
https://github.com/eBay/nvidiagpubeat
Apache License 2.0

Incompatible with driver version 460.32.03 (because of the two dots) #32


aLeX1443 commented 3 years ago

I believe the current version is not compatible with NVIDIA driver version 460.32.03, because the version string contains two dots.

Please see the end of the line below:

2021-01-13T20:22:02.494Z    WARN    elasticsearch/client.go:535 Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbff7f37a9d203e60, ext:32072702120, loc:(*time.Location)(0x2090e20)}, Meta:common.MapStr(nil), Fields:common.MapStr{"agent":common.MapStr{"ephemeral_id":"fd12b93b-9db1-4e24-9e0d-747229464c00", "hostname":"dca939cfb9c2", "id":"33c198bd-3989-4a66-9683-fe258efbe53b", "type":"nvidiagpubeat", "version":"7.3.3"}, "clocks":common.MapStr{"gr":0, "mem":405, "sm":0}, "count":2, "driver_version":"460.32.03", "ecs":common.MapStr{"version":"1.0.1"}, "fan":common.MapStr{"speed":30}, "gpuIndex":1, "host":common.MapStr{"name":"dca939cfb9c2"}, "index":1, "memory":common.MapStr{"total":24268, "used":5942}, "name":"GeForceRTX3090", "power":common.MapStr{"draw":7.14, "limit":350}, "pstate":8, "temperature":common.MapStr{"gpu":28}, "type":"nvidiagpubeat", "utilization":common.MapStr{"gpu":0, "memory":0}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x0} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [driver_version] of type [float] in document with id 'tvFp_XYBFpUHIuKQj2b6'. Preview of field's value: '460.32.03'","caused_by":{"type":"number_format_exception","reason":"multiple points"}}
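
For context, the parse failure here is on the Elasticsearch side rather than in the beat: the target index has driver_version mapped as a numeric field (type [float]), so a string with two dots such as 460.32.03 cannot be coerced into a number. A quick way to confirm the current mapping is a field-mapping lookup; this is only a sketch, and the index pattern nvidiagpubeat-* is an assumption (replace it with whatever index the beat actually writes to):

# Hypothetical index pattern; adjust to the beat's real index name.
curl -s "localhost:9200/nvidiagpubeat-*/_mapping/field/driver_version?pretty"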
aLeX1443 commented 3 years ago

A temporary workaround seems to be to remove the driver_version from the configuration file, i.e.,

nvidiagpubeat:
  period: 1s
  query: "--query-gpu=name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  env: "dev"

output.elasticsearch:
  hosts: "${ELASTICSEARCH_HOSTS}"
deepujain commented 3 years ago

Hello Alex, thank you for raising the issue along with the workaround.

Please share the output of the query below, which includes driver_version, on your NVIDIA GPU.

nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate

This way, I can re-create the issue with nvidiagpubeat/nvidiasmilocal/localnvidiasmi.go and provide a fix.

Cheers Deepak

deepujain commented 3 years ago

Hello Alex, @aLeX1443

I am unable to re-create the issue.

nvidiagpubeat.yml has driver_version and looks like this:

nvidiagpubeat:
  # Defines how often an event is sent to the output
  period: 1s
  # By default the query of type query-gpu is executed to support backward compatibility
  # query: "name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  # A generic version of query is supported by nvidiagpubeat for query options like --query-gpu,--query-compute-apps and others.
  # --query-gpu will provide information about the GPU.
  query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  # --query-compute-apps will list currently active compute processes.
  # query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"
  env: "test"
  # env can be test or production. test is for test purposes to evaluate functionality of this beat. Switch to production

The output of the above query on my real GPU is:

nvidia-smi --query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv
name, pci.bus_id, serial, uuid, driver_version, count, index, fan.speed [%], memory.total [MiB], memory.used [MiB], utilization.gpu [%], utilization.memory [%], temperature.gpu, power.draw [W], power.limit [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.memory [MHz], pstate

Tesla P100-PCIE-16GB, 00000000:08:00.0, 1234567890123, GPU-xxx75xxx-xxxx-xxx-xxxx-1234567890ab, 418.87.00, 1, 0, [Not Supported], 16280 MiB, 0 MiB, 0 %, 0 %, 28, 26.02 W, 250.00 W, 405 MHz, 405 MHz, 715 MHz, P0

Here the driver_version has two dots: 418.87.00.

The resulting output from nvidiagpubeat is:

2021-01-16T21:02:27.389-0800    DEBUG   [publish]   pipeline/processor.go:308   Publish event: {
  "@timestamp": "2021-01-17T05:02:27.388Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "driver_version": "418.87.00",
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "temperature": {
    "gpu": 28
  },
  "pstate": 0,
  "power": {
    "draw": 26.02,
    "limit": 250
  },
  "gpu_serial": 1234567890123,
  "name": "Tesla100-PCIE-16GB",
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "index": 0,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpu_uuid": "GPU-xxx75xxx-xxxx-xxx-xxxx-1234567890ab",
  "host": {
    "name": "AA-ABC-11111111"
  },
  "gpu_bus_id": "00000000:08:00.0",
  "gpuIndex": 0,
  "memory": {
    "total": 16280,
    "used": 0
  },
  "count": 1,
  "clocks": {
    "sm": 405,
    "mem": 715,
    "gr": 405
  },
  "type": "nvidiagpubeat"
}

Each field correctly maps to the CSV output from the nvidia-smi command.

Cheers Deepak

aLeX1443 commented 3 years ago

Hi @deepujain, here is the output of: nvidia-smi --query-gpu=driver_version,name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate --format=csv

460.32.03, GeForce RTX 3090, 2, 0, 30 %, 24265 MiB, 10275 MiB, 0 %, 10 %, 28, 10.15 W, 350.00 W, 210 MHz, 210 MHz, 405 MHz, P8
460.32.03, GeForce RTX 3090, 2, 1, 30 %, 24268 MiB, 2846 MiB, 0 %, 0 %, 25, 8.39 W, 350.00 W, 0 MHz, 0 MHz, 405 MHz, P8

The driver version is the one installed by the Ubuntu Additional Drivers application. Would it be possible to test it with the same driver version, i.e., 460.32.03?

deepujain commented 3 years ago

@aLeX1443 I used the output that you shared and ingested it into nvidiagpubeat/nvidiasmilocal/localnvidiasmi.go. I ran nvidiagpubeat (master branch) in local mode, and I am able to get the events published correctly.

Publish event: {
  "@timestamp": "2021-01-17T15:34:57.213Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "index": 0,
  "utilization": {
    "gpu": 0,
    "memory": 10
  },
  "temperature": {
    "gpu": 28
  },
  "host": {
    "name": "AA-ABC-11111111"
  },
  "gpuIndex": 0,
  "power": {
    "draw": 10.15,
    "limit": 350
  },
  "pstate": 8,
  "clocks": {
    "gr": 210,
    "sm": 210,
    "mem": 405
  },
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "driver_version": "460.32.03",
  "type": "nvidiagpubeat",
  "name": "GeForceRTX3090",
  "count": 2,
  "memory": {
    "total": 24265,
    "used": 10275
  },
  "fan": {
    "speed": 30
  }
}
2021-01-17T07:34:57.213-0800    DEBUG   [publish]   pipeline/processor.go:308   Publish event: {
  "@timestamp": "2021-01-17T15:34:57.213Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "power": {
    "draw": 8.39,
    "limit": 350
  },
  "gpuIndex": 1,
  "driver_version": "460.32.03",
  "name": "GeForceRTX3090",
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "type": "nvidiagpubeat",
  "count": 2,
  "pstate": 8,
  "fan": {
    "speed": 30
  },
  "index": 1,
  "temperature": {
    "gpu": 25
  },
  "host": {
    "name": "AA-ABC-11111111"
  },
  "clocks": {
    "gr": 0,
    "sm": 0,
    "mem": 405
  },
  "beat": {
    "name": "AA-ABC-11111111",
    "hostname": "AA-ABC-11111111",
    "version": "6.5.5"
  },
  "memory": {
    "total": 24268,
    "used": 2846
  }
}

What error do you see with nvidiagpubeat? What branch are you using with nvidiagpubeat (master or withBeats7.3)?

deepujain commented 3 years ago

I do not have the flexibility to modify the driver version of GPUs on the cluster.

deepujain commented 3 years ago

Regarding the output that you shared here https://github.com/eBay/nvidiagpubeat/issues/32#issue-785417368 (the issue description): I think nvidiagpubeat is able to parse the driver_version with multiple points and create the event for ES to consume. However, it appears ES was not able to ingest it.

I see this in the event that you shared: "driver_version":"460.32.03"
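
For what it's worth, the "multiple points" rejection can be reproduced without the beat at all: once an index has driver_version typed as float, any string value containing two dots is refused. A minimal hypothetical repro against a throwaway index on 7.x (the index name driver_version_repro is made up for illustration):

# First document is a plain number, so dynamic mapping types the field as float.
curl -X PUT "localhost:9200/driver_version_repro/_doc/1" \
  -H 'Content-Type: application/json' -d'{"driver_version": 418.87}'
# Second document is then rejected with number_format_exception: "multiple points".
curl -X PUT "localhost:9200/driver_version_repro/_doc/2" \
  -H 'Content-Type: application/json' -d'{"driver_version": "460.32.03"}'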

aLeX1443 commented 3 years ago

I was using branch withBeats7.3. I'll test it out with master once I get the chance.

deepujain commented 3 years ago

I found it to work with branch withBeats7.3:

2021-01-17T09:07:00.947-0800    INFO    nvidia/gpu.go:68    Running command localnvidiasmi for query:  --query-gpu=driver_version,name,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate  with gpuCount 4
2021-01-17T09:07:01.099-0800    DEBUG   [nvidiagpubeat] beater/nvidiagpubeat.go:77  Event generated, Attempting to publish to configured output.
2021-01-17T09:07:01.099-0800    DEBUG   [processors]    processing/processors.go:183    Publish event: {
  "@timestamp": "2021-01-17T17:07:01.100Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "_doc",
    "version": "7.3.3"
  },
  "pstate": 8,
  "host": {
    "name": "AA-ABC-11111111"
  },
  "count": 2,
  "memory": {
    "used": 2846,
    "total": 24268
  },
  "power": {
    "draw": 8.39,
    "limit": 350
  },
  "clocks": {
    "mem": 405,
    "gr": 0,
    "sm": 0
  },
  "driver_version": "460.32.03",
  "agent": {
    "ephemeral_id": "071268ff-b8b3-44c4-bbbd-b378a2d26707",
    "hostname": "AA-ABC-11111111",
    "id": "9ebd65ba-4f83-4772-b361-98415432dee4",
    "version": "7.3.3",
    "type": "nvidiagpubeat"
  },
  "name": "GeForceRTX3090",
  "index": 1,
  "fan": {
    "speed": 30
  },
  "ecs": {
    "version": "1.0.1"
  },
  "temperature": {
    "gpu": 25
  },
  "gpuIndex": 1,
  "type": "nvidiagpubeat",
  "utilization": {
    "memory": 0,
    "gpu": 0
  }
}

^C2021-01-17T09:07:01.601-0800 DEBUG [service] service/service.go:53 Received sigterm/sigint, stopping
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:149 client: closing acker
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:151 client: done closing acker
2021-01-17T09:07:01.602-0800 DEBUG [publisher] pipeline/client.go:155 client: cancelled 0 events
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:153 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":212,"time":{"ms":212}},"total":{"ticks":242,"time":{"ms":242},"value":242},"user":{"ticks":30,"time":{"ms":30}}},"info":{"ephemeral_id":"071268ff-b8b3-44c4-bbbd-b378a2d26707","uptime":{"ms":1685}},"memstats":{"gc_next":4194304,"memory_alloc":2049392,"memory_total":3658104,"rss":15740928},"runtime":{"goroutines":8}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":0,"events":{"active":2,"published":2,"total":2}}},"system":{"cpu":{"cores":8},"load":{"1":7.4155,"15":3.1431,"5":4.5801,"norm":{"1":0.9269,"15":0.3929,"5":0.5725}}}}}}
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:154 Uptime: 1.689151734s
2021-01-17T09:07:01.606-0800 INFO [monitoring] log/log.go:131 Stopping metrics logging.
2021-01-17T09:07:01.606-0800 INFO instance/beat.go:432 nvidiagpubeat stopped.



I will wait for the results of your testing with branch `withBeats7.3`.

deepujain commented 2 years ago

@aLeX1443 Did you get a chance to look into it?