eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and can ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
https://github.com/eBay/nvidiagpubeat
Apache License 2.0

I couldn't get the GPU name #27

Closed seungyongshim closed 4 years ago

seungyongshim commented 4 years ago

Excuse me.

Here is my config:

nvidiagpubeat:
  period: 30s
  query: "name,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  env: "production"
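
For readers unfamiliar with how a `query` string like the one above reaches nvidia-smi, here is a minimal Go sketch. It assumes the beat passes the field list to nvidia-smi's `--query-gpu` option with CSV output; the helper name `buildQueryArgs` and the exact `--format` options are illustrative assumptions, not taken from the nvidiagpubeat source.

```go
package main

import (
	"fmt"
	"strings"
)

// buildQueryArgs is a hypothetical helper: it turns the beat's `query`
// setting into nvidia-smi arguments. Whitespace around field names is
// trimmed so "name, power.draw" and "name,power.draw" behave the same.
func buildQueryArgs(query string) []string {
	fields := strings.Split(query, ",")
	for i, f := range fields {
		fields[i] = strings.TrimSpace(f)
	}
	return []string{
		"--query-gpu=" + strings.Join(fields, ","),
		// Assumption: CSV without header or units, so each cell is a bare value.
		"--format=csv,noheader,nounits",
	}
}

func main() {
	args := buildQueryArgs("name,driver_version,power.draw,power.limit")
	fmt.Println("nvidia-smi", strings.Join(args, " "))
}
```

Running the printed command by hand on the GPU host is a quick way to check which fields the driver actually reports before suspecting the beat.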

My result:

{
  "index": 3,
  "temperature": {
    "gpu": 33
  },
  "memory": {
    "total": 11172,
    "used": 10
  },
  "utilization": {
    "gpu": 0,
    "memory": 0
  },
  "clocks": {
    "current": {
      "graphics": 0,
      "sm": 0,
      "memory": 0
    }
  },
  "ecs": {
    "version": "1.5.0"
  },
  "power": {
    "limit": 0,
    "draw": 0
  },
  "gpuIndex": 3,
  "gpu_name": 0,
  },
  "count": 4,
  "fan": {
    "speed": 23
  },
  "driver_version": 0,
  "pstate": 8,
  "type": "nvidiagpubeat",
}

gpu_name is 0 and power.limit is 0. Other fields, such as driver_version and power.draw, also have an incorrect value of 0.

https://github.com/eBay/nvidiagpubeat/blob/a013ac43282923c49179b996fce9c6263b4a0acf/nvidia/gpu.go#L90
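
A symptom like `"gpu_name": 0` is what you would see if every value were converted to a number and the conversion error were ignored, so text such as a GPU name or a driver version falls back to 0. Whether that is what the linked line does is an assumption on my part. Below is a minimal Go sketch of a more tolerant conversion; `parseValue` is a hypothetical helper, not the project's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseValue is a hypothetical helper: it converts one CSV cell to an
// int64 or float64 when the text is numeric, and otherwise keeps the
// trimmed string instead of collapsing it to 0.
func parseValue(raw string) interface{} {
	s := strings.TrimSpace(raw)
	if i, err := strconv.ParseInt(s, 10, 64); err == nil {
		return i
	}
	if f, err := strconv.ParseFloat(s, 64); err == nil {
		return f
	}
	// Not numeric (e.g. a GPU name, "418.87.01", "[NotSupported]").
	return s
}

func main() {
	for _, cell := range []string{"Tesla100-PCIE-16GB", "418.87.01", "24.8", "250", "[NotSupported]"} {
		v := parseValue(cell)
		fmt.Printf("%q -> %v (%T)\n", cell, v, v)
	}
}
```

The trade-off is that a field can change type between events (for example a number on one card and `[NotSupported]` on another), which may matter for the Elasticsearch mapping.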

Please fix it. My boss is going to be happy.

Best regards.

deepujain commented 4 years ago

@seungyongshim
Thank you for using nvidiagpubeat and reporting the issue. I have fixed it and verified the fix both locally and on an NVIDIA GPU machine. The fix is available on both the master (Beats 6.x) and withBeats7.3 (Beats 7.x) branches. Please verify.

The config used for the test below is:

  query: "name,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
  env: "test"

Snippet of the result with Beats 6.x, running locally on macOS:

export PATH=$PATH:.
./nvidiagpubeat -c nvidiagpubeat.yml -e -d "*" -E seccomp.enabled=false

2020-05-30T21:56:19.664-0700  INFO  instance/beat.go:278  Setup Beat: nvidiagpubeat; Version: 6.5.5
2020-05-30T21:56:19.664-0700  DEBUG [beat]  instance/beat.go:299  Initializing output plugins
2020-05-30T21:56:19.664-0700  DEBUG [processors]  processors/processor.go:66  Processors:
2020-05-30T21:56:19.664-0700  INFO  elasticsearch/client.go:163 Elasticsearch url: http://localhost:9200
2020-05-30T21:56:19.665-0700  DEBUG [publish] pipeline/consumer.go:137  start pipeline event consumer
2020-05-30T21:56:19.666-0700  INFO  [publisher] pipeline/module.go:110  Beat name: LOCAL_HOST_NAME
2020-05-30T21:56:19.666-0700  INFO  [monitoring]  log/log.go:117  Starting metrics logging every 30s
2020-05-30T21:56:19.666-0700  INFO  instance/beat.go:400  nvidiagpubeat start running.
2020-05-30T21:56:19.666-0700  INFO  beater/nvidiagpubeat.go:57  nvidiagpubeat is running for ** test ** environment. ! Hit CTRL-C to stop it.
2020-05-30T21:56:20.667-0700  DEBUG [nvidiagpubeat] nvidia/metrics.go:43  Determine number of gpu cards.
2020-05-30T21:56:20.667-0700  DEBUG [nvidiagpubeat] nvidia/metrics.go:49  Number of gpu cards 4.
2020-05-30T21:56:20.667-0700  INFO  nvidia/gpu.go:57  Running query:  name,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate  with gpuCount 4
2020-05-30T21:56:20.676-0700  DEBUG [nvidiagpubeat] beater/nvidiagpubeat.go:77  Event generated, Attempting to publish to configured output.
2020-05-30T21:56:20.676-0700  DEBUG [publish] pipeline/processor.go:308 Publish event: {
  "@timestamp": "2020-05-31T04:56:20.676Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "pstate": 0,
  "type": "nvidiagpubeat",
  "fan": {
    "speed": "[NotSupported]"
  },
  "name": "Tesla100-PCIE-16GB",
  "memory": {
    "total": 16280,
    "used": 1628
  },
  "power": {
    "draw": 24.8,
    "limit": 250
  },
  "driver_version": "418.87.01",
  "utilization": {
    "gpu": 10,
    "memory": 10
  },
  "gpuIndex": 0,
  "host": {
    "name": "LOCAL_HOST_NAME"
  },
  "count": 4,
  "temperature": {
    "gpu": 25
  },
  "clocks": {
    "sm": 405,
    "mem": 715,
    "gr": 405
  },
  "index": 0,
  "beat": {
    "name": "LOCAL_HOST_NAME",
    "hostname": "LOCAL_HOST_NAME",
    "version": "6.5.5"
  }
}
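
In the event above, dotted query fields such as `power.draw` and `fan.speed` appear as nested objects (`"power": {"draw": ...}`). A minimal Go sketch of that mapping follows; `setNested` is a hypothetical helper, and this thread does not say whether the beat does this itself or relies on a Beats library type.

```go
package main

import (
	"fmt"
	"strings"
)

// setNested is a hypothetical helper: it stores value under a dotted
// key, creating intermediate maps as needed. If an intermediate key
// already holds a non-map value, that value is replaced by a new map.
func setNested(event map[string]interface{}, key string, value interface{}) {
	parts := strings.Split(key, ".")
	current := event
	for _, p := range parts[:len(parts)-1] {
		next, ok := current[p].(map[string]interface{})
		if !ok {
			next = map[string]interface{}{}
			current[p] = next
		}
		current = next
	}
	current[parts[len(parts)-1]] = value
}

func main() {
	event := map[string]interface{}{}
	setNested(event, "power.draw", 24.8)
	setNested(event, "power.limit", 250)
	setNested(event, "fan.speed", "[NotSupported]")
	setNested(event, "count", 4)
	fmt.Println(event)
}
```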

deepujain commented 4 years ago

@seungyongshim

seungyongshim commented 4 years ago

Thanks.

My R&R (role and responsibilities) has changed, but my ex-boss will be happy thanks to you.