eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and ingest their metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
https://github.com/eBay/nvidiagpubeat
Apache License 2.0
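
For context, the kind of nvidia-smi selective query the beat wraps looks roughly like the following; the field list here is shortened for illustration and is not the full set the beat queries:

nvidia-smi --query-gpu=name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv,noheader,nounits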

Can nvidiagpubeat be made to also export the process running on each card? #29

Open musiczhzhao opened 3 years ago

musiczhzhao commented 3 years ago

Since nvidiagpubeat is based on nvidia-smi, and nvidia-smi is able to list the processes that are currently using the GPU cards, in theory nvidiagpubeat should be able to export the process info as metrics. Please correct me if I am wrong.
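
For example, a selective query along these lines lists the active compute processes (the field names come from nvidia-smi's own query options; which fields the beat should expose is an open question):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv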

I am interested to know whether there is any plan to do this. It would be very helpful for identifying the GPU resource usage of individual processes and the efficiency of their code.

All the best.

deepujain commented 3 years ago

@musiczhzhao Yes, it can. I had a piece of code for it. I will try and integrate into nvidiagpubeat.

musiczhzhao commented 3 years ago

@deepujain Thank you! 👍

musiczhzhao commented 3 years ago

Hi @deepujain, how are things going? Just checking in to see whether there is any update, and whether any help is needed. Best

deepujain commented 3 years ago

The changes are ready. However, I lost access to my GPU cluster, so testing the changes has become a challenge and created a dependency on others for testing. Here is a sample.

The --query-gpu option will cause nvidiagpubeat to generate the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:27:16.080Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "driver_version": "418.87.01",
  "index": 3,
  "gpu_serial": 3.20218176911e+11,
  "memory": {
    "used": 3256,
    "total": 16280
  },
  "name": "Tesla100-PCIE-16GB",
  "host": {
    "name": "AB-SJC-11111111"
  },
  "utilization": {
    "memory": 50,
    "gpu": 50
  },
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "pstate": 0,
  "gpu_bus_id": "00000000:19:00.0",
  "count": 4,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpuIndex": 3,
  "power": {
    "draw": 25.28,
    "limit": 250
  },
  "temperature": {
    "gpu": 24
  },
  "clocks": {
    "gr": 405,
    "sm": 405,
    "mem": 715
  }
}

The --query-compute-apps option will cause nvidiagpubeat to generate the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:29:53.633Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "pid": 222414,
  "process_name": "python",
  "used_gpu_memory": 10,
  "gpu_bus_id": "00000000:19:00.0",
  "gpu_serial": 3.20218176911e+11,
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "gpu_name": "Tesla100-PCIE-16GB",
  "used_memory": 15,
  "gpuIndex": 3,
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "host": {
    "name": "LM-SJC-11004865"
  }
}

deepujain commented 3 years ago

@musiczhzhao I made the changes to nvidiagpubeat to support process detail information and made the query handling generic in the process. Please test and share the results here (including a few sample events) for --query-compute-apps (active GPU process details).

It can now support all types of queries, since the query handling is generic. I have tested only --query-gpu and --query-compute-apps; if you plan to use other options, let me know and you can help me with testing.

nvidia-smi -h

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu=                Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks=   List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps=       List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps=     List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
    --query-retired-pages=      List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.

https://github.com/eBay/nvidiagpubeat#sample-event has details.

deepujain commented 3 years ago

@musiczhzhao

musiczhzhao commented 3 years ago

Hi @deepujain, Thank you! I will test it and get back to you ASAP. 👍

Best

musiczhzhao commented 3 years ago

Hello @deepujain,

Happy weekend!

I have briefly tested the new version and can confirm that it exports the application name and GPU memory usage of each application when --query-compute-apps is used.

One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both kinds of documents are exported. I tried to enable both in the configuration file, but it turned out that only the latter one took effect.

For example, with the following in the configuration, it seems to export only the compute app metrics:


# --query-gpu will provide information about GPU.
query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"

# --query-compute-apps will list currently active compute processes.
query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"

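One possible stopgap, not confirmed anywhere in this thread, would be to run two instances of the beat with separate configuration files and data paths (assuming the standard Beats -c and --path.data flags apply; the file names below are illustrative):

./nvidiagpubeat -c nvidiagpubeat-gpu.yml --path.data data-gpu
./nvidiagpubeat -c nvidiagpubeat-apps.yml --path.data data-apps

Here nvidiagpubeat-gpu.yml would carry the --query-gpu query and nvidiagpubeat-apps.yml the --query-compute-apps query.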

Another question: we find it useful to have the full command line of the app. For example, if a Python script is launched with python, the current nvidia-smi output will just show the app as python, without the actual script name and arguments. Searching around online, we found that what people generally do is first get the PID of the application and then get the full command from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username). Can we have this built in, so it reports the cmd just as metricbeat does?
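
For reference, that manual approach boils down to taking a PID reported by --query-compute-apps and resolving it with ps; the PID below is just the one from the sample event earlier in this thread:

# resolve the full command line for a PID reported by nvidia-smi
ps -o args= -p 222414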

Best, Zhao

deepujain commented 3 years ago

Hello Zhao,

Thank you for testing it out. Please share sample events for both queries, --query-compute-apps and --query-gpu. That will help me update the documentation with real events. I can then close this issue, as the current code seems to have met the expectation of issue #29.

Could you please raise a separate GitHub issue for each new feature request.

  1. A way to enable both --query-gpu and --query-compute-apps so that both kinds of documents are exported (enabling both in the configuration file currently makes only the latter one take effect). Please share expected sample events for a combined query "--query-compute-apps-and--query-gpu".

  2. An enriched version of "--query-compute-apps" that gets additional details of the process. The general approach people use is to first get the PID of the application and then get the full command line from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username).

Cheers Deepak

musiczhzhao commented 3 years ago

Hi @deepujain,

I did a bit more testing which took some time.

Another issue we found is that the new version seems to assume there is only one app running on each GPU card, in other words that nvidia-smi returns only 4 process rows when there are 4 GPU cards on a machine. Otherwise it crashes with the following error message.

2021-01-26T12:00:20.226-0600 INFO runtime/panic.go:975 nvidiagpubeat stopped. 2021-01-26T12:00:20.259-0600 FATAL [nvidiagpubeat] instance/beat.go:154 Failed due to panic. {"panic": "runtime error: index out of range [4] with length 4", "stack": "github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run.func1.1\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:155\nruntime.gopanic\n\t/s0/Compilers/go/go1.14.6/src/runtime/panic.go:969\nruntime.goPanicIndex\n\t/s0/Compilers/go/go1.14.6/src/runtime/panic.go:88\ngithub.com/ebay/nvidiagpubeat/nvidia.Utilization.run\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/gpu.go:122\ngithub.com/ebay/nvidiagpubeat/nvidia.Metrics.Get\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/metrics.go:52\ngithub.com/ebay/nvidiagpubeat/beater.(*Nvidiagpubeat).Run\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/beater/nvidiagpubeat.go:73\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance. ...

The code allocating the events is at line 71 of nvidia/gpu.go: events := make([]common.MapStr, gpuCount, 2*gpuCount)
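
For illustration, a minimal sketch of how that allocation could be decoupled from the GPU count by growing the slice per output line; this is not the project's actual fix, and collectEvents/parseLine are illustrative names rather than functions that exist in gpu.go:

package nvidia

import (
    "bufio"
    "strings"

    "github.com/elastic/beats/libbeat/common"
)

// collectEvents builds one event per non-empty line of nvidia-smi output,
// growing the slice with append so that having more compute processes than
// GPUs can no longer trigger an index-out-of-range panic.
// parseLine is an assumed helper that maps one CSV record to the fields of
// the configured query.
func collectEvents(output string, parseLine func(string) common.MapStr) []common.MapStr {
    events := make([]common.MapStr, 0)
    scanner := bufio.NewScanner(strings.NewReader(output))
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if line == "" {
            continue
        }
        events = append(events, parseLine(line))
    }
    return events
}

Whether the rest of gpu.go can consume a variable-length slice this way would need to be checked against the current code.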

I will attach the sample events in a separate post.

Best, Zhao