eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
https://github.com/eBay/nvidiagpubeat
Apache License 2.0
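For context, the beat's collection model is to shell out to nvidia-smi in CSV query mode and turn each output row into an event. Below is a minimal Go sketch of that flow; the query fields and parsing are illustrative assumptions, not the beat's actual code.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Minimal sketch of the collection flow: run nvidia-smi in CSV query
// mode and split each output row into fields. The query list below is
// an assumption for illustration; the real beat builds its query from
// configuration.
func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=name,utilization.gpu,memory.used,temperature.gpu",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		fmt.Println("nvidia-smi failed:", err)
		return
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Split(line, ", ")
		fmt.Println(fields) // one slice of metric values per GPU
	}
}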

Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed #37

Closed · anaconda2196 closed this issue 2 years ago

anaconda2196 commented 3 years ago

On the GPU machine, I have installed the NVIDIA driver successfully.

OS: SLES 15 SP2
Kubernetes version: v1.18.6
Container runtime: containerd

nvidia-smi
Thu Sep 23 14:19:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:D8:00.0 Off |                  N/A |
| 46%   32C    P8     5W / 105W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nerdctl run --rm --gpus all docker.io/nvidia/cuda:11.0-base nvidia-smi
Thu Sep 23 21:19:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:D8:00.0 Off |                  N/A |
| 46%   32C    P8     5W / 105W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
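Since the beat shells out to nvidia-smi, it is worth confirming that the binary is visible from inside the nvidiagpubeat pod itself, not only on the host and in a CUDA test container. A small standalone Go check (a hypothetical helper, not part of the beat):

package main

import (
	"fmt"
	"os/exec"
)

// Standalone check: can this container resolve and run nvidia-smi?
// If LookPath fails here, the beat's own exec call will fail too.
func main() {
	path, err := exec.LookPath("nvidia-smi")
	if err != nil {
		fmt.Println("nvidia-smi not on PATH:", err)
		return
	}
	fmt.Println("found nvidia-smi at", path)
	out, err := exec.Command(path, "-L").CombinedOutput() // -L lists GPUs
	fmt.Printf("%s(err: %v)\n", out, err)
}

If this fails inside the pod, the fix likely lies in the image or the pod spec (mounting the driver binaries into the container) rather than in the beat configuration.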

The nvidiagpubeat pod is running, but it is not collecting metrics for that GPU machine.

kubectl -n kube-system logs nvidiagpubeat-qmsp6

2021-09-23T21:12:55.185Z    INFO    instance/beat.go:592    Home path: [/usr/share/nvidiagpubeat] Config path: [/usr/share/nvidiagpubeat/keystore] Data path: [/usr/share/nvidiagpubeat/data] Logs path: [/usr/share/nvidiagpubeat/logs]
2021-09-23T21:12:55.185Z    INFO    instance/beat.go:599    Beat UUID: 246846e6-6a14-47ec-94fb-d74ce0825d73
2021-09-23T21:12:55.185Z    INFO    [beat]  instance/beat.go:825    Beat info   {"system_info": {"beat": {"path": {"config": "/usr/share/nvidiagpubeat/keystore", "data": "/usr/share/nvidiagpubeat/data", "home": "/usr/share/nvidiagpubeat", "logs": "/usr/share/nvidiagpubeat/logs"}, "type": "nvidiagpubeat", "uuid": "246846e6-6a14-47ec-94fb-d74ce0825d73"}}}
2021-09-23T21:12:55.185Z    INFO    [beat]  instance/beat.go:834    Build info  {"system_info": {"build": {"commit": "unknown", "libbeat": "6.5.5", "time": "1754-08-30T22:43:41.128Z", "version": "6.5.5"}}}
2021-09-23T21:12:55.185Z    INFO    [beat]  instance/beat.go:837    Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":48,"version":"go1.12.5"}}}
2021-09-23T21:12:55.188Z    INFO    [beat]  instance/beat.go:841    Host info   {"system_info": {"host": {"architecture":"x86_64","boot_time":"2021-09-23T19:10:10Z","containerized":true,"name":"MY_VM","ip":["127.0.0.1/8","16.0.9.206/23","10.4.0.1/24","10.192.0.128/32"],"kernel_version":"5.3.18-22-default","mac":["b8:83:03:6c:87:58","b8:83:03:6c:87:59","86:53:55:aa:03:41","ee:ee:ee:ee:ee:ee","ee:ee:ee:ee:ee:ee","06:1e:08:04:ee:33","ee:ee:ee:ee:ee:ee"],"os":{"family":"redhat","platform":"centos","name":"CentOS Linux","version":"7 (Core)","major":7,"minor":9,"patch":2009,"codename":"Core"},"timezone":"UTC","timezone_offset_sec":0,"id":"6e5dfda404614386a9ad9543eec57cf0"}}}
2021-09-23T21:12:55.188Z    INFO    [beat]  instance/beat.go:870    Process info    {"system_info": {"process": {"capabilities": {"inheritable":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"permitted":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"effective":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"bounding":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"ambient":null}, "cwd": "/usr/share/nvidiagpubeat", "exe": "/usr/share/nvidiagpubeat/nvidiagpubeat", "name": "nvidiagpubeat", "pid": 1, "ppid": 0, "seccomp": {"mode":"disabled","no_new_privs":false}, "start_time": "2021-09-23T21:12:54.360Z"}}}
2021-09-23T21:12:55.189Z    INFO    instance/beat.go:278    Setup Beat: nvidiagpubeat; Version: 6.5.5
2021-09-23T21:12:55.189Z    INFO    elasticsearch/client.go:163 Elasticsearch url: https://16.0.14.117:9210
2021-09-23T21:12:55.189Z    INFO    [publisher] pipeline/module.go:110  Beat name: MY_VM
2021-09-23T21:12:55.189Z    INFO    instance/beat.go:400    nvidiagpubeat start running.
2021-09-23T21:12:55.189Z    INFO    [monitoring]    log/log.go:117  Starting metrics logging every 30s
2021-09-23T21:12:55.189Z    INFO    beater/nvidiagpubeat.go:57  nvidiagpubeat is running for ** production ** environment. ! Hit CTRL-C to stop it.
2021-09-23T21:13:25.191Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":0,"time":{"ms":8}},"total":{"ticks":30,"time":{"ms":46},"value":30},"user":{"ticks":30,"time":{"ms":38}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":30021}},"memstats":{"gc_next":4194304,"memory_alloc":2107976,"memory_total":3747368,"rss":23207936}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"cpu":{"cores":48},"load":{"1":0,"15":0.07,"5":0.07,"norm":{"1":0,"15":0.0015,"5":0.0015}}}}}}
2021-09-23T21:13:25.193Z    ERROR   beater/nvidiagpubeat.go:75  Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:13:55.191Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":0,"time":{"ms":1}},"total":{"ticks":40,"time":{"ms":4},"value":40},"user":{"ticks":40,"time":{"ms":3}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":60021}},"memstats":{"gc_next":4194304,"memory_alloc":2472080,"memory_total":4111472}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.07,"5":0.06,"norm":{"1":0,"15":0.0015,"5":0.0013}}}}}}
2021-09-23T21:13:55.193Z    ERROR   beater/nvidiagpubeat.go:75  Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:14:25.191Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":0},"total":{"ticks":40,"time":{"ms":3},"value":40},"user":{"ticks":40,"time":{"ms":3}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":90022}},"memstats":{"gc_next":4194304,"memory_alloc":2823136,"memory_total":4462528}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.07,"5":0.05,"norm":{"1":0,"15":0.0015,"5":0.001}}}}}}
2021-09-23T21:14:25.193Z    ERROR   beater/nvidiagpubeat.go:75  Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:14:55.191Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10,"time":{"ms":2}},"total":{"ticks":60,"time":{"ms":8},"value":60},"user":{"ticks":50,"time":{"ms":6}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":120022}},"memstats":{"gc_next":4194304,"memory_alloc":1767520,"memory_total":4950528,"rss":303104}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.06,"5":0.05,"norm":{"1":0,"15":0.0013,"5":0.001}}}}}}
2021-09-23T21:14:55.193Z    ERROR   beater/nvidiagpubeat.go:75  Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2021-09-23T21:15:25.191Z    INFO    [monitoring]    log/log.go:144  Non-zero metrics in the last 30s    {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":10},"total":{"ticks":60,"time":{"ms":3},"value":60},"user":{"ticks":50,"time":{"ms":3}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"94a47230-67d7-4616-8d91-b51067182931","uptime":{"ms":150021}},"memstats":{"gc_next":4194304,"memory_alloc":2119320,"memory_total":5302328}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0,"15":0.06,"5":0.04,"norm":{"1":0,"15":0.0013,"5":0.0008}}}}}}
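For reference, this error string appears to come from Go's os/exec package: Wait closes a command's stdout pipe, so any read attempted after Wait fails with exactly this message. A minimal standalone reproduction follows (the command is arbitrary; the beat's real invocation is nvidia-smi, and this is a sketch of the pitfall, not the beat's actual code path):

package main

import (
	"fmt"
	"io"
	"os/exec"
)

// Reproduces "read |0: file already closed": Wait closes the stdout
// pipe, so a read issued after Wait returns exactly this error. The
// command here is arbitrary; the beat's real invocation is nvidia-smi.
func main() {
	cmd := exec.Command("echo", "hello")
	stdout, _ := cmd.StdoutPipe()
	_ = cmd.Start()
	_ = cmd.Wait() // closes stdout before anyone reads it

	_, err := io.ReadAll(stdout)
	fmt.Println(err) // read |0: file already closed
}

If the beat's fetch path waits on the nvidia-smi process before draining the pipe, or a timeout kills the command mid-read, this would be the message that surfaces.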

Thanks in advance.

anaconda2196 commented 3 years ago

@deepujain Is there something I am missing in my configuration on my end?