eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and can ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
https://github.com/eBay/nvidiagpubeat
Apache License 2.0

error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed #24

Closed NAshwinKumar closed 5 years ago

NAshwinKumar commented 5 years ago

Can someone help me solve this issue?

deepujain commented 5 years ago

  1. Full stack trace and logs when you run the nvidiagpubeat executable. (Enclose them in three backticks, otherwise the logs will be unreadable.)
  2. Which branch did you build?
  3. Details of your environment.
  4. The output of the nvidia-smi command.
  5. What is the output of ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l
  6. What is the output of nvidiagpubeat --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,temperature.gpu,pstate --format=csv

NAshwinKumar commented 5 years ago

1) Full stack trace

2019-09-04T21:08:38.188+0530    INFO    instance/beat.go:607    Home path: [/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat] Config path: [/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat] Data path: [/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat/data] Logs path: [/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat/logs]
2019-09-04T21:08:38.188+0530    DEBUG   [beat]  instance/beat.go:659    Beat metadata path: /home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat/data/meta.json
2019-09-04T21:08:38.188+0530    INFO    instance/beat.go:615    Beat ID: 68386e1f-0080-4249-ae78-5278a46d79ac
2019-09-04T21:08:38.189+0530    INFO    [beat]  instance/beat.go:903    Beat info       {"system_info": {"beat": {"path": {"config": "/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat", "data": "/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat/data", "home": "/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat", "logs": "/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat/logs"}, "type": "nvidiagpubeat", "uuid": "68386e1f-0080-4249-ae78-5278a46d79ac"}}}
2019-09-04T21:08:38.189+0530    INFO    [beat]  instance/beat.go:912    Build info      {"system_info": {"build": {"commit": "unknown", "libbeat": "7.3.2", "time": "1754-08-30T22:43:41.128Z", "version": "7.3.2"}}}
2019-09-04T21:08:38.189+0530    INFO    [beat]  instance/beat.go:915    Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":4,"version":"go1.12.9"}}}
2019-09-04T21:08:38.192+0530    INFO    [beat]  instance/beat.go:919    Host info       {"system_info": {"host": {"architecture":"x86_64","boot_time":"2019-09-03T20:40:11+05:30","containerized":false,"name":"linux-d4hc","ip":["127.0.0.1/8","::1/128","192.168.29.221/24","2405:201:e806:9f60:29d1:864b:af2b:f9f0/64","2405:201:e806:9f60:7a45:61ff:fec0:c319/64","fe80::7a45:61ff:fec0:c319/64"],"kernel_version":"4.12.14-lp151.27-default","mac":["c8:5b:76:68:99:f7","78:45:61:c0:c3:19"],"os":{"family":"","platform":"opensuse-leap","name":"openSUSE Leap","version":"15.1","major":15,"minor":1,"patch":0},"timezone":"IST","timezone_offset_sec":19800,"id":"1ae32b0454884a1cac7ab936ce597373"}}}
2019-09-04T21:08:38.193+0530    INFO    [beat]  instance/beat.go:948    Process info    {"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":null,"effective":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat", "exe": "/home/ashwin/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidiagpubeat", "name": "nvidiagpubeat", "pid": 2913, "ppid": 27508, "seccomp": {"mode":"disabled","no_new_privs":false}, "start_time": "2019-09-04T21:08:37.370+0530"}}}
2019-09-04T21:08:38.193+0530    INFO    instance/beat.go:292    Setup Beat: nvidiagpubeat; Version: 7.3.2
2019-09-04T21:08:38.194+0530    DEBUG   [beat]  instance/beat.go:318    Initializing output plugins
2019-09-04T21:08:38.194+0530    INFO    [index-management]      idxmgmt/std.go:178      Set output.elasticsearch.index to 'nvidiagpubeat-7.3.2' as ILM is enabled.
2019-09-04T21:08:38.194+0530    INFO    elasticsearch/client.go:170     Elasticsearch url: http://localhost:9200
2019-09-04T21:08:38.195+0530    DEBUG   [publisher]     pipeline/consumer.go:137        start pipeline event consumer
2019-09-04T21:08:38.195+0530    INFO    [publisher]     pipeline/module.go:97   Beat name: linux-d4hc
2019-09-04T21:08:38.196+0530    INFO    [monitoring]    log/log.go:118  Starting metrics logging every 30s
2019-09-04T21:08:38.196+0530    INFO    instance/beat.go:422    nvidiagpubeat start running.
2019-09-04T21:08:38.196+0530    INFO    beater/nvidiagpubeat.go:57      nvidiagpubeat is running for ** test ** environment. ! Hit CTRL-C to stop it.
2019-09-04T21:08:39.205+0530    ERROR   beater/nvidiagpubeat.go:75      Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
2019-09-04T21:08:40.207+0530    ERROR   beater/nvidiagpubeat.go:75      Event not generated, error: Unable to fetch any events from nvidia-smi: Error read |0: file already closed
^C2019-09-04T21:08:41.178+0530  DEBUG   [service]       service/service.go:53   Received sigterm/sigint, stopping
2019-09-04T21:08:41.178+0530    DEBUG   [publisher]     pipeline/client.go:149  client: closing acker
2019-09-04T21:08:41.178+0530    DEBUG   [publisher]     pipeline/client.go:151  client: done closing acker
2019-09-04T21:08:41.178+0530    DEBUG   [publisher]     pipeline/client.go:155  client: cancelled 0 events
2019-09-04T21:08:41.184+0530    INFO    [monitoring]    log/log.go:153  Total non-zero metrics  {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":20,"time":{"ms":24}},"total":{"ticks":60,"time":{"ms":64},"value":60},"user":{"ticks":40,"time":{"ms":40}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":5},"info":{"ephemeral_id":"f413aa9b-3ef6-4b77-998e-6e1f39166bc3","uptime":{"ms":3010}},"memstats":{"gc_next":4194304,"memory_alloc":1289056,"memory_total":3070424,"rss":23531520},"runtime":{"goroutines":8}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"elasticsearch"},"pipeline":{"clients":0,"events":{"active":0}}},"system":{"cpu":{"cores":4},"load":{"1":1.65,"15":2.23,"5":2.1,"norm":{"1":0.4125,"15":0.5575,"5":0.525}}}}}}
2019-09-04T21:08:41.185+0530    INFO    [monitoring]    log/log.go:154  Uptime: 3.016770154s
2019-09-04T21:08:41.185+0530    INFO    [monitoring]    log/log.go:131  Stopping metrics logging.
2019-09-04T21:08:41.185+0530    INFO    instance/beat.go:432    nvidiagpubeat stopped.
    • Beats branch : 7.3
    • nvidiagpubeat branch : withBeats7.3
  1. OS: openSUSE Leap 15.1

  2. ashwin@linux-d4hc:~> nvidia-smi
    If 'nvidia-smi' is not a typo you can use command-not-found to lookup the package that contains it, like this:
    cnf nvidia-smi
  3. ashwin@linux-d4hc:~> ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l
    0
  4. ashwin@linux-d4hc:~/Downloads/beats_dev/src/github.com/ebay/nvidiagpubeat> nvidiagpubeat --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,temperature.gpu,pstate --format=csv
    Error: unknown flag: --query-gpu
    Usage:
      nvidiagpubeat [flags]
      nvidiagpubeat [command]

    Available Commands:
      export      Export current config or index template
      help        Help about any command
      keystore    Manage secrets keystore
      run         Run nvidiagpubeat
      setup       Setup index template, dashboards and ML jobs
      test        Test config
      version     Show current version info

    Flags:
      -E, --E setting=value      Configuration overwrite
      -N, --N                    Disable actual publishing for testing
      -c, --c string             Configuration file, relative to path.config (default "nvidiagpubeat.yml")
          --cpuprofile string    Write cpu profile to file
      -d, --d string             Enable certain debug selectors
      -e, --e                    Log to stderr and disable syslog/file output
      -h, --help                 help for nvidiagpubeat
          --httpprof string      Start pprof http server
          --memprofile string    Write memory profile to this file
          --path.config string   Configuration path
          --path.data string     Data path
          --path.home string     Home path
          --path.logs string     Logs path
          --plugin pluginList    Load additional plugins
          --strict.perms         Strict permission checking on config files (default true)
      -v, --v                    Log at INFO level

    Use "nvidiagpubeat [command] --help" for more information about a command.

deepujain commented 5 years ago

4 indicates that nvidia-smi is either not in PATH or not installed at all. nvidia-smi is the command-line utility, shipped with the NVIDIA GPU driver, that collects metrics from GPU cards.

If this is the root cause of the issue, I can add checks and return an appropriate error message.

3 indicates that there are 0 GPU cards on the current machine.
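
The two observations above (nvidia-smi missing from PATH, and zero GPU device nodes) can be checked up front. Below is a minimal Python sketch of such a pre-flight check, written only for illustration: it is not code from nvidiagpubeat (which is a Go program), and the function names `count_gpu_device_nodes` and `preflight` are hypothetical. The counting logic mirrors the `ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l` pipeline.

```python
import os
import shutil

# Names excluded by the two `grep -v` filters in the shell pipeline.
EXCLUDED_NODES = ("nvidia-uvm", "nvidiactl")


def count_gpu_device_nodes(dev_dir="/dev"):
    """Count entries in dev_dir the way the shell pipeline does."""
    try:
        names = os.listdir(dev_dir)
    except OSError:
        return 0
    return sum(
        1
        for name in names
        if not name.startswith(".")  # plain `ls` does not list dotfiles
        and "nvidia" in name
        and not any(excluded in name for excluded in EXCLUDED_NODES)
    )


def preflight(dev_dir="/dev", tool="nvidia-smi"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if shutil.which(tool) is None:
        problems.append(f"{tool} was not found in PATH; is it installed?")
    if count_gpu_device_nodes(dev_dir) == 0:
        problems.append(f"no NVIDIA GPU device nodes found under {dev_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Running the script on a machine without nvidia-smi or without GPU device nodes prints one line per problem found; on a healthy machine it prints nothing.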

5 was my typo: the --query-gpu flag belongs to nvidia-smi, not nvidiagpubeat, which is why you got "Error: unknown flag: --query-gpu". Can you run

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,temperature.gpu,pstate --format=csv

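
For reference, here is a minimal Python sketch of how the CSV printed by that nvidia-smi query could be turned into one record per GPU. It is an illustration, not nvidiagpubeat's own parser. The `SAMPLE` text is made up, and its header layout (unit suffixes such as `[%]` and `[MiB]`) is an assumption about what nvidia-smi prints; `parse_query_csv` is a hypothetical name.

```python
import csv
import io

# Illustrative sample only; NOT output from the reporter's machine.
# The header layout with unit suffixes is an assumption.
SAMPLE = """\
utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB], temperature.gpu, pstate
12 %, 3 %, 16280 MiB, 15000 MiB, 1280 MiB, 41, P0
"""


def parse_query_csv(text):
    """Turn `--format=csv` text into one dict per GPU row."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]
    if not rows:
        return []
    # Drop the unit suffix from each header, e.g. "memory.total [MiB]".
    header = [column.split(" [")[0].strip() for column in rows[0]]
    return [dict(zip(header, (value.strip() for value in row))) for row in rows[1:]]


if __name__ == "__main__":
    for gpu in parse_query_csv(SAMPLE):
        print(gpu["utilization.gpu"], gpu["memory.used"], gpu["pstate"])
```

Values are kept as strings (for example "12 %") so that the sketch makes no further assumption about units.
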
deepujain commented 5 years ago

I noticed this from your logs

2019-09-04T21:08:38.196+0530    INFO    instance/beat.go:422    nvidiagpubeat start running.
2019-09-04T21:08:38.196+0530    INFO    beater/nvidiagpubeat.go:57      nvidiagpubeat is running for ** test ** environment. ! Hit CTRL-C to stop it.

And you are running on openSUSE Linux.

https://github.com/eBay/nvidiagpubeat#run-in-test-environment-macos indicates that "test" mode is supported on macOS. Test mode uses localnvidiasmi, and that executable is built on and for macOS.
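
A guard for this mismatch could look roughly like the Python sketch below. It is only an illustration of the idea, not nvidiagpubeat code: the function names and the "production" value are hypothetical, and "test" is taken from the log line above.

```python
import platform


def is_test_mode_supported(system_name=None):
    """Return True when the macOS-built test helper is expected to run."""
    if system_name is None:
        system_name = platform.system()
    # platform.system() reports macOS as "Darwin".
    return system_name == "Darwin"


def check_env(env, system_name=None):
    """Return a warning string for an unsupported combination, else None."""
    if env == "test" and not is_test_mode_supported(system_name):
        return "test mode relies on a helper built for macOS; use nvidia-smi instead"
    return None


if __name__ == "__main__":
    warning = check_env("test")
    if warning:
        print("warning:", warning)
```

Passing `system_name` explicitly keeps the check easy to exercise on any machine.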

deepujain commented 5 years ago

Do you want to work on https://github.com/eBay/nvidiagpubeat/issues/25? It will fix the current issue.

NAshwinKumar commented 5 years ago

Thanks, deepujain. Installing nvidia-smi solved the issue.