eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and can ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
Apache License 2.0
54 stars, 22 forks

fork/exec error #6

Closed: somabc closed this issue 5 years ago

somabc commented 5 years ago

After building on Ubuntu 18.04 and running nvidiagpubeat with the environment set to production:

 ./nvidiagpubeat -c nvidiagpubeat.yml -e -d "*"

I get a lot of fork/exec errors:

    2019-02-06T18:21:57.750Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
    2019-02-06T18:21:58.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
    2019-02-06T18:21:59.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
    2019-02-06T18:22:00.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
    2019-02-06T18:22:01.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
    2019-02-06T18:22:02.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted

somabc commented 5 years ago

Similar problem in the old Git repo: https://github.com/deepujain/nvidiagpubeat/issues/20

deepujain commented 5 years ago

@somabc The beat is failing to determine the number of GPU cards; the failure is in gpucount.go. What output do you see when you run:

 bash -c "ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l"

I am able to run the command from a standalone Go file, but not from the beat.

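package main

import (
    "fmt"
    "os/exec"
)

// command counts the NVIDIA device nodes in /dev by shelling out to bash.
func command() {
    gpucount := "ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l"
    cmd := exec.Command("bash", "-c", gpucount)
    out, err := cmd.Output()
    if err == nil {
        fmt.Print(string(out)) // assumed: the original body was cut off here
    }
}

func main() { command() }

deepujain commented 5 years ago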

@somabc Thanks for using nvidiagpubeat.

Use -E seccomp.enabled=false on the command line. The fix is described at https://github.com/eBay/nvidiagpubeat#run-in-production

Please update this issue with the results of your test.

minduni commented 5 years ago

@deepujain I was testing a fix I found, adding seccomp.enabled: false to nvidiagpubeat.yml, but it didn't work. I can confirm that adding -E seccomp.enabled=false on the command line fixed the issue on RH 7.5 as well.

somabc commented 5 years ago

bash -c "ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l"

gives an output of 1

Thanks, I added -E seccomp.enabled=false and it's working now.
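For comparison, the same count can be computed without spawning a shell at all, which sidesteps fork/exec entirely. This is only a sketch that mirrors the filters of the shell pipeline quoted in this thread; `countGPUDevices` is a hypothetical name, and this is not how nvidiagpubeat itself counts GPUs.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// countGPUDevices applies the same filters as the shell pipeline to a list
// of /dev entry names: keep names containing "nvidia", then drop names
// containing "nvidia-uvm" or "nvidiactl".
func countGPUDevices(names []string) int {
	count := 0
	for _, name := range names {
		if !strings.Contains(name, "nvidia") {
			continue
		}
		if strings.Contains(name, "nvidia-uvm") || strings.Contains(name, "nvidiactl") {
			continue
		}
		count++
	}
	return count
}

func main() {
	// Read the directory in-process instead of running `ls /dev`.
	entries, err := os.ReadDir("/dev")
	if err != nil {
		fmt.Println("cannot read /dev:", err)
		return
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	fmt.Println(countGPUDevices(names))
}
```

Like the grep pipeline, this matches on substrings, so any other /dev entry whose name contains "nvidia" would be counted too.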

deepujain commented 5 years ago

@somabc @minduni Thanks for the confirmation. You are correct. I did try the .yml fix and it did not work. However, the -E flag worked.

Also, could you please star the repository if you find nvidiagpubeat useful? It helps in tracking the number of users.