eBay / nvidiagpubeat

nvidiagpubeat is an Elastic Beat that uses the NVIDIA System Management Interface (nvidia-smi) to monitor NVIDIA GPU devices and can ingest metrics into an Elasticsearch cluster, with support for both the 6.x and 7.x versions of Beats. nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
https://github.com/eBay/nvidiagpubeat
Apache License 2.0

fork/exec error #6

Closed (somabc closed this issue 5 years ago)

somabc commented 5 years ago

I built nvidiagpubeat on Ubuntu 18.04 and am running it with the environment set to production.

When I run ./nvidiagpubeat -c nvidiagpubeat.yml -e -d "*", I get a lot of fork/exec errors:

2019-02-06T18:21:57.750Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
2019-02-06T18:21:58.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:21:59.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:22:00.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:22:01.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted
2019-02-06T18:22:02.751Z ERROR beater/nvidiagpubeat.go:75 Event not generated, error: fork/exec /bin/bash: operation not permitted

somabc commented 5 years ago

A similar problem was reported in the old Git repo: https://github.com/deepujain/nvidiagpubeat/issues/20

deepujain commented 5 years ago

@somabc The beat is failing to determine the number of GPU cards; the failure is in gpucount.go. What output do you see when you run the following?

 bash -c "ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l"

I am able to run the command from a standalone Go file, but not from the beat.

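package main

import (
    "fmt"
    "os"
    "os/exec"
)

// command runs the GPU-count shell pipeline through bash and prints its output.
func command() {
    gpucount := "ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l"
    cmd := exec.Command("bash", "-c", gpucount)
    out, err := cmd.Output()
    if err != nil {
        // Report the failure instead of silently ignoring it.
        fmt.Fprintln(os.Stderr, "command failed:", err)
        return
    }
    fmt.Println(string(out))
}

func main() {
    command()
}

deepujain commented 5 years ago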

@somabc Thanks for using nvidiagpubeat.

The fix is documented at https://github.com/eBay/nvidiagpubeat#run-in-production: run the beat with -E seccomp.enabled=false.

Please update this Issue with results of your test.

minduni commented 5 years ago

@deepujain I tested a fix I had found, adding seccomp.enabled: false to nvidiagpubeat.yml, but it didn't work. I can confirm that adding -E seccomp.enabled=false on the command line fixed the issue, also on RH 7.5.

somabc commented 5 years ago

bash -c "ls /dev | grep nvidia | grep -v nvidia-uvm | grep -v nvidiactl | wc -l"

gives an output of 1

Thanks. I added -E seccomp.enabled=false and it's working now.

deepujain commented 5 years ago

@somabc @minduni Thanks for the confirmation. You are correct: I also tried the .yml fix and it did not work, but the -E flag did.

Also, if you find nvidiagpubeat useful, could you please star the repository? It helps track the number of users.