NVIDIA / go-dcgm

Golang bindings for Nvidia Datacenter GPU Manager (DCGM)
Apache License 2.0

api Init error #13

Closed young-lee-young closed 1 year ago

young-lee-young commented 3 years ago

There may be an error in the `Init` function in api.go.

`dcgmInitCounter` should be incremented only when `err` is nil:

    if err == nil {
        dcgmInitCounter += 1
    }

LujieDuan commented 1 year ago

Hey,

This has become a problem that is blocking us right now.

Issue Description: `initDcgm` first tries to load the SO file as a way to check whether DCGM is installed. Currently, `api.Init` cannot be called meaningfully more than once when DCGM is not installed: `dcgmInitCounter` becomes 1 even though DCGM is not installed, and there is no way to decrease it back to 0.
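
To make the counter problem concrete, here is a minimal, self-contained sketch. It does not use go-dcgm at all: the `library` type and the `initBuggy` and `initFixed` methods are hypothetical stand-ins, and `initBuggy` is only one plausible way the described behaviour could arise, not the package's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotInstalled stands in for the "libdcgm.so not Found" failure.
var errNotInstalled = errors.New("libdcgm.so not Found")

// library is a hypothetical stand-in for the binding's package-level state.
type library struct {
	initCounter int  // plays the role of dcgmInitCounter
	installed   bool // whether the simulated shared object can be loaded
}

// load simulates the fallible step of loading the shared object.
func (l *library) load() error {
	if !l.installed {
		return errNotInstalled
	}
	return nil
}

// initBuggy counts the call before it knows whether load succeeded,
// so every later call sees a non-zero counter and skips loading.
func (l *library) initBuggy() error {
	l.initCounter++
	if l.initCounter > 1 {
		return nil
	}
	return l.load()
}

// initFixed increments the counter only after a successful load,
// so a failed attempt leaves the counter at 0 and can be retried.
func (l *library) initFixed() error {
	if l.initCounter > 0 {
		l.initCounter++
		return nil
	}
	if err := l.load(); err != nil {
		return err
	}
	l.initCounter++
	return nil
}

func main() {
	buggy := &library{}
	fmt.Println(buggy.initBuggy()) // fails: nothing is installed
	fmt.Println(buggy.initBuggy()) // reports success without loading

	fixed := &library{}
	fmt.Println(fixed.initFixed()) // fails
	fmt.Println(fixed.initFixed()) // fails again, as expected
}
```

In this model the second `initBuggy` call returns a nil error, which matches the misleading "Connected!" in the reproduction below, while `initFixed` keeps reporting the failure until loading actually succeeds.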

Reproduce steps

  1. Make sure DCGM is NOT installed, and run the following script:

    package main

    import (
        "fmt"

        "github.com/NVIDIA/go-dcgm/pkg/dcgm"
    )

    func main() {
        _, err := dcgm.Init(dcgm.Standalone, "localhost:5555", "0")
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("Connected!")
        }

        _, err = dcgm.Init(dcgm.Standalone, "localhost:5555", "0")
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("Connected!")
        }
    }

output:

    libdcgm.so not Found
    Connected!

expected:

    libdcgm.so not Found
    libdcgm.so not Found


  2. If the cleanup function is called in between:

    package main

    import (
        "fmt"

        "github.com/NVIDIA/go-dcgm/pkg/dcgm"
    )

    func main() {
        cleanup, err := dcgm.Init(dcgm.Standalone, "localhost:5555", "0")
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("Connected!")
        }

        cleanup()

        _, err = dcgm.Init(dcgm.Standalone, "localhost:5555", "0")
        if err != nil {
            fmt.Println(err)
        } else {
            fmt.Println("Connected!")
        }
    }

output:

    libdcgm.so not Found
    SIGSEGV: segmentation violation
    PC=0x0 m=0 sigcode=1
    signal arrived during cgo execution

    goroutine 1 [syscall]:
    runtime.cgocall(0x4c5f60, 0xc0001a5d98)
        /usr/lib/google-golang/src/runtime/cgocall.go:157 +0x4b fp=0xc0001a5d70 sp=0xc0001a5d38 pc=0x40694b
    github.com/NVIDIA/go-dcgm/pkg/dcgm._Cfunc_dcgmStopEmbedded(0x0)
        _cgo_gotypes.go:1355 +0x47 fp=0xc0001a5d98 sp=0xc0001a5d70 pc=0x4c3927
    github.com/NVIDIA/go-dcgm/pkg/dcgm.stopEmbedded()
        /usr/local/google/home/lujieduan/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20230516210056-6d8fa5f863f8/pkg/dcgm/admin.go:118 +0x25 fp=0xc0001a5de0 sp=0xc0001a5d98 pc=0x4c4105
    github.com/NVIDIA/go-dcgm/pkg/dcgm.shutdown()
        /usr/local/google/home/lujieduan/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20230516210056-6d8fa5f863f8/pkg/dcgm/admin.go:91 +0x47 fp=0xc0001a5e00 sp=0xc0001a5de0 pc=0x4c3f87
    github.com/NVIDIA/go-dcgm/pkg/dcgm.Shutdown()
        /usr/local/google/home/lujieduan/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20230516210056-6d8fa5f863f8/pkg/dcgm/api.go:46 +0x85 fp=0xc0001a5e48 sp=0xc0001a5e00 pc=0x4c3485
    github.com/NVIDIA/go-dcgm/pkg/dcgm.Init.func1()
        /usr/local/google/home/lujieduan/go/pkg/mod/github.com/!n!v!i!d!i!a/go-dcgm@v0.0.0-20230516210056-6d8fa5f863f8/pkg/dcgm/api.go:33 +0x13 fp=0xc0001a5ea0 sp=0xc0001a5e48 pc=0x4c59b3
    main.main()
        /usr/local/google/home/lujieduan/source/test/no_dcgm/main.go:17 +0xfa fp=0xc0001a5f40 sp=0xc0001a5ea0 pc=0x4c5cba
    runtime.main()
        /usr/lib/google-golang/src/runtime/proc.go:267 +0x2bb fp=0xc0001a5fe0 sp=0xc0001a5f40 pc=0x4397bb
    runtime.goexit()
        /usr/lib/google-golang/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0001a5fe8 sp=0xc0001a5fe0 pc=0x466d41

    goroutine 2 [force gc (idle)]:
    runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/lib/google-golang/src/runtime/proc.go:401 +0xce fp=0xc0000a8fa8 sp=0xc0000a8f88 pc=0x439bee
    runtime.goparkunlock(...)
        /usr/lib/google-golang/src/runtime/proc.go:407
    runtime.forcegchelper()
        /usr/lib/google-golang/src/runtime/proc.go:325 +0xb3 fp=0xc0000a8fe0 sp=0xc0000a8fa8 pc=0x439a73
    runtime.goexit()
        /usr/lib/google-golang/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000a8fe8 sp=0xc0000a8fe0 pc=0x466d41
    created by runtime.init.6 in goroutine 1
        /usr/lib/google-golang/src/runtime/proc.go:313 +0x1a

    goroutine 3 [GC sweep wait]:
    runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/lib/google-golang/src/runtime/proc.go:401 +0xce fp=0xc0000a9778 sp=0xc0000a9758 pc=0x439bee
    runtime.goparkunlock(...)
        /usr/lib/google-golang/src/runtime/proc.go:407
    runtime.bgsweep(0x0?)
        /usr/lib/google-golang/src/runtime/mgcsweep.go:280 +0x94 fp=0xc0000a97c8 sp=0xc0000a9778 pc=0x426154
    runtime.gcenable.func1()
        /usr/lib/google-golang/src/runtime/mgc.go:200 +0x25 fp=0xc0000a97e0 sp=0xc0000a97c8 pc=0x41b405
    runtime.goexit()
        /usr/lib/google-golang/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000a97e8 sp=0xc0000a97e0 pc=0x466d41
    created by runtime.gcenable in goroutine 1
        /usr/lib/google-golang/src/runtime/mgc.go:200 +0x66

    goroutine 4 [GC scavenge wait]:
    runtime.gopark(0xc0000d0000?, 0x516f28?, 0x1?, 0x0?, 0xc0000071e0?)
        /usr/lib/google-golang/src/runtime/proc.go:401 +0xce fp=0xc0000a9f70 sp=0xc0000a9f50 pc=0x439bee
    runtime.goparkunlock(...)
        /usr/lib/google-golang/src/runtime/proc.go:407
    runtime.(*scavengerState).park(0x5c2180)
        /usr/lib/google-golang/src/runtime/mgcscavenge.go:426 +0x49 fp=0xc0000a9fa0 sp=0xc0000a9f70 pc=0x423989
    runtime.bgscavenge(0x0?)
        /usr/lib/google-golang/src/runtime/mgcscavenge.go:654 +0x3c fp=0xc0000a9fc8 sp=0xc0000a9fa0 pc=0x423f1c
    runtime.gcenable.func2()
        /usr/lib/google-golang/src/runtime/mgc.go:201 +0x25 fp=0xc0000a9fe0 sp=0xc0000a9fc8 pc=0x41b3a5
    runtime.goexit()
        /usr/lib/google-golang/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000a9fe8 sp=0xc0000a9fe0 pc=0x466d41
    created by runtime.gcenable in goroutine 1
        /usr/lib/google-golang/src/runtime/mgc.go:201 +0xa5

    goroutine 5 [finalizer wait]:
    runtime.gopark(0x4eeb00?, 0x10043ad01?, 0x0?, 0x0?, 0x441e25?)
        /usr/lib/google-golang/src/runtime/proc.go:401 +0xce fp=0xc0000a8628 sp=0xc0000a8608 pc=0x439bee
    runtime.runfinq()
        /usr/lib/google-golang/src/runtime/mfinal.go:193 +0x107 fp=0xc0000a87e0 sp=0xc0000a8628 pc=0x41a487
    runtime.goexit()
        /usr/lib/google-golang/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000a87e8 sp=0xc0000a87e0 pc=0x466d41
    created by runtime.createfing in goroutine 1
        /usr/lib/google-golang/src/runtime/mfinal.go:163 +0x3d

    rax    0xc0001a6000
    rbx    0xc0001a5d98
    rcx    0xc0001a5d98
    rdx    0xc0001a5d28
    rdi    0x0
    rsi    0x5c22e0
    rbp    0xc0001a5d28
    rsp    0x7ffd3501efa8
    r8     0x5c26c0
    r9     0x0
    r10    0x1
    r11    0x216
    r12    0x0
    r13    0xfffffffffffffff
    r14    0xc0001a6000
    r15    0x3e
    rip    0x0
    rflags 0x10216
    cs     0x33
    fs     0x0
    gs     0x0
    exit status 2
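
The `PC=0x0` and `rip 0x0` values inside `_Cfunc_dcgmStopEmbedded` suggest that the cleanup path jumped to a null address, which is consistent with calling into a library that was never loaded. A minimal sketch of one defensive pattern follows: return a no-op cleanup when initialization fails and make the real cleanup run at most once. The `backend` type and `initSafe` function are hypothetical and are not part of go-dcgm.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// backend is a hypothetical stand-in for a natively loaded library.
type backend struct {
	loaded bool // true only after start succeeds
	stops  int  // number of times stop actually ran
}

// start simulates loading the library; it fails when the library is absent.
func (b *backend) start(available bool) error {
	if !available {
		return errors.New("libdcgm.so not Found")
	}
	b.loaded = true
	return nil
}

// stop simulates a native shutdown call, which is only valid after start.
func (b *backend) stop() {
	if !b.loaded {
		panic("stop called on a backend that never loaded")
	}
	b.loaded = false
	b.stops++
}

// initSafe returns a cleanup function that is always safe to call:
// a no-op when start failed, and a run-once stop when it succeeded.
func initSafe(b *backend, available bool) (func(), error) {
	if err := b.start(available); err != nil {
		return func() {}, err
	}
	var once sync.Once
	return func() { once.Do(b.stop) }, nil
}

func main() {
	b := &backend{}

	cleanup, err := initSafe(b, false)
	fmt.Println("first init error:", err)
	cleanup() // no-op: nothing was loaded

	cleanup, err = initSafe(b, true)
	fmt.Println("second init error:", err)
	cleanup()
	cleanup() // ignored by sync.Once
	fmt.Println("stop ran", b.stops, "time(s)")
}
```

With this shape, a caller can invoke the returned cleanup unconditionally, as the second reproduction script does, without reaching the native shutdown path after a failed initialization.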



**Expected Behaviour**
We want to repeatedly check whether the app can communicate with the DCGM service until `Init()` succeeds. Right now there is no way to do that, because `Init()` sets `dcgmInitCounter` to 1 even if DCGM is not installed, and after that the counter cannot be decreased back to 0 and `Init()` cannot be retried correctly.

glowkey commented 1 year ago

A fix for this has been committed.

LujieDuan commented 1 year ago

Tested and works great! Thanks for fixing this!