NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0

gen/nvml: add `--export-dynamic` linker flag #79

Closed braydonk closed 1 year ago

braydonk commented 1 year ago

Closes #36

Investigation on the Go side https://github.com/golang/go/issues/63264

When built with go1.21.x, go-nvml panics. This is due to a change in how `go tool link` invokes the external linker. Previously, `-rdynamic` was always passed to the external linker. It is now passed only in certain circumstances, such as when linking against previously built cgo shared objects, or when the flag to export specific dynamic symbols isn't supported. The justification for the change is in https://github.com/golang/go/issues/53579.

The resolution I propose for go-nvml is to pass the `--export-dynamic` flag in LDFLAGS explicitly. The flag is required because, without it, the symbols from nvml.h are not added to the PLT, so each call to a symbol from nvml.h resolves to address 0x0 instead of a PLT offset (at least on ELF AMD64, though I assume something equivalent occurs on other setups).
Passing the flag explicitly means it is always present, whether or not the Go version used for the build applies `-rdynamic` on its own.

I tested this PR by uncommenting the commented-out test command in the `make test` target. On the main branch, all dl tests pass, and then TestInit crashes with a SIGSEGV at PC=0x0.

(base) braydonk@gpu-tester:~/go-nvml$ make test
cp /home/braydonk/go-nvml/gen/nvml/nvml.yml /home/braydonk/go-nvml/pkg/nvml
c-for-go -out /home/braydonk/go-nvml/pkg /home/braydonk/go-nvml/pkg/nvml/nvml.yml
  processing /home/braydonk/go-nvml/pkg/nvml/nvml.yml done.
cp /home/braydonk/go-nvml/gen/nvml/*.go /home/braydonk/go-nvml/pkg/nvml
cd /home/braydonk/go-nvml/pkg/nvml; \
        go tool cgo -godefs types.go > types_gen.go; \
        go fmt types_gen.go; \
cd -> /dev/null
types_gen.go
rm -rf /home/braydonk/go-nvml/pkg/nvml/nvml.yml /home/braydonk/go-nvml/pkg/nvml/types.go /home/braydonk/go-nvml/pkg/nvml/_obj
grep -l -R "// WARNING: This file has automatically been generated on" pkg \
        | xargs sed -i -E 's#// WARNING: This file has automatically been generated on.*$#// WARNING: THIS FILE WAS AUTOMATICALLY GENERATED.#g'
grep -l -RE "// (.*) nvml/nvml.h:[0-9]+" pkg \
        | xargs sed -i -E 's#// (.*) nvml/nvml.h:[0-9]+$#// \1 nvml/nvml.h#g'
GOOS=linux go build github.com/NVIDIA/go-nvml/pkg/...
go test -v -coverprofile=coverage.out github.com/NVIDIA/go-nvml/pkg/...
=== RUN   TestNew
=== PAUSE TestNew
=== RUN   TestOpenSuccess
=== PAUSE TestOpenSuccess
=== RUN   TestOpenFailed
=== PAUSE TestOpenFailed
=== RUN   TestOpenTwice
=== PAUSE TestOpenTwice
=== RUN   TestClose
=== PAUSE TestClose
=== RUN   TestLookupSuccess
=== PAUSE TestLookupSuccess
=== RUN   TestLookupFailed
=== PAUSE TestLookupFailed
=== CONT  TestNew
--- PASS: TestNew (0.00s)
=== CONT  TestLookupFailed
--- PASS: TestLookupFailed (0.00s)
=== CONT  TestLookupSuccess
--- PASS: TestLookupSuccess (0.00s)
=== CONT  TestClose
--- PASS: TestClose (0.00s)
=== CONT  TestOpenTwice
--- PASS: TestOpenTwice (0.00s)
=== CONT  TestOpenFailed
=== CONT  TestOpenSuccess
--- PASS: TestOpenSuccess (0.00s)
--- PASS: TestOpenFailed (0.00s)
PASS
coverage: 92.1% of statements
ok      github.com/NVIDIA/go-nvml/pkg/dl        0.005s  coverage: 92.1% of statements
=== RUN   TestInit
SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution

goroutine 19 [syscall]:
runtime.cgocall(0x54f4a0, 0xc0000426e0)
        /home/braydonk/sdk/go1.21.1/src/runtime/cgocall.go:157 +0x4b fp=0xc0000426b8 sp=0xc000042680 pc=0x40684b
github.com/NVIDIA/go-nvml/pkg/nvml._Cfunc_nvmlInit_v2()
        _cgo_gotypes.go:4597 +0x47 fp=0xc0000426e0 sp=0xc0000426b8 pc=0x5467c7
github.com/NVIDIA/go-nvml/pkg/nvml.nvmlInit_v2()
        /home/braydonk/go-nvml/pkg/nvml/nvml.go:32 +0x3a fp=0xc0000426f8 sp=0xc0000426e0 pc=0x547fda
github.com/NVIDIA/go-nvml/pkg/nvml.Init()
        /home/braydonk/go-nvml/pkg/nvml/init.go:43 +0xc2 fp=0xc000042720 sp=0xc0000426f8 pc=0x5476a2
github.com/NVIDIA/go-nvml/pkg/nvml.TestInit(0xc000082d00)
        /home/braydonk/go-nvml/pkg/nvml/nvml_test.go:22 +0x1c fp=0xc000042770 sp=0xc000042720 pc=0x5420dc
testing.tRunner(0xc000082d00, 0x5a2808)
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:1595 +0xff fp=0xc0000427c0 sp=0xc000042770 pc=0x51bcbf
testing.(*T).Run.func1()
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:1648 +0x25 fp=0xc0000427e0 sp=0xc0000427c0 pc=0x51cc45
runtime.goexit()
        /home/braydonk/sdk/go1.21.1/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000427e8 sp=0xc0000427e0 pc=0x46c3a1
created by testing.(*T).Run in goroutine 1
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:1648 +0x3ad

goroutine 1 [chan receive]:
runtime.gopark(0xc0000569d8?, 0x40f645?, 0x90?, 0xc0?, 0x18?)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:398 +0xce fp=0xc000056970 sp=0xc000056950 pc=0x43c46e
runtime.chanrecv(0xc0000d2070, 0xc000056a57, 0x1)
        /home/braydonk/sdk/go1.21.1/src/runtime/chan.go:583 +0x3cd fp=0xc0000569e8 sp=0xc000056970 pc=0x408c0d
runtime.chanrecv1(0x6d5440?, 0x563340?)
        /home/braydonk/sdk/go1.21.1/src/runtime/chan.go:442 +0x12 fp=0xc000056a10 sp=0xc0000569e8 pc=0x408832
testing.(*T).Run(0xc000082b60, {0x594c9b?, 0x51b9fc?}, 0x5a2808)
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:1649 +0x3c8 fp=0xc000056ad0 sp=0xc000056a10 pc=0x51cae8
testing.runTests.func1(0x6d5f00?)
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:2054 +0x3e fp=0xc000056b20 sp=0xc000056ad0 pc=0x51eb9e
testing.tRunner(0xc000082b60, 0xc000056c38)
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:1595 +0xff fp=0xc000056b70 sp=0xc000056b20 pc=0x51bcbf
testing.runTests(0xc0000a20a0?, {0x6bd940, 0x4, 0x4}, {0x41563f?, 0xc000056cf8?, 0x6d56e0?})
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:2052 +0x445 fp=0xc000056c68 sp=0xc000056b70 pc=0x51ea85
testing.(*M).Run(0xc0000a20a0)
        /home/braydonk/sdk/go1.21.1/src/testing/testing.go:1925 +0x636 fp=0xc000056eb0 sp=0xc000056c68 pc=0x51d476
main.main()
        _testmain.go:87 +0x1bf fp=0xc000056f40 sp=0xc000056eb0 pc=0x54bd7f
runtime.main()
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:267 +0x2bb fp=0xc000056fe0 sp=0xc000056f40 pc=0x43bffb
runtime.goexit()
        /home/braydonk/sdk/go1.21.1/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000056fe8 sp=0xc000056fe0 pc=0x46c3a1

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:398 +0xce fp=0xc000046fa8 sp=0xc000046f88 pc=0x43c46e
runtime.goparkunlock(...)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:404
runtime.forcegchelper()
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:322 +0xb3 fp=0xc000046fe0 sp=0xc000046fa8 pc=0x43c2d3
runtime.goexit()
        /home/braydonk/sdk/go1.21.1/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000046fe8 sp=0xc000046fe0 pc=0x46c3a1
created by runtime.init.6 in goroutine 1
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:398 +0xce fp=0xc000047778 sp=0xc000047758 pc=0x43c46e
runtime.goparkunlock(...)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
        /home/braydonk/sdk/go1.21.1/src/runtime/mgcsweep.go:280 +0x94 fp=0xc0000477c8 sp=0xc000047778 pc=0x426dd4
runtime.gcenable.func1()
        /home/braydonk/sdk/go1.21.1/src/runtime/mgc.go:200 +0x25 fp=0xc0000477e0 sp=0xc0000477c8 pc=0x41bf65
runtime.goexit()
        /home/braydonk/sdk/go1.21.1/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000477e8 sp=0xc0000477e0 pc=0x46c3a1
created by runtime.gcenable in goroutine 1
        /home/braydonk/sdk/go1.21.1/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc000026150?, 0x5cc020?, 0x1?, 0x0?, 0xc0000071e0?)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:398 +0xce fp=0xc000047f70 sp=0xc000047f50 pc=0x43c46e
runtime.goparkunlock(...)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:404
runtime.(*scavengerState).park(0x6d5760)
        /home/braydonk/sdk/go1.21.1/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000047fa0 sp=0xc000047f70 pc=0x424669
runtime.bgscavenge(0x0?)
        /home/braydonk/sdk/go1.21.1/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc000047fc8 sp=0xc000047fa0 pc=0x424bfc
runtime.gcenable.func2()
        /home/braydonk/sdk/go1.21.1/src/runtime/mgc.go:201 +0x25 fp=0xc000047fe0 sp=0xc000047fc8 pc=0x41bf05
runtime.goexit()
        /home/braydonk/sdk/go1.21.1/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000047fe8 sp=0xc000047fe0 pc=0x46c3a1
created by runtime.gcenable in goroutine 1
        /home/braydonk/sdk/go1.21.1/src/runtime/mgc.go:201 +0xa5

goroutine 18 [finalizer wait]:
runtime.gopark(0x590ea0?, 0x10043d601?, 0x0?, 0x0?, 0x444625?)
        /home/braydonk/sdk/go1.21.1/src/runtime/proc.go:398 +0xce fp=0xc000046628 sp=0xc000046608 pc=0x43c46e
runtime.runfinq()
        /home/braydonk/sdk/go1.21.1/src/runtime/mfinal.go:193 +0x107 fp=0xc0000467e0 sp=0xc000046628 pc=0x41afe7
runtime.goexit()
        /home/braydonk/sdk/go1.21.1/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000467e8 sp=0xc0000467e0 pc=0x46c3a1
created by runtime.createfing in goroutine 1
        /home/braydonk/sdk/go1.21.1/src/runtime/mfinal.go:163 +0x3d

rax    0xc000042800
rbx    0xc0000426e0
rcx    0xc0000426e0
rdx    0xc000042670
rdi    0xc0000426e0
rsi    0x6d58e0
rbp    0xc000042670
rsp    0x7ffc7d2d4098
r8     0x6d5f00
r9     0x0
r10    0xfffffffffffff6c5
r11    0x6
r12    0xc000042800
r13    0x5
r14    0xc000082ea0
r15    0x1
rip    0x0
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0
FAIL    github.com/NVIDIA/go-nvml/pkg/nvml      0.007s
FAIL
make: *** [Makefile:99: test] Error 1

On this branch, all tests pass.

(base) braydonk@gpu-tester:~/go-nvml$ git checkout export-dynamic
M       Makefile
Branch 'export-dynamic' set up to track remote branch 'export-dynamic' from 'origin'.
Switched to a new branch 'export-dynamic'
(base) braydonk@gpu-tester:~/go-nvml$ make test
cp /home/braydonk/go-nvml/gen/nvml/nvml.yml /home/braydonk/go-nvml/pkg/nvml
c-for-go -out /home/braydonk/go-nvml/pkg /home/braydonk/go-nvml/pkg/nvml/nvml.yml
  processing /home/braydonk/go-nvml/pkg/nvml/nvml.yml done.
cp /home/braydonk/go-nvml/gen/nvml/*.go /home/braydonk/go-nvml/pkg/nvml
cd /home/braydonk/go-nvml/pkg/nvml; \
        go tool cgo -godefs types.go > types_gen.go; \
        go fmt types_gen.go; \
cd -> /dev/null
types_gen.go
rm -rf /home/braydonk/go-nvml/pkg/nvml/nvml.yml /home/braydonk/go-nvml/pkg/nvml/types.go /home/braydonk/go-nvml/pkg/nvml/_obj
grep -l -R "// WARNING: This file has automatically been generated on" pkg \
        | xargs sed -i -E 's#// WARNING: This file has automatically been generated on.*$#// WARNING: THIS FILE WAS AUTOMATICALLY GENERATED.#g'
grep -l -RE "// (.*) nvml/nvml.h:[0-9]+" pkg \
        | xargs sed -i -E 's#// (.*) nvml/nvml.h:[0-9]+$#// \1 nvml/nvml.h#g'
GOOS=linux go build github.com/NVIDIA/go-nvml/pkg/...
go test -v -coverprofile=coverage.out github.com/NVIDIA/go-nvml/pkg/...
=== RUN   TestNew
=== PAUSE TestNew
=== RUN   TestOpenSuccess
=== PAUSE TestOpenSuccess
=== RUN   TestOpenFailed
=== PAUSE TestOpenFailed
=== RUN   TestOpenTwice
=== PAUSE TestOpenTwice
=== RUN   TestClose
=== PAUSE TestClose
=== RUN   TestLookupSuccess
=== PAUSE TestLookupSuccess
=== RUN   TestLookupFailed
=== PAUSE TestLookupFailed
=== CONT  TestNew
--- PASS: TestNew (0.00s)
=== CONT  TestLookupFailed
--- PASS: TestLookupFailed (0.00s)
=== CONT  TestLookupSuccess
--- PASS: TestLookupSuccess (0.00s)
=== CONT  TestClose
--- PASS: TestClose (0.00s)
=== CONT  TestOpenTwice
--- PASS: TestOpenTwice (0.00s)
=== CONT  TestOpenFailed
--- PASS: TestOpenFailed (0.00s)
=== CONT  TestOpenSuccess
--- PASS: TestOpenSuccess (0.00s)
PASS
coverage: 92.1% of statements
ok      github.com/NVIDIA/go-nvml/pkg/dl        0.010s  coverage: 92.1% of statements
=== RUN   TestInit
    nvml_test.go:26: Init: 0
    nvml_test.go:33: Shutdown: 0
--- PASS: TestInit (2.53s)
=== RUN   TestSystem
    nvml_test.go:45: SystemGetDriverVersion: 0
    nvml_test.go:46:   version: 525.105.17
    nvml_test.go:53: SystemGetNVMLVersion: 0
    nvml_test.go:54:   version: 12.525.105.17
    nvml_test.go:61: SystemGetCudaDriverVersion: 0
    nvml_test.go:62:   version: 12000
    nvml_test.go:69: SystemGetCudaDriverVersion_v2: 0
    nvml_test.go:70:   version: 12000
    nvml_test.go:77: SystemGetProcessName: 0
    nvml_test.go:78:   name: /sbin/init
    nvml_test.go:85: SystemGetHicVersion: 0
    nvml_test.go:86:   count: 0
    nvml_test.go:96: SystemGetTopologyGpuSet: 0
    nvml_test.go:97:   count: 1
    nvml_test.go:99:   device[0]: {0x7efd5b407bd8}
--- PASS: TestSystem (2.51s)
=== RUN   TestUnit
    nvml_test.go:112: UnitGetCount: 0
    nvml_test.go:113:   count: 0
    nvml_test.go:117: Skipping test with no Units.
--- SKIP: TestUnit (2.50s)
=== RUN   TestEventSet
    nvml_test.go:253: EventSetCreate: 0
    nvml_test.go:254:   set: {0x1bb1570}
    nvml_test.go:261: EventSetWait: 10
    nvml_test.go:262:   data: {{<nil>} 0 0 0 0}
    nvml_test.go:269: EventSet.Wait: 10
    nvml_test.go:270:   data: {{<nil>} 0 0 0 0}
    nvml_test.go:277: EventSetFree: 0
    nvml_test.go:285: EventSet.Free: 0
--- PASS: TestEventSet (2.52s)
PASS
coverage: 5.7% of statements
ok      github.com/NVIDIA/go-nvml/pkg/nvml      10.069s coverage: 5.7% of statements

I also tested with go1.20.8 and the tests all pass.

klueska commented 1 year ago

Thanks for taking the time to dig into this. I'm happy to merge this given your detailed explanation of the root cause and the minimal change required to fix things.

However, I strangely still don't see this behaviour on my system on a fresh checkout of main:

$ go version
go version go1.21.1 linux/amd64

... uncomment test in Makefile ...

$ make test
cp /home/kklues/go-nvml/gen/nvml/nvml.yml /home/kklues/go-nvml/pkg/nvml
c-for-go -out /home/kklues/go-nvml/pkg /home/kklues/go-nvml/pkg/nvml/nvml.yml
  processing /home/kklues/go-nvml/pkg/nvml/nvml.yml done.
cp /home/kklues/go-nvml/gen/nvml/*.go /home/kklues/go-nvml/pkg/nvml
cd /home/kklues/go-nvml/pkg/nvml; \
    go tool cgo -godefs types.go > types_gen.go; \
    go fmt types_gen.go; \
cd -> /dev/null
types_gen.go
rm -rf /home/kklues/go-nvml/pkg/nvml/nvml.yml /home/kklues/go-nvml/pkg/nvml/types.go /home/kklues/go-nvml/pkg/nvml/_obj
grep -l -R "// WARNING: This file has automatically been generated on" pkg \
    | xargs sed -i -E 's#// WARNING: This file has automatically been generated on.*$#// WARNING: THIS FILE WAS AUTOMATICALLY GENERATED.#g'
grep -l -RE "// (.*) nvml/nvml.h:[0-9]+" pkg \
    | xargs sed -i -E 's#// (.*) nvml/nvml.h:[0-9]+$#// \1 nvml/nvml.h#g'
GOOS=linux go build github.com/NVIDIA/go-nvml/pkg/...
go test -v -coverprofile=coverage.out github.com/NVIDIA/go-nvml/pkg/...
=== RUN   TestNew
=== PAUSE TestNew
=== RUN   TestOpenSuccess
=== PAUSE TestOpenSuccess
=== RUN   TestOpenFailed
=== PAUSE TestOpenFailed
=== RUN   TestOpenTwice
=== PAUSE TestOpenTwice
=== RUN   TestClose
=== PAUSE TestClose
=== RUN   TestLookupSuccess
=== PAUSE TestLookupSuccess
=== RUN   TestLookupFailed
=== PAUSE TestLookupFailed
=== CONT  TestNew
--- PASS: TestNew (0.00s)
=== CONT  TestOpenTwice
=== CONT  TestClose
--- PASS: TestOpenTwice (0.00s)
--- PASS: TestClose (0.00s)
=== CONT  TestLookupFailed
=== CONT  TestOpenFailed
=== CONT  TestLookupSuccess
=== CONT  TestOpenSuccess
--- PASS: TestLookupFailed (0.00s)
--- PASS: TestLookupSuccess (0.00s)
--- PASS: TestOpenSuccess (0.00s)
--- PASS: TestOpenFailed (0.00s)
PASS
coverage: 92.1% of statements
ok      github.com/NVIDIA/go-nvml/pkg/dl    0.005s  coverage: 92.1% of statements
=== RUN   TestInit
    nvml_test.go:26: Init: 0
    nvml_test.go:33: Shutdown: 0
--- PASS: TestInit (0.02s)
=== RUN   TestSystem
    nvml_test.go:45: SystemGetDriverVersion: 0
    nvml_test.go:46:   version: 525.85.12
    nvml_test.go:53: SystemGetNVMLVersion: 0
    nvml_test.go:54:   version: 12.525.85.12
    nvml_test.go:61: SystemGetCudaDriverVersion: 0
    nvml_test.go:62:   version: 12000
    nvml_test.go:69: SystemGetCudaDriverVersion_v2: 0
    nvml_test.go:70:   version: 12000
    nvml_test.go:77: SystemGetProcessName: 0
    nvml_test.go:78:   name: /sbin/init
    nvml_test.go:85: SystemGetHicVersion: 0
    nvml_test.go:86:   count: 0
    nvml_test.go:96: SystemGetTopologyGpuSet: 0
    nvml_test.go:97:   count: 0
--- PASS: TestSystem (0.22s)
=== RUN   TestUnit
    nvml_test.go:112: UnitGetCount: 0
    nvml_test.go:113:   count: 0
    nvml_test.go:117: Skipping test with no Units.
--- SKIP: TestUnit (0.02s)
=== RUN   TestEventSet
    nvml_test.go:253: EventSetCreate: 0
    nvml_test.go:254:   set: {0x223e9a0}
    nvml_test.go:261: EventSetWait: 10
    nvml_test.go:262:   data: {{<nil>} 0 0 0 0}
    nvml_test.go:269: EventSet.Wait: 10
    nvml_test.go:270:   data: {{<nil>} 0 0 0 0}
    nvml_test.go:277: EventSetFree: 0
    nvml_test.go:285: EventSet.Free: 0
--- PASS: TestEventSet (0.01s)
PASS
coverage: 5.6% of statements
ok      github.com/NVIDIA/go-nvml/pkg/nvml  0.285s  coverage: 5.6% of statements

klueska commented 1 year ago

Reading through your issue against the golang repo more closely, it's likely because my system has ld version 2.30:

$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.30

You mention that the bug surfaces with golang 1.21.x and ld > 2.38.

braydonk commented 1 year ago

Thank you for taking a look @klueska!