Closed braydonk closed 1 year ago
Thanks for taking the time to dig into this. I'm happy to merge this given your detailed explanation of the root cause and the minimal change required to fix things.
However, I strangely still don't see this behaviour on my system on a fresh checkout of main
:
$ go version
go version go1.21.1 linux/amd64
... uncomment test in Makefile ...
$ make test
cp /home/kklues/go-nvml/gen/nvml/nvml.yml /home/kklues/go-nvml/pkg/nvml
c-for-go -out /home/kklues/go-nvml/pkg /home/kklues/go-nvml/pkg/nvml/nvml.yml
processing /home/kklues/go-nvml/pkg/nvml/nvml.yml done.
cp /home/kklues/go-nvml/gen/nvml/*.go /home/kklues/go-nvml/pkg/nvml
cd /home/kklues/go-nvml/pkg/nvml; \
go tool cgo -godefs types.go > types_gen.go; \
go fmt types_gen.go; \
cd -> /dev/null
types_gen.go
rm -rf /home/kklues/go-nvml/pkg/nvml/nvml.yml /home/kklues/go-nvml/pkg/nvml/types.go /home/kklues/go-nvml/pkg/nvml/_obj
grep -l -R "// WARNING: This file has automatically been generated on" pkg \
| xargs sed -i -E 's#// WARNING: This file has automatically been generated on.*$#// WARNING: THIS FILE WAS AUTOMATICALLY GENERATED.#g'
grep -l -RE "// (.*) nvml/nvml.h:[0-9]+" pkg \
| xargs sed -i -E 's#// (.*) nvml/nvml.h:[0-9]+$#// \1 nvml/nvml.h#g'
GOOS=linux go build github.com/NVIDIA/go-nvml/pkg/...
go test -v -coverprofile=coverage.out github.com/NVIDIA/go-nvml/pkg/...
=== RUN TestNew
=== PAUSE TestNew
=== RUN TestOpenSuccess
=== PAUSE TestOpenSuccess
=== RUN TestOpenFailed
=== PAUSE TestOpenFailed
=== RUN TestOpenTwice
=== PAUSE TestOpenTwice
=== RUN TestClose
=== PAUSE TestClose
=== RUN TestLookupSuccess
=== PAUSE TestLookupSuccess
=== RUN TestLookupFailed
=== PAUSE TestLookupFailed
=== CONT TestNew
--- PASS: TestNew (0.00s)
=== CONT TestOpenTwice
=== CONT TestClose
--- PASS: TestOpenTwice (0.00s)
--- PASS: TestClose (0.00s)
=== CONT TestLookupFailed
=== CONT TestOpenFailed
=== CONT TestLookupSuccess
=== CONT TestOpenSuccess
--- PASS: TestLookupFailed (0.00s)
--- PASS: TestLookupSuccess (0.00s)
--- PASS: TestOpenSuccess (0.00s)
--- PASS: TestOpenFailed (0.00s)
PASS
coverage: 92.1% of statements
ok github.com/NVIDIA/go-nvml/pkg/dl 0.005s coverage: 92.1% of statements
=== RUN TestInit
nvml_test.go:26: Init: 0
nvml_test.go:33: Shutdown: 0
--- PASS: TestInit (0.02s)
=== RUN TestSystem
nvml_test.go:45: SystemGetDriverVersion: 0
nvml_test.go:46: version: 525.85.12
nvml_test.go:53: SystemGetNVMLVersion: 0
nvml_test.go:54: version: 12.525.85.12
nvml_test.go:61: SystemGetCudaDriverVersion: 0
nvml_test.go:62: version: 12000
nvml_test.go:69: SystemGetCudaDriverVersion_v2: 0
nvml_test.go:70: version: 12000
nvml_test.go:77: SystemGetProcessName: 0
nvml_test.go:78: name: /sbin/init
nvml_test.go:85: SystemGetHicVersion: 0
nvml_test.go:86: count: 0
nvml_test.go:96: SystemGetTopologyGpuSet: 0
nvml_test.go:97: count: 0
--- PASS: TestSystem (0.22s)
=== RUN TestUnit
nvml_test.go:112: UnitGetCount: 0
nvml_test.go:113: count: 0
nvml_test.go:117: Skipping test with no Units.
--- SKIP: TestUnit (0.02s)
=== RUN TestEventSet
nvml_test.go:253: EventSetCreate: 0
nvml_test.go:254: set: {0x223e9a0}
nvml_test.go:261: EventSetWait: 10
nvml_test.go:262: data: {{<nil>} 0 0 0 0}
nvml_test.go:269: EventSet.Wait: 10
nvml_test.go:270: data: {{<nil>} 0 0 0 0}
nvml_test.go:277: EventSetFree: 0
nvml_test.go:285: EventSet.Free: 0
--- PASS: TestEventSet (0.01s)
PASS
coverage: 5.6% of statements
ok github.com/NVIDIA/go-nvml/pkg/nvml 0.285s coverage: 5.6% of statements
Reading through your issue against the golang repo more closely, it's likely because my system has ld
version 2.30:
$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.30
You mention that the bug surfaces with golang
1.21.x and ld
> 2.38.
Thank you for taking a look @klueska!
Closes #36
Investigation on the Go side https://github.com/golang/go/issues/63264
When built with
go1.21.x
,go-nvml
panics. This is due to a change in howgo tool link
executes the external linker. Previously,-rdynamic
was always passed to the external linker, however this was changed to only happen in certain circumstances, such as when linking to previously built CGO shared objects, or when the flag to export specific dynamic symbols isn't supported. The justification for the change is in https://github.com/golang/go/issues/53579.The resolution I propose for
go-nvml
is to pass the--export-dynamic
flag intoLDFLAGS
explicitly. This LDFLAG is required because if it is not present, the symbols fromnvml.h
are not added to the PLT, meaning each call to a symbol fromnvml.h
will have an address of0x0
instead of a PLT offset (at least in ELF AMD64, though I assume some equivalent occurs in other setups).Passing this flag explicitly means whether it's built with a Go version that always applies
-rdynamic
or not, this required flag will be present.I tested this PR by uncommenting the commented out test command from
make test
. On the main branch, it passes alldl
tests, then panics.On this branch, all tests pass.
I also tested with
go1.20.8
and the tests all pass.