It looks like it's segfaulting on the actual underlying call to nvmlInit() in the C shared library.
What happens if you compile / run this simple C program:
$ cat main.c
#include <stdio.h>
typedef int nvmlReturn_t;
nvmlReturn_t nvmlInit_v2(void);
const char* nvmlErrorString(nvmlReturn_t result);
int main(int argc, char **argv)
{
nvmlReturn_t err = nvmlInit_v2();
printf("Return: %s\n", nvmlErrorString(err));
}
$ gcc -o main main.c -lnvidia-ml
$ ./main
Return: Success
This works without issues and seems to pick up the same library as the dynamic linker loads in the go version:
$ ldd main | grep nvidia
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f8e902a1000)
$ LD_DEBUG=all ./main-go |& grep -E 'object=.*libnvidia-ml' | uniq
18418: object=/usr/lib64/libnvidia-ml.so.1 [0]
That is strange indeed.
When I do the following, things work as expected on my machine:
$ go version
go version go1.15.3 linux/amd64
$ mkdir test
$ cd test
$ cat > main.go << EOF
package main
import (
"github.com/NVIDIA/go-nvml/pkg/nvml"
)
func main() {
nvml.Init()
}
EOF
$ go mod init test
go: creating new go.mod: module test
$ go mod vendor
go: finding module for package github.com/NVIDIA/go-nvml/pkg/nvml
go: downloading github.com/NVIDIA/go-nvml v0.11.1-0
go: found github.com/NVIDIA/go-nvml/pkg/nvml in github.com/NVIDIA/go-nvml v0.11.1-0
$ go build main.go
$ ./main
The only difference seems to be my driver version (460.91.03) and my OS (Ubuntu 20.04). Following the mod creation you describe, I get the same version v0.11.1-0.
My intended use case is extending nvidia_gpu_prometheus_exporter with some metrics we need but which are currently not part of its bindings, e.g. nvmlDeviceGetClockInfo. Surprisingly, Initialize seems to call nvmlInit_v2 without problems on my setup (as did the C example you proposed above).
Yeah, I'm not sure what could be causing this issue, even the following seems to work for me:
package main
import (
"github.com/mindprince/gonvml"
"github.com/NVIDIA/go-nvml/pkg/nvml"
)
func main() {
nvml.Init()
gonvml.Initialize()
}
I thought maybe there were some weird conflicts that might occur if you had both NVML implementations vendored in.
That said, if your goal is to extend this prometheus exporter to include more metrics, let me point you at an alternative: the "official" prometheus exporter developed by NVIDIA. It's based on the DCGM framework (rather than NVML) to gather / publish more comprehensive metrics than what you can get out of NVML alone.
https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html
It may have what you are looking for already.
Thanks a lot for this pointer. I was not aware of this exporter and at first sight it looks like it might indeed be what I am looking for. I really appreciated the simplicity of the other exporter (just a go get away), but I tried this one and got a segfault, too. But this might be better suited for an issue in the corresponding repo, right?
I still find it strange that you would be getting this segfault, but yes, if you plan to start leveraging the dcgm-exporter, then filing an issue there seems more appropriate. Please let us know if you ever get to the bottom of the issue.
I'm experiencing this issue too and did a bit of digging to figure out what is going wrong.
Firstly, I found that the SIGSEGV is due to the cgo wrapper function calling a null function pointer:
objdump --disassemble=_cgo_c9378bcb7609_Cfunc_nvmlInit_v2 gpu_test_bin
0000000000ce400c <_cgo_c9378bcb7609_Cfunc_nvmlInit_v2>:
ce400c: 55 push %rbp
ce400d: 48 89 e5 mov %rsp,%rbp
ce4010: 48 83 ec 30 sub $0x30,%rsp
ce4014: 48 89 7d d8 mov %rdi,-0x28(%rbp)
ce4018: 48 8b 45 d8 mov -0x28(%rbp),%rax
ce401c: 48 89 45 f8 mov %rax,-0x8(%rbp)
ce4020: e8 9b fd 78 ff callq 473dc0 <_cgo_topofstack>
ce4025: 48 89 45 f0 mov %rax,-0x10(%rbp)
ce4029: e8 d2 bf 31 ff callq 0 <runtime.tlsg>
ce402e: 89 45 ec mov %eax,-0x14(%rbp)
ce4031: e8 8a fd 78 ff callq 473dc0 <_cgo_topofstack>
ce4036: 48 2b 45 f0 sub -0x10(%rbp),%rax
ce403a: 48 01 45 f8 add %rax,-0x8(%rbp)
ce403e: 48 8b 45 f8 mov -0x8(%rbp),%rax
ce4042: 8b 55 ec mov -0x14(%rbp),%edx
ce4045: 89 10 mov %edx,(%rax)
ce4047: 90 nop
ce4048: c9 leaveq
ce4049: c3 retq
Note that at address ce4029 inside _cgo_c9378bcb7609_Cfunc_nvmlInit_v2, the callq targets address 0x0, i.e. it is calling a null function pointer. I believe this should be the address of the PLT entry for nvmlInit_v2?
To test this theory I created a simple cgo program which links against libm and calls the cos() function. Disassembling the cgo wrapper:
objdump --disassemble=_cgo_e71f5acfad90_Cfunc_cos gpu_test_bin
0000000000cdeabf <_cgo_e71f5acfad90_Cfunc_cos>:
cdeabf: 55 push %rbp
cdeac0: 48 89 e5 mov %rsp,%rbp
cdeac3: 48 83 ec 30 sub $0x30,%rsp
cdeac7: 48 89 7d d8 mov %rdi,-0x28(%rbp)
cdeacb: 48 8b 45 d8 mov -0x28(%rbp),%rax
cdeacf: 48 89 45 f8 mov %rax,-0x8(%rbp)
cdead3: e8 e8 52 79 ff callq 473dc0 <_cgo_topofstack>
cdead8: 48 89 45 f0 mov %rax,-0x10(%rbp)
cdeadc: 48 8b 45 f8 mov -0x8(%rbp),%rax
cdeae0: 48 8b 00 mov (%rax),%rax
cdeae3: 66 48 0f 6e c0 movq %rax,%xmm0
cdeae8: e8 33 fe 72 ff callq 40e920 <cos@plt>
cdeaed: 66 48 0f 7e c0 movq %xmm0,%rax
cdeaf2: 48 89 45 e8 mov %rax,-0x18(%rbp)
cdeaf6: e8 c5 52 79 ff callq 473dc0 <_cgo_topofstack>
cdeafb: 48 2b 45 f0 sub -0x10(%rbp),%rax
cdeaff: 48 01 45 f8 add %rax,-0x8(%rbp)
cdeb03: 48 8b 45 f8 mov -0x8(%rbp),%rax
cdeb07: f2 0f 10 45 e8 movsd -0x18(%rbp),%xmm0
cdeb0c: f2 0f 11 40 08 movsd %xmm0,0x8(%rax)
cdeb11: 90 nop
cdeb12: c9 leaveq
cdeb13: c3 retq
And we can see that the call at address cdeae8 does in fact go through the PLT as expected.
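For reference, the libm test program was along these lines (a minimal sketch, reconstructed since the exact source was not posted):

package main

/*
#cgo LDFLAGS: -lm
#include <math.h>
*/
import "C"

import "fmt"

func main() {
	// cos() is resolved normally at link time, so the generated cgo
	// wrapper calls through a proper PLT entry.
	fmt.Println(C.cos(1.0))
}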
So I believe the problem here is that the nvml cgo wrappers do not jump to the PLT entries for the nvml functions from libnvidia-ml.so. The next question is: which part of the cgo toolchain is failing to put the PLT entries within the cgo function wrappers?
Any ideas @klueska ?
Do you have a simple reproducer program with a set of build flags?
I will try to create a simple reproducer.
FYI: the environment I initially built in is a bit peculiar. In particular, it has LD_LIBRARY_PATH set, and not cleaning this properly leads to the errors I described above. With this default environment (LD_LIBRARY_PATH set) I can still reproduce the error, but with a clean environment I cannot. This also allowed me to build and run the DCGM exporter as proposed above.
Thanks again for the hint and sorry for the noise.
Are you able to post a full example @kthust ? We are building this in bazel with a custom toolchain, so I might have a bit of trouble trying to reproduce this in a minimal example.
The minimum example I mentioned in the previous post is the one I initially posted:
[exporter]$ cat main.go
...
Since I usually do not work with Go I needed a few tries along the lines of go mod init example/hello and go mod tidy to download the module again, since I seem to have removed it since the last time I tested. With that out of the way I just used the commands and minimum code example from the initial post.
We are using EasyBuild at our site and since we provide a Go module it was my easiest way to get access to a build environment (or so I thought). By default we load some standard modules, like GCCcore, zlib and binutils.
I just compared the working and the broken environment and LD_LIBRARY_PATH does not seem to be the culprit after all, but PATH does. If it contains <base_path>/GCCcore/11.2.0/bin before /bin/ when running go build main.go, then the resulting binary segfaults (independent of the PATH when it is run).
Are you able to post the output of go build -x for the broken build and then the working build so we can compare? It seems like there is an issue with one of the GCC versions being used.
I tested the broken and working build. Maybe the most relevant part is -extld=gcc, which will pick up a different gcc depending on the PATH. The two gcc versions are:
# broken
$ gcc --version | grep gcc
gcc (GCC) 11.2.0
# working
$ gcc --version | grep gcc
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-15)
The newer version is provided via EasyBuild; the older one comes from the normal system package repository.
@kthust I can confirm that we are on GCC 11.2.0 as well, so this appears to be tied to the GCC version. It is unclear whether this is an actual bug in GCC or an intentional behaviour change.
@klueska any chance you can try the build on GCC 11.2.0 and see if you can replicate the issue?
I have made a breakthrough. It appears that this issue is triggered when using the GNU gold linker.
Here is a simple reproducer:
package main
import "github.com/NVIDIA/go-nvml/pkg/nvml"
func main() {
nvml.Init()
}
go build -a --ldflags '-extldflags "-fuse-ld=gold"'
Running the resulting binary will segfault. I have verified this across multiple gcc and Go versions. The question now is: is this a bug in GNU gold, or is go-nvml relying on undocumented behaviour of GNU ld to function?
Does it also segfault with the gold linker on a C-based NVML program, e.g. the one I specified here: https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1024535333
Also, for reference, these are the LDFLAGS and CFLAGS that we set when you build against go-nvml: https://github.com/NVIDIA/go-nvml/blob/main/gen/nvml/nvml.yml#L35...L36
Can you try adding these flags to your go build command? Unfortunately, this is not something we can set in the library itself, but users have reported needing it to avoid unresolved symbol errors (even without the gold linker). These users weren't segfaulting, but maybe with the gold linker there is a different symptom of the same underlying issue:
go build -ldflags="-extldflags=-Wl,-z,lazy" <files>.go
Passing -z lazy to ld.gold didn't fix the issue:
go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,-z,lazy"'
Lazy binding is the default option, so I wouldn't expect passing it on the command line to make a difference:
ld.gold --help | grep lazy
-z lazy Mark object for lazy runtime binding (default)
Just to be clear, this issue is reproducible on a stock install of ubuntu 20/22 using this example: https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1470962928
I tested ld.gold using the latest binutils (2.40) and still got the same result where the binary segfaults.
I'm now trying to see if I can get a minimal C reproducer using ld.gold. For reference I was able to dump the temporary cgo build outputs before the compiler is invoked:
CGO_NO_SANITIZE_THREAD
void
_cgo_c813f6172e91_Cfunc_nvmlInit_v2(void *v)
{
struct {
nvmlReturn_t r;
char __pad4[4];
} __attribute__((__packed__, __gcc_struct__)) *_cgo_a = v;
char *_cgo_stktop = _cgo_topofstack();
__typeof__(_cgo_a->r) _cgo_r;
_cgo_tsan_acquire();
_cgo_r = nvmlInit_v2();
_cgo_tsan_release();
_cgo_a = (void*)((char*)_cgo_a + (_cgo_topofstack() - _cgo_stktop));
_cgo_a->r = _cgo_r;
_cgo_msan_write(&_cgo_a->r, sizeof(_cgo_a->r));
}
Which compiles to this object code before linking:
0000000000003e30 <_cgo_c813f6172e91_Cfunc_nvmlInit_v2>:
3e30: f3 0f 1e fa endbr64
3e34: 41 54 push %r12
3e36: 55 push %rbp
3e37: 53 push %rbx
3e38: 48 89 fb mov %rdi,%rbx
3e3b: e8 00 00 00 00 call 3e40 <_cgo_c813f6172e91_Cfunc_nvmlInit_v2+0x10>
3e40: 49 89 c4 mov %rax,%r12
3e43: e8 00 00 00 00 call 3e48 <_cgo_c813f6172e91_Cfunc_nvmlInit_v2+0x18>
3e48: 89 c5 mov %eax,%ebp
3e4a: e8 00 00 00 00 call 3e4f <_cgo_c813f6172e91_Cfunc_nvmlInit_v2+0x1f>
3e4f: 4c 29 e0 sub %r12,%rax
3e52: 89 2c 03 mov %ebp,(%rbx,%rax,1)
3e55: 5b pop %rbx
3e56: 5d pop %rbp
3e57: 41 5c pop %r12
3e59: c3 ret
Another breakthrough. I have found that gold and ld differ in how they handle unresolved symbols. When ld finds an unresolved symbol it will create a PLT entry for that symbol, whereas when gold finds an unresolved symbol it will simply set the jump address to 0x0 (as seen in the earlier disassembly), which breaks the way go-nvml does lazy binding with dlopen.
Luckily there is a solution with gold: if you pass --weak-unresolved-symbols to gold when linking with go-nvml, gold will create a PLT entry for unresolved symbols, and those PLT entries will be filled in by dlopen so everything works. --weak-unresolved-symbols is only an option in gold and not in ld, so go-nvml needs to figure out whether it is linking with gold or ld and pass --unresolved-symbols=ignore-in-object-files for ld and --weak-unresolved-symbols for gold.
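To make the mechanism concrete, here is a stripped-down cgo sketch of the pattern go-nvml relies on (an illustration based on the flags discussed in this thread, not go-nvml's actual code):

package main

/*
#cgo LDFLAGS: -ldl -Wl,--unresolved-symbols=ignore-in-object-files
#include <dlfcn.h>
#include <stdio.h>

typedef int nvmlReturn_t;

// Declared but deliberately not linked against libnvidia-ml: GNU ld emits a
// lazily-bound PLT entry for this symbol, while gold leaves the call target at 0x0.
extern nvmlReturn_t nvmlInit_v2(void);

static int callInit(void) {
	// Loading the library with RTLD_GLOBAL makes its symbols available
	// for lazy PLT resolution on the first call below.
	if (dlopen("libnvidia-ml.so.1", RTLD_LAZY | RTLD_GLOBAL) == NULL) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return -1;
	}
	return nvmlInit_v2();
}
*/
import "C"

import "fmt"

func main() {
	fmt.Println("nvmlInit_v2 returned:", C.callInit())
}

Built with the default GNU ld this runs; built with -fuse-ld=gold (and without --weak-unresolved-symbols) the call to nvmlInit_v2 jumps to 0x0 and segfaults.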
Using the go reproducer we can verify that these command line options successfully create a binary with gold which doesn't segfault:
go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,--weak-unresolved-symbols"'
Thanks for getting to the bottom of this!
However I don’t see how this can be addressed at the go-nvml layer if this flag is specific to the gold linker (so we can’t just plop it into our static setting for LDFLAGS).
It would need to be applied by whoever is actually building an application that imports go-nvml and choosing to use the gold linker.
It might be possible to use go build constraints to change the LD flags based upon the linker being used? From the cgo docs: https://pkg.go.dev/cmd/cgo
CFLAGS, CPPFLAGS, CXXFLAGS, FFLAGS and LDFLAGS may be defined with pseudo #cgo directives within these comments to tweak the behavior of the C, C++ or Fortran compiler. Values defined in multiple directives are concatenated together. The directive can include a list of build constraints limiting its effect to systems satisfying one of the constraints (see https://golang.org/pkg/go/build/#hdr-Build_Constraints for details about the constraint syntax)
Here is where we could use build constraints to change ld flags: https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/nvml.go#L21
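For illustration, the constraints in a #cgo directive can only select on GOOS/GOARCH or user-supplied build tags; something like the following would only take effect if the user passed -tags gold themselves, since cgo has no way to detect which linker is in use (the "gold" tag here is hypothetical):

/*
#cgo LDFLAGS: -Wl,--unresolved-symbols=ignore-in-object-files
#cgo gold LDFLAGS: -Wl,--weak-unresolved-symbols
*/
import "C"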
We already do that for the unresolved symbols flag (indirectly through our use of c-for-go under the hood, which is why you see them turning up in a generated file).
These are set based on what I sent previously in the link for: https://github.com/NVIDIA/go-nvml/blob/main/gen/nvml/nvml.yml#L35...L36
I can’t imagine we could dynamically detect which linker is being used and apply different build constraints based on that though. I will take a look through the docs to see, but I’m not optimistic.
Thanks again for getting to the bottom of this. At the very least we know what flags to tell people to add themselves if they encounter this in the future.
If I get the chance I will also follow up with the binutils maintainers to see if this is expected behaviour of gold.
It would be really nice if we can handle this in go-nvml. go-nvml relies on specific behaviour of ld which breaks with gold, and users are not going to know that.
We have a dependency on go-nvml and we started hitting this segfault when we upgraded our build environment from Go 1.20 to 1.21. We don't use any linker flags in our build, just plain go build.
Is the guidance from https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1472853933 still currently the best known workaround? Any ideas on why the Go version upgrade to 1.21 would trigger this?
Can confirm that with go 1.21.1 the segfault is always triggered. This was not the case with go 1.20.8.
On a side note, I found this bug while trying to manually compile nvidia-container-toolkit. If compiled with Go 1.20.8 everything is OK; if compiled with 1.21 the following message appears when trying to run nvidia-ctk or nvidia-container-runtime:
nvidia-ctk: error while loading shared libraries: unexpected PLT reloc type 0x00
This is consistent with the earlier findings of elias-dbx about gold setting PLT entries to 0x0. No idea why it happens only with Go 1.21 though.
@davidepi I note that we have https://github.com/NVIDIA/nvidia-container-toolkit/issues/101 also opened against a tool that consumes go-nvml. In which environment (distribution) are you manually compiling the nvidia-container-toolkit?
I'd like to confirm that this segfault has started happening when using Go 1.21.1.
When the symbol lookup happens through dlopen and dlsym, the symbol is found. However, when calling the symbol through the generated bindings, all symbols result in 0x0.
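For reference, the dlopen/dlsym check was along these lines (a sketch; the exact code used may differ):

package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	lib := C.CString("libnvidia-ml.so.1")
	defer C.free(unsafe.Pointer(lib))
	handle := C.dlopen(lib, C.RTLD_LAZY|C.RTLD_GLOBAL)
	if handle == nil {
		panic(C.GoString(C.dlerror()))
	}
	sym := C.CString("nvmlInit_v2")
	defer C.free(unsafe.Pointer(sym))
	// dlsym finds the symbol, so the library itself is fine; only the
	// linker-generated call path in the bindings ends up at 0x0.
	fmt.Printf("nvmlInit_v2 via dlsym: %p\n", C.dlsym(handle, sym))
}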
I'm happy to provide any additional debug information as I am actively looking into this issue and trying to resolve it.
Another piece of info that might be useful:
We're not sure why yet, but on Ubuntu 20.04 this still seems to work with Go 1.21.1. However, when we run with Go 1.21.1 on Debian 11 or Ubuntu 22.04, we get the issue I mentioned above.
Also to clarify, this is happening on any symbol. I tried going into the library code and forcing it to run other random symbols, and every one I tried resulted in the invalid address 0x0 panic.
@elezar fresh installation of Gentoo, which is unsupported, but I see that braydonk confirmed that it happens also on Ubuntu and Debian.
As you noted in nvidia-container-toolkit, I believe this is the same problem as https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/17. Checking the date of that issue, it matches the release of Go 1.21 that happened a couple of days prior. However, I lack information about how that package was built, so I cannot be 100% sure.
I tried the same solution as in that post, downloading the already existing binaries, but it segfaults when running nvidia-ctk. With these precompiled binaries the problem is not when starting them without arguments but when generating the CDI spec. I just copied the compiled binaries, so I can guess it was still calling the system-provided go.
I was able to reproduce this without go-nvml at all. I opened an issue in the Go repo: https://github.com/golang/go/issues/63264
I found the reason for the issue in go1.21.x. You can see my investigation and eventual findings in the mentioned golang/go issue. I have opened a PR that seems to resolve this issue based on my findings.
With gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2), nvml.Init segfaults. These are the steps to reproduce and the setup used for the test: