It looks like it's segfaulting on the actual underlying call to nvmlInit() in the C shared library.
What happens if you compile / run this simple C program:
$ cat main.c
#include <stdio.h>
typedef int nvmlReturn_t;
nvmlReturn_t nvmlInit_v2(void);
const char* nvmlErrorString(nvmlReturn_t result);
int main(int argc, char **argv)
{
nvmlReturn_t err = nvmlInit_v2();
printf("Return: %s\n", nvmlErrorString(err));
}
$ gcc -o main main.c -lnvidia-ml
$ ./main
Return: Success
This works without issues and seems to pick up the same library as the dynamic linker loads in the go version:
$ ldd main | grep nvidia
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f8e902a1000)
$ LD_DEBUG=all ./main-go |& grep -E 'object=.*libnvidia-ml' | uniq
18418: object=/usr/lib64/libnvidia-ml.so.1 [0]
That is strange indeed.
When I do the following, things work as expected on my machine:
$ go version
go version go1.15.3 linux/amd64
$ mkdir test
$ cd test
$ cat > main.go << EOF
package main
import (
"github.com/NVIDIA/go-nvml/pkg/nvml"
)
func main() {
nvml.Init()
}
EOF
$ go mod init test
go: creating new go.mod: module test
$ go mod vendor
go: finding module for package github.com/NVIDIA/go-nvml/pkg/nvml
go: downloading github.com/NVIDIA/go-nvml v0.11.1-0
go: found github.com/NVIDIA/go-nvml/pkg/nvml in github.com/NVIDIA/go-nvml v0.11.1-0
$ go build main.go
$ ./main
The only difference seems to be my driver version (460.91.03) and my OS (Ubuntu 20.04). Following the mod creation you describe, I get the same version v0.11.1-0.
My intended use case is extending nvidia_gpu_prometheus_exporter with some metrics we need but which are currently not part of its bindings, e.g. nvmlDeviceGetClockInfo. Surprisingly, Initialize seems to call nvmlInit_v2 without problems on my setup (as did the C example you proposed above).
Yeah, I'm not sure what could be causing this issue, even the following seems to work for me:
package main
import (
"github.com/mindprince/gonvml"
"github.com/NVIDIA/go-nvml/pkg/nvml"
)
func main() {
nvml.Init()
gonvml.Initialize()
}
I thought maybe there were some weird conflicts that might occur if you had both NVML implementations vendored in.
That said, if your goal is to extend this prometheus exporter to include more metrics, let me point you at an alternative: the "official" prometheus exporter developed by NVIDIA. It's based on the DCGM framework (rather than NVML) to gather / publish more comprehensive metrics than what you can get out of NVML alone.
https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html
It may have what you are looking for already.
Thanks a lot for this pointer. I was not aware of this exporter and at first sight it looks like it might indeed be what I am looking for. I really appreciated the simplicity of the other exporter (just a go get away), but I tried this one and got a segfault, too. But this might be better suited for an issue in the corresponding repo, right?
I still find it strange that you would be getting this segfault, but yes, if you plan to start leveraging the dcgm-exporter, then filing an issue there seems more appropriate. Please let us know if you ever get to the bottom of the issue.
I'm experiencing this issue too and did a bit of digging to figure out what is going wrong.
Firstly, I found that the SIGSEGV is due to the cgo wrapper function calling a null function pointer:
objdump --disassemble=_cgo_c9378bcb7609_Cfunc_nvmlInit_v2 gpu_test_bin
0000000000ce400c <_cgo_c9378bcb7609_Cfunc_nvmlInit_v2>:
ce400c: 55 push %rbp
ce400d: 48 89 e5 mov %rsp,%rbp
ce4010: 48 83 ec 30 sub $0x30,%rsp
ce4014: 48 89 7d d8 mov %rdi,-0x28(%rbp)
ce4018: 48 8b 45 d8 mov -0x28(%rbp),%rax
ce401c: 48 89 45 f8 mov %rax,-0x8(%rbp)
ce4020: e8 9b fd 78 ff callq 473dc0 <_cgo_topofstack>
ce4025: 48 89 45 f0 mov %rax,-0x10(%rbp)
ce4029: e8 d2 bf 31 ff callq 0 <runtime.tlsg>
ce402e: 89 45 ec mov %eax,-0x14(%rbp)
ce4031: e8 8a fd 78 ff callq 473dc0 <_cgo_topofstack>
ce4036: 48 2b 45 f0 sub -0x10(%rbp),%rax
ce403a: 48 01 45 f8 add %rax,-0x8(%rbp)
ce403e: 48 8b 45 f8 mov -0x8(%rbp),%rax
ce4042: 8b 55 ec mov -0x14(%rbp),%edx
ce4045: 89 10 mov %edx,(%rax)
ce4047: 90 nop
ce4048: c9 leaveq
ce4049: c3 retq
Note that at address ce4029 inside _cgo_c9378bcb7609_Cfunc_nvmlInit_v2, the callq targets address 0x0, i.e. it is calling a null function pointer. I believe this should be the address of the PLT entry for nvmlInit_v2?
To test this theory I created a simple cgo program which links against libm and calls the cos() function. Disassembling the cgo wrapper:
objdump --disassemble=_cgo_e71f5acfad90_Cfunc_cos gpu_test_bin
0000000000cdeabf <_cgo_e71f5acfad90_Cfunc_cos>:
cdeabf: 55 push %rbp
cdeac0: 48 89 e5 mov %rsp,%rbp
cdeac3: 48 83 ec 30 sub $0x30,%rsp
cdeac7: 48 89 7d d8 mov %rdi,-0x28(%rbp)
cdeacb: 48 8b 45 d8 mov -0x28(%rbp),%rax
cdeacf: 48 89 45 f8 mov %rax,-0x8(%rbp)
cdead3: e8 e8 52 79 ff callq 473dc0 <_cgo_topofstack>
cdead8: 48 89 45 f0 mov %rax,-0x10(%rbp)
cdeadc: 48 8b 45 f8 mov -0x8(%rbp),%rax
cdeae0: 48 8b 00 mov (%rax),%rax
cdeae3: 66 48 0f 6e c0 movq %rax,%xmm0
cdeae8: e8 33 fe 72 ff callq 40e920 <cos@plt>
cdeaed: 66 48 0f 7e c0 movq %xmm0,%rax
cdeaf2: 48 89 45 e8 mov %rax,-0x18(%rbp)
cdeaf6: e8 c5 52 79 ff callq 473dc0 <_cgo_topofstack>
cdeafb: 48 2b 45 f0 sub -0x10(%rbp),%rax
cdeaff: 48 01 45 f8 add %rax,-0x8(%rbp)
cdeb03: 48 8b 45 f8 mov -0x8(%rbp),%rax
cdeb07: f2 0f 10 45 e8 movsd -0x18(%rbp),%xmm0
cdeb0c: f2 0f 11 40 08 movsd %xmm0,0x8(%rax)
cdeb11: 90 nop
cdeb12: c9 leaveq
cdeb13: c3 retq
And we can see that the call at address cdeae8 does in fact go through the PLT as expected.
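For reference, the libm test program was along these lines (a minimal sketch, reconstructed since the exact source was not posted):

package main

/*
#cgo LDFLAGS: -lm
#include <math.h>
*/
import "C"

import "fmt"

func main() {
	// cos() is resolved normally at link time, so the generated cgo
	// wrapper calls through a proper PLT entry.
	fmt.Println(C.cos(1.0))
}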
So I believe the problem here is that the nvml cgo wrappers do not jump to the PLT entries for the nvml functions from libnvidia-ml.so. The next question is: which part of the cgo toolchain is failing to put the PLT entries within the cgo function wrappers?
Any ideas @klueska ?
Do you have a simple reproducer program with a set of build flags?
I will try to create a simple reproducer.
FYI: the environment I initially built in is a bit peculiar. In particular, it has LD_LIBRARY_PATH set, and not cleaning this properly leads to the errors I described above. With this default environment (LD_LIBRARY_PATH set) I can still reproduce the error, but with a clean environment I cannot. This also allowed me to build and run the DCGM exporter as proposed above.
Thanks again for the hint and sorry for the noise.
Are you able to post a full example @kthust ? We are building this in bazel with a custom toolchain, so I might have a bit of trouble trying to reproduce this in a minimal example.
The minimum example I mentioned in the previous post is the one I initially posted:
[exporter]$ cat main.go
...
Since I usually do not work with Go I needed a few tries along the lines of go mod init example/hello and go mod tidy to download the module again, since I seem to have removed it since the last time I tested. With that out of the way I just used the commands and minimum code example from the initial post.
We are using EasyBuild at our site and since we provide a Go module it was my easiest way to get access to a build environment (or so I thought). By default we load some standard modules, like GCCcore, zlib and binutils.
I just compared the working and the broken environment and LD_LIBRARY_PATH does not seem to be the culprit after all, but PATH does. If it contains <base_path>/GCCcore/11.2.0/bin before /bin/ when running go build main.go, then the resulting binary segfaults (independent of the PATH when it is run).
Are you able to post the output of go build -x for the broken build and then the working build so we can compare? It seems like there is an issue with one of the GCC versions being used.
I tested the broken and working build. Maybe the most relevant part is -extld=gcc, which will pick up a different gcc depending on the PATH. The two gcc versions are:
# broken
$ gcc --version | grep gcc
gcc (GCC) 11.2.0
# working
$ gcc --version | grep gcc
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-15)
The newer version is provided via EasyBuild; the older one comes from the normal system package repository.
@kthust I can confirm that we are on GCC 11.2.0 as well, so this appears to be tied to the GCC version. It is unclear whether this is an actual bug in GCC or an intentional behaviour change.
@klueska any chance you can try the build on GCC 11.2.0 and see if you can replicate the issue?
I have made a breakthrough. It appears that this issue is triggered when using the GNU gold linker.
Here is a simple reproducer:
package main
import "github.com/NVIDIA/go-nvml/pkg/nvml"
func main() {
nvml.Init()
}
go build -a --ldflags '-extldflags "-fuse-ld=gold"'
Running the resulting binary will segfault. I have verified this across multiple gcc and Go versions. The question now is: is this a bug in GNU gold, or is go-nvml relying on undocumented behaviour of GNU ld to function?
Does it also segfault with the gold linker on a C-based NVML program, e.g. the one I specified here: https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1024535333
Also, for reference, these are the LDFLAGS and CFLAGS that we set when you build against go-nvml: https://github.com/NVIDIA/go-nvml/blob/main/gen/nvml/nvml.yml#L35...L36
Can you try adding these flags to your go build command? Unfortunately, this is not something we can set in the library itself, but users have reported needing it to avoid unresolved symbol errors (even without the gold linker). These users weren't segfaulting, but maybe with the gold linker there is a different symptom of the same underlying issue:
go build -ldflags="-extldflags=-Wl,-z,lazy" <files>.go
Passing -z lazy to ld.gold didn't fix the issue:
go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,-z,lazy"'
Lazy binding is the default option, so I wouldn't expect passing it on the command line to make a difference:
ld.gold --help | grep lazy
-z lazy Mark object for lazy runtime binding (default)
Just to be clear, this issue is reproducible on a stock install of ubuntu 20/22 using this example: https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1470962928
I tested ld.gold using the latest binutils (2.40) and still got the same result where the binary segfaults.
I'm now trying to see if I can get a minimal C reproducer using ld.gold. For reference I was able to dump the temporary cgo build outputs before the compiler is invoked:
CGO_NO_SANITIZE_THREAD
void
_cgo_c813f6172e91_Cfunc_nvmlInit_v2(void *v)
{
struct {
nvmlReturn_t r;
char __pad4[4];
} __attribute__((__packed__, __gcc_struct__)) *_cgo_a = v;
char *_cgo_stktop = _cgo_topofstack();
__typeof__(_cgo_a->r) _cgo_r;
_cgo_tsan_acquire();
_cgo_r = nvmlInit_v2();
_cgo_tsan_release();
_cgo_a = (void*)((char*)_cgo_a + (_cgo_topofstack() - _cgo_stktop));
_cgo_a->r = _cgo_r;
_cgo_msan_write(&_cgo_a->r, sizeof(_cgo_a->r));
}
Which compiles to this object code before linking:
0000000000003e30 <_cgo_c813f6172e91_Cfunc_nvmlInit_v2>:
3e30: f3 0f 1e fa endbr64
3e34: 41 54 push %r12
3e36: 55 push %rbp
3e37: 53 push %rbx
3e38: 48 89 fb mov %rdi,%rbx
3e3b: e8 00 00 00 00 call 3e40 <_cgo_c813f6172e91_Cfunc_nvmlInit_v2+0x10>
3e40: 49 89 c4 mov %rax,%r12
3e43: e8 00 00 00 00 call 3e48 <_cgo_c813f6172e91_Cfunc_nvmlInit_v2+0x18>
3e48: 89 c5 mov %eax,%ebp
3e4a: e8 00 00 00 00 call 3e4f <_cgo_c813f6172e91_Cfunc_nvmlInit_v2+0x1f>
3e4f: 4c 29 e0 sub %r12,%rax
3e52: 89 2c 03 mov %ebp,(%rbx,%rax,1)
3e55: 5b pop %rbx
3e56: 5d pop %rbp
3e57: 41 5c pop %r12
3e59: c3 ret
Another breakthrough. I have found that gold and ld differ in how they handle unresolved symbols. When ld finds an unresolved symbol it will create a PLT entry for that symbol, whereas when gold finds an unresolved symbol it will simply set the jump address to 0x0 (as seen in the earlier disassembly), which breaks the way go-nvml does lazy binding with dlopen.
Luckily there is a solution with gold: if you pass --weak-unresolved-symbols to gold when linking with go-nvml, gold will create a PLT entry for unresolved symbols, and those PLT entries will be filled in by dlopen so everything works. --weak-unresolved-symbols is only an option in gold and not in ld, so go-nvml needs to figure out whether it is linking with gold or ld and pass --unresolved-symbols=ignore-in-object-files for ld and --weak-unresolved-symbols for gold.
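To make the mechanism concrete, here is a stripped-down cgo sketch of the pattern go-nvml relies on (an illustration based on the flags discussed in this thread, not go-nvml's actual code):

package main

/*
#cgo LDFLAGS: -ldl -Wl,--unresolved-symbols=ignore-in-object-files
#include <dlfcn.h>
#include <stdio.h>

typedef int nvmlReturn_t;

// Declared but deliberately not linked against libnvidia-ml: GNU ld emits a
// lazily-bound PLT entry for this symbol, while gold leaves the call target at 0x0.
extern nvmlReturn_t nvmlInit_v2(void);

static int callInit(void) {
	// Loading the library with RTLD_GLOBAL makes its symbols available
	// for lazy PLT resolution on the first call below.
	if (dlopen("libnvidia-ml.so.1", RTLD_LAZY | RTLD_GLOBAL) == NULL) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return -1;
	}
	return nvmlInit_v2();
}
*/
import "C"

import "fmt"

func main() {
	fmt.Println("nvmlInit_v2 returned:", C.callInit())
}

Built with the default GNU ld this runs; built with -fuse-ld=gold (and without --weak-unresolved-symbols) the call to nvmlInit_v2 jumps to 0x0 and segfaults.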
Using the go reproducer we can verify that these command line options successfully create a binary with gold which doesn't segfault:
go build -a --ldflags '-extldflags "-fuse-ld=gold -Wl,--weak-unresolved-symbols"'
Thanks for getting to the bottom of this!
However I don’t see how this can be addressed at the go-nvml layer if this flag is specific to the gold linker (so we can’t just plop it into our static setting for LDFLAGS).
It would need to be applied by whoever is actually building an application that imports go-nvml and choosing to use the gold linker.
It might be possible to use go build constraints to change the LD flags based upon the linker being used? From the cgo docs: https://pkg.go.dev/cmd/cgo
CFLAGS, CPPFLAGS, CXXFLAGS, FFLAGS and LDFLAGS may be defined with pseudo #cgo directives within these comments to tweak the behavior of the C, C++ or Fortran compiler. Values defined in multiple directives are concatenated together. The directive can include a list of build constraints limiting its effect to systems satisfying one of the constraints (see https://golang.org/pkg/go/build/#hdr-Build_Constraints for details about the constraint syntax)
Here is where we could use build constraints to change ld flags: https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/nvml.go#L21
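For illustration, the constraints in a #cgo directive can only select on GOOS/GOARCH or user-supplied build tags; something like the following would only take effect if the user passed -tags gold themselves, since cgo has no way to detect which linker is in use (the "gold" tag here is hypothetical):

/*
#cgo LDFLAGS: -Wl,--unresolved-symbols=ignore-in-object-files
#cgo gold LDFLAGS: -Wl,--weak-unresolved-symbols
*/
import "C"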
We already do that for the unresolved symbols flag (indirectly through our use of c-for-go under the hood, which is why you see them turning up in a generated file).
These are set based on what I sent previously in the link for: https://github.com/NVIDIA/go-nvml/blob/main/gen/nvml/nvml.yml#L35...L36
I can’t imagine we could dynamically detect which linker is being used and apply different build constraints based on that though. I will take a look through the docs to see, but I’m not optimistic.
Thanks again for getting to the bottom of this. At the very least we know what flags to tell people to add themselves if they encounter this in the future.
If I get the chance I will also follow up with the binutils maintainers to see if this is expected behaviour of gold.
It would be really nice if we can handle this in go-nvml. go-nvml relies on specific behaviour of ld which breaks with gold, and users are not going to know that.
We have a dependency on go-nvml and we started hitting this segfault when we upgraded our build environment from Go 1.20 to 1.21. We don't use any linker flags in our build, just plain go build.
Is the guidance from https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1472853933 still currently the best known workaround? Any ideas on why the Go version upgrade to 1.21 would trigger this?
Can confirm that with go 1.21.1 the segfault is always triggered. This was not the case with go 1.20.8.
On a side note, I found this bug while trying to manually compile nvidia-container-toolkit. If compiled with Go 1.20.8 everything is OK; if compiled with 1.21 the following message appears when trying to run nvidia-ctk or nvidia-container-runtime:
nvidia-ctk: error while loading shared libraries: unexpected PLT reloc type 0x00
This is consistent with the earlier findings of elias-dbx about gold setting PLT entries to 0x0. No idea why it happens only with Go 1.21 though.
@davidepi I note that we have https://github.com/NVIDIA/nvidia-container-toolkit/issues/101 also opened against a tool that consumes go-nvml. In which environment (distribution) are you manually compiling the nvidia-container-toolkit?
I'd like to confirm that this segfault has started happening when using Go 1.21.1.
When the symbol lookup happens through dlopen and dlsym, the symbol is found. However, when calling the symbol through the generated bindings, all symbols result in 0x0.
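For reference, the dlopen/dlsym check was along these lines (a sketch; the exact code used may differ):

package main

/*
#cgo LDFLAGS: -ldl
#include <dlfcn.h>
#include <stdlib.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	lib := C.CString("libnvidia-ml.so.1")
	defer C.free(unsafe.Pointer(lib))
	handle := C.dlopen(lib, C.RTLD_LAZY|C.RTLD_GLOBAL)
	if handle == nil {
		panic(C.GoString(C.dlerror()))
	}
	sym := C.CString("nvmlInit_v2")
	defer C.free(unsafe.Pointer(sym))
	// dlsym finds the symbol, so the library itself is fine; only the
	// linker-generated call path in the bindings ends up at 0x0.
	fmt.Printf("nvmlInit_v2 via dlsym: %p\n", C.dlsym(handle, sym))
}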
I'm happy to provide any additional debug information as I am actively looking into this issue and trying to resolve it.
Another piece of info that might be useful:
We're not sure why yet, but on Ubuntu 20.04 this still seems to work with Go 1.21.1. However, when we run with Go 1.21.1 on Debian 11 or Ubuntu 22.04, we get the issue I mentioned above.
Also to clarify, this is happening on any symbol. I tried going into the library code and forcing it to run other random symbols, and every one I tried resulted in the invalid address 0x0 panic.
@elezar fresh installation of Gentoo, which is unsupported, but I see that braydonk confirmed that it happens also on Ubuntu and Debian.
As you noted in nvidia-container-toolkit, I believe this is the same problem as https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/17. Checking the date of that issue, it matches the release of Go 1.21 that happened a couple of days prior. However, I lack information about how that package was built, so I cannot be 100% sure.
I tried the same solution as in that post, downloading the already existing binaries, but it segfaults when running nvidia-ctk. With these precompiled binaries the problem is not when starting them without arguments but when generating the CDI spec. I just copied the compiled binaries, so I can guess it was still calling the system-provided go.
I was able to reproduce this without go-nvml at all. I opened an issue in the Go repo: https://github.com/golang/go/issues/63264
I found the reason for the issue in go1.21.x. You can see my investigation and eventual findings in the mentioned golang/go issue. I have opened a PR that seems to resolve this issue based on my findings.
With gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2), nvml.Init segfaults. These are the steps to reproduce and the setup used for the test: