NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0

nvmlErrorString error #82

Closed: BrushXiaoMinGuo closed this issue 7 months ago

BrushXiaoMinGuo commented 11 months ago

I import nvml in my code, but when I run it I get this error:

(screenshot)

@klueska

elezar commented 11 months ago

Which version of go-nvml are you building against and how do you build your application?

BrushXiaoMinGuo commented 11 months ago

go-nvml is v0.12.0-1. (screenshot)

The build command is: go build main.go

BrushXiaoMinGuo commented 11 months ago

I also have this problem when I use nvidia-mig-parted:

(screenshot)

@elezar @klueska

elezar commented 11 months ago

Note that when building applications that use the bindings, one has to add the following go build flags:

go build -ldflags="-extldflags=-Wl,-z,lazy" <files>.go

as per https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1471594094 for example.
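
For example, a minimal program along these lines (a sketch, not taken from the repository) can be used to check the linkage:

    package main

    import (
        "fmt"
        "log"

        "github.com/NVIDIA/go-nvml/pkg/nvml"
    )

    func main() {
        ret := nvml.Init()
        if ret != nvml.SUCCESS {
            // Print the raw return code here rather than nvml.ErrorString,
            // in case the library itself failed to load.
            log.Fatalf("Unable to initialize NVML: return code %d", ret)
        }
        defer nvml.Shutdown()

        count, ret := nvml.DeviceGetCount()
        if ret != nvml.SUCCESS {
            log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
        }
        fmt.Printf("Found %d GPU device(s)\n", count)
    }

Built with the -ldflags above, it should run as long as libnvidia-ml.so.1 is available at runtime.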

Since you mentioned mig-parted, I will confirm that we're applying the correct flags there too.

BrushXiaoMinGuo commented 11 months ago

I tested this command, but I get the same error:

(screenshots)

@elezar

BrushXiaoMinGuo commented 11 months ago

My Go version is:

(screenshot)

elezar commented 11 months ago

@BrushXiaoMinGuo would you be able to test whether the behaviour persists with a newer Go version? We typically use at least 1.18 internally, and it may be related to the cgo implementation for older Go versions.

BrushXiaoMinGuo commented 11 months ago

@elezar I updated my Go version to 1.20.8 and tried again, but I get the same error:

(screenshot)

klueska commented 11 months ago

Just as a sanity check -- are you able to run nvidia-smi from whatever environment you are in here? Is libnvidia-ml.so.1 in your library path?

BrushXiaoMinGuo commented 11 months ago

Ignore my compiled program for now; I'll test nvidia-mig-manager first. I get the same error:

(screenshot)

Here is some information about my environment.

  1. This is a Kubernetes cluster with one GPU node.

  2. nvidia-smi runs on the GPU node. (screenshot)

  3. libnvidia-ml.so.1 is in my library path. (screenshot)

  4. The mig-manager pod is running. (screenshot)

  5. I use this YAML and image: https://github.com/NVIDIA/mig-parted/blob/main/deployments/gpu-operator/nvidia-mig-manager-example.yaml

  6. But when I exec into the mig-manager pod, I cannot run nvidia-smi. (screenshot)

  7. I think the volume mounts may be causing this, but I don't know why. (screenshot)

What else do I need to check? Thank you. @klueska @elezar

klueska commented 11 months ago

We currently don't support running the mig-manager outside of the GPU Operator, meaning these examples are likely out of date and probably need some tweaking to get them to work (though doing so is not recommended).

That said, I'm guessing the reason you are having issues is that the mig-manager you are starting doesn't have GPU support injected into it.

This is normally done either through a runtime class called nvidia or through making the nvidia runtime the default runtime in containerd. Which method are you using?
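
For reference, making nvidia the default runtime in containerd usually means something along these lines in /etc/containerd/config.toml (a sketch; the exact plugin sections depend on the containerd version):

    version = 2

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        runtime_type = "io.containerd.runc.v2"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
          BinaryName = "/usr/bin/nvidia-container-runtime"

With the runtime class approach you would instead keep the existing default and add a RuntimeClass named nvidia that the pods opt into.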

BrushXiaoMinGuo commented 11 months ago

I think I use the second one.

  1. I installed nvidia-container-runtime on my GPU node. (screenshot)

  2. I modified /etc/docker/daemon.json and set nvidia-container-runtime as the default runtime. (screenshot)

klueska commented 11 months ago

That looks like a docker daemon.json, not a containerd config.
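
i.e. something of this shape (an illustrative sketch), which configures docker only and is not seen by containerd's CRI plugin:

    {
      "default-runtime": "nvidia",
      "runtimes": {
        "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
        }
      }
    }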

BrushXiaoMinGuo commented 11 months ago

Yes, it's the docker daemon.json; docker will call containerd.
How can the mig-manager have GPU support injected into it? Could you give me some advice?

klueska commented 11 months ago

Unless you are using a very old version of Kubernetes or have explicitly selected docker to be your shim layer in Kubernetes, I would imagine that docker is not the container runtime you are using to launch containers in k8s (containerd is the default).

My question before still stands:

This is normally done either through a runtime class called nvidia or through making the nvidia runtime the default runtime in containerd. Which method are you using?

Only once I know the answer to this can I help you further.

jgivc commented 11 months ago

I faced the same problem: ./app: symbol lookup error: ./app: undefined symbol: nvmlErrorString. The error occurs in this code example:

ret := nvml.Init()
if ret != nvml.SUCCESS {
    log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
}

If nvml.Init() does not succeed, just don't call the nvml function nvml.ErrorString(ret).
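
That is, something along these lines (just a sketch):

ret := nvml.Init()
if ret != nvml.SUCCESS {
    // Avoid calling nvml.ErrorString here; log the raw return code instead.
    log.Fatalf("Unable to initialize NVML: return code %d", ret)
}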

elezar commented 11 months ago

@jgivc how are you building the application?

With regards to:

If nvml.Init() does not succeed, just don't call the nvml function nvml.ErrorString(ret).

This is not entirely true. At present, nvml.Init() also loads the dynamic library, and if that load fails the nvml.ErrorString symbol is invalid. If the underlying call to nvmlInit returns an error, calling nvml.ErrorString is valid.

We have a workaround for this in another set of libraries we use: https://github.com/NVIDIA/go-nvlib/blob/486ed3f0c8139174a97565985fd48664b3048ad6/pkg/nvml/nvml.go#L38-L69 and we may consider doing something similar here.
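
A rough sketch of that kind of guard (illustrative only; not the actual go-nvlib code) would be to call nvml.ErrorString only once the library is known to have loaded:

    package main

    import (
        "fmt"
        "log"

        "github.com/NVIDIA/go-nvml/pkg/nvml"
    )

    // nvmlLoaded records whether nvml.Init() succeeded, i.e. whether
    // libnvidia-ml.so.1 was found and its symbols can be resolved.
    var nvmlLoaded bool

    // errorString is a guarded replacement for nvml.ErrorString: it only calls
    // into the library once it is known to be loaded and otherwise falls back
    // to printing the raw return code.
    func errorString(ret nvml.Return) string {
        if nvmlLoaded {
            return nvml.ErrorString(ret)
        }
        return fmt.Sprintf("nvml return code %d (library not loaded)", ret)
    }

    func main() {
        ret := nvml.Init()
        nvmlLoaded = ret == nvml.SUCCESS
        if ret != nvml.SUCCESS {
            log.Fatalf("Unable to initialize NVML: %v", errorString(ret))
        }
        defer nvml.Shutdown()
    }

This is conservative: if nvml.Init() fails for a reason other than a missing library, nvml.ErrorString would still be usable, but it is safer not to rely on that.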

jgivc commented 11 months ago

how are you building the application?

Just go build. It works fine on my server, but on my computer that library does not exist and the example code gave me an error.

jgivc commented 10 months ago

Not quite on topic, but also a problem with 'undefined symbol'. My application was working fine for several days and suddenly crashed with the error "undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3", although before that there were several successful calls to nvml library methods. Is it possible to somehow intercept such errors so that the application does not crash?