Closed BrushXiaoMinGuo closed 7 months ago
Which version of go-nvml are you building against and how do you build your application?
go-nvml is v0.12.0-1.
The build command is: go build main.go
I also have this problem when I use nvidia-mig-parted.
@elezar @klueska
Note that when building applications that use the bindings, one has to add the following go build flags:
go build -ldflags="-extldflags=-Wl,-z,lazy" <files>.go
as per https://github.com/NVIDIA/go-nvml/issues/36#issuecomment-1471594094 for example.
Since you mentioned mig-parted, I will confirm that we're applying the correct flags there too.
I tested this command, but I get the same error.
@elezar
my go version is
@BrushXiaoMinGuo would you be able to test whether the behaviour persists with a newer golang version? We typically use at least 1.18 internally, and it may be related to the cgo implementation in older golang versions.
@elezar I updated my Go version to 1.20.8 and tried again, but I get the same error.
Just as a sanity check: are you able to run nvidia-smi from whatever environment you are in here? Is libnvidia-ml.so.1 in your library path?
Ignore my compiled program for now. I tested nvidia-mig-manager first and I get the same error, so it is something about my environment.
This is a k8s cluster in which one node is a GPU node.
nvidia-smi runs on this GPU node.
libnvidia-ml.so.1 is in my library path.
The mig-manager pod is running.
I use this yaml and image: https://github.com/NVIDIA/mig-parted/blob/main/deployments/gpu-operator/nvidia-mig-manager-example.yaml
But when I exec into the mig-manager pod, I cannot run nvidia-smi.
I think a volume mount may have caused this, but I don't know why.
What else do I need to check? Thank you. @klueska @elezar
We currently don't support running the mig-manager outside of the GPU Operator. Meaning these examples are likely out of date and probably need some tweaking to get them to work (though it's not recommended).
That said, I'm guessing the reason you are having issues is that the mig-manager you are starting doesn't have GPU support injected into it.
This is normally done either through a runtime class called nvidia or through making the nvidia runtime the default runtime in containerd. Which method are you using?
I think I use the second one.
I installed nvidia-container-runtime on my GPU node,
modified /etc/docker/daemon.json, and set nvidia-container-runtime as the default runtime.
That looks like a docker daemon.json, not a containerd config.
Yes, it's the docker daemon.json. Docker will call containerd.
How can the mig-manager have GPU support injected into it? Could you give me some advice?
Unless you are using a very old version of kubernetes or have explicitly selected docker to be your shim layer in kubernetes, I would imagine that docker is not the container runtime you are using to launch containers in k8s (containerd is the default).
My question before still stands:
This is normally done either through a runtime class called nvidia or through making the nvidia runtime the default runtime in containerd. Which method are you using?
Only once I know the answer to this can I help you further.
I faced the same problem: ./app: symbol lookup error: ./app: undefined symbol: nvmlErrorString
The error occurs with the code example:
ret := nvml.Init()
if ret != nvml.SUCCESS {
log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
}
If nvml.Init() does not succeed, the code should just not call the nvml function nvml.ErrorString(ret).
@jgivc how are you building the application?
With regards to:
If nvml.Init() does not succeed, the code should just not call the nvml function nvml.ErrorString(ret).
This is not entirely true. At present, nvml.Init() also loads the dynamic library, and if this fails the symbol is invalid. If the underlying call to nvmlInit returns an error, calling nvml.ErrorString is valid.
We have a workaround for this in another set of libraries we use: https://github.com/NVIDIA/go-nvlib/blob/486ed3f0c8139174a97565985fd48664b3048ad6/pkg/nvml/nvml.go#L38-L69 and we may consider doing something similar here.
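To illustrate the general idea only (this is not the go-nvlib code linked above, and not the go-nvml API), here is a rough sketch of guarding the error-string lookup so that it never touches the shared library unless the library was actually loaded. The Return type, the libraryLoaded flag, the libErrorString stand-in, and the fallback table are all hypothetical names invented for this example; the numeric codes are intended to mirror NVML's nvmlReturn_t values.

```go
package main

import "fmt"

// Return is a stand-in for an NVML return code in this sketch.
type Return int

// A few return codes, named in the style the bindings use.
const (
	SUCCESS                 Return = 0
	ERROR_UNINITIALIZED     Return = 1
	ERROR_LIBRARY_NOT_FOUND Return = 12
)

// libraryLoaded records whether the shared library was opened successfully.
// In a real program, whatever loads libnvidia-ml.so.1 would set this.
var libraryLoaded bool

// libErrorString stands in for the cgo-backed call, which is only safe
// to make once the library has been loaded. Hypothetical placeholder.
var libErrorString = func(r Return) string {
	return fmt.Sprintf("library message for code %d", int(r))
}

// fallback holds pure-Go strings that need no symbol lookup.
var fallback = map[Return]string{
	SUCCESS:                 "SUCCESS",
	ERROR_UNINITIALIZED:     "ERROR_UNINITIALIZED",
	ERROR_LIBRARY_NOT_FOUND: "ERROR_LIBRARY_NOT_FOUND",
}

// ErrorString uses the library only when it is loaded and otherwise
// falls back to the local table, so a failed load cannot trigger an
// undefined-symbol error from this call.
func ErrorString(r Return) string {
	if libraryLoaded {
		return libErrorString(r)
	}
	if s, ok := fallback[r]; ok {
		return s
	}
	return fmt.Sprintf("unknown return value: %d", int(r))
}

func main() {
	// With the library missing, the message comes from the fallback table.
	fmt.Println(ErrorString(ERROR_LIBRARY_NOT_FOUND))
}
```

The design choice here is simply to keep the failure path free of any call into the library, which is the situation reported above on a machine where the library does not exist.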
how are you building the application?
Just go build. It works fine on my server, but on my computer that library does not exist, and the example code gave me an error.
Not quite on topic, but also a problem with 'undefined symbol'. My application was working fine for several days and suddenly crashed with the error "undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3", although before that there were several successful calls to nvml library methods. Is it possible to somehow intercept such errors so that the application does not crash?
I import nvml in my code, but when I run the code I get this error.
@klueska