terrykong closed this issue 1 month ago
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ System has unsupported display driver / cuda driver combination (error 803) ]]
Refs:
@ko3n1g to try running on a node with drivers compatible with 24.07 (12.5.1)
This thread suggests that we don't need the compat libcu.so file: https://github.com/NVIDIA/nvidia-docker/issues/1256#issuecomment-620088349 - they've seen the same issue we're seeing.
I don't think it's an issue between the PyT container and our host, since we were also running 24.02 successfully (slightly older than this 24.03, but I'd be surprised if that made a difference).
Also, we don't set LD_LIBRARY_PATH in NeMo, which is another hint that setting this env var might not be required.
Can you point me to an example where not setting this env var led to a crash?
The 803 error turned out to be due to the infra drivers not being LTS (i.e., too new), so downgrading to LTS fixed the issue. The container appears to work with LD_LIBRARY_PATH set, so I think we can close.
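For anyone hitting the same banner, here is a rough sketch of how one might check the raw driver result code from inside the container, independent of PyTorch. It assumes a Linux host where the driver library is loadable as `libcuda.so.1`; the helper names `describe_cu_result` and `probe_cuda_init` are made up for illustration, and the code table only covers the few `CUresult` values relevant to this thread.

```python
import ctypes

# Subset of CUresult values from the CUDA driver API (not exhaustive).
KNOWN_CU_RESULTS = {
    0: "CUDA_SUCCESS",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    804: "CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE",
}


def describe_cu_result(code):
    """Map a numeric CUresult to a readable name (hypothetical helper)."""
    return KNOWN_CU_RESULTS.get(code, f"unrecognized CUresult {code}")


def probe_cuda_init(lib_name="libcuda.so.1"):
    """Call cuInit(0) and return (code, description).

    Returns (None, reason) if the driver library cannot be loaded at all.
    The default library name is an assumption for Linux containers.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"
    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    code = lib.cuInit(0)
    return code, describe_cu_result(code)


if __name__ == "__main__":
    print(probe_cuda_init())
```

If this reports 803 both with and without the LD_LIBRARY_PATH override, that would point at the host driver rather than the container's environment, which matches what we ended up finding here.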
Describe the bug
Tracking issue for CI infra not working when the Docker image contains an override for LD_LIBRARY_PATH
https://github.com/NVIDIA/NeMo-Aligner/pull/308
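To see whether an image's override is pulling in compat libraries, something like the following could be run inside the container. This is only a sketch: the function name `find_compat_entries` is hypothetical, and treating any path with a `compat` directory component as a CUDA compat location is an assumption about how the image lays out its libraries, not something confirmed in this issue.

```python
import os


def find_compat_entries(ld_library_path=None, marker="compat"):
    """Return LD_LIBRARY_PATH entries that contain a `marker` directory component.

    If no value is passed, the current process environment is used.
    The default marker "compat" is an assumption about the image layout.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    # Skip empty segments produced by leading, trailing, or doubled colons.
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if marker in p.split("/")]


if __name__ == "__main__":
    for entry in find_compat_entries():
        print("compat-like entry on LD_LIBRARY_PATH:", entry)
```

An empty result means the override is not putting a compat-style directory on the search path, so the loader should be resolving the host-mounted driver library instead.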
Steps/Code to reproduce bug
Please list minimal steps or code snippet for us to be able to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Environment overview (please complete the following information)
docker pull & docker run commands used
Environment details
If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:
Additional context
Add any other context about the problem here. Example: GPU model