NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0
625 stars, 78 forks

LD_LIBRARY_PATH override in dockerfile causes failure in CI #336

Closed terrykong closed 1 month ago

terrykong commented 1 month ago

Describe the bug

Tracking issue: the CI infra does not work if the Docker image contains an override for LD_LIBRARY_PATH.

https://github.com/NVIDIA/NeMo-Aligner/pull/308

ko3n1g commented 1 month ago

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ System has unsupported display driver / cuda driver combination (error 803) ]]

ko3n1g commented 1 month ago

Refs:

@ko3n1g to try running on a node with drivers compatible with 24.07 (12.5.1)

ko3n1g commented 1 month ago

This thread suggests that we don't need the compat libcuda.so file: https://github.com/NVIDIA/nvidia-docker/issues/1256#issuecomment-620088349 (they saw the same issue we are seeing).

I don't think it's an issue between the PyT container and our host, since we were also running 24.02 successfully (slightly older than this 24.03, but I'd be surprised if that made a difference).

Also, we don't set LD_LIBRARY_PATH in NeMo, which is another hint that setting this env var might not be required.

Can you point me to an example where not setting this env var led to a crash?
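
One quick way to see whether an image's LD_LIBRARY_PATH pulls in a CUDA compat directory is to list the entries and flag any that contain a `compat` path component. This is only a sketch: the helper names are hypothetical, and it assumes the compat libraries live under a directory literally named `compat` (for example `/usr/local/cuda/compat`), which may not match every image.

```python
import os


def split_library_path(value):
    """Split an LD_LIBRARY_PATH-style string into its non-empty entries."""
    return [entry for entry in value.split(":") if entry]


def find_compat_entries(value, marker="compat"):
    """Return the entries that have a path component equal to `marker`.

    Assumption: compat libraries sit in a directory named `compat`.
    """
    return [
        entry
        for entry in split_library_path(value)
        if marker in entry.strip("/").split("/")
    ]


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH entries:", split_library_path(current))
    print("Entries that look like compat dirs:", find_compat_entries(current))
```

Running this inside the container and on the CI host shows whether the override is actually in effect in both places.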

terrykong commented 1 month ago

The 803 error turned out to be due to the infra drivers not being LTS (i.e., too new), so downgrading to LTS fixed the issue. The container appears to work with LD_LIBRARY_PATH set, so I think we can close.
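
For anyone hitting the same error later, a minimal sketch of a check that could be run inside the container: it loads the CUDA driver library with `ctypes` and calls `cuInit(0)`, which returns 803 (`CUDA_ERROR_SYSTEM_DRIVER_MISMATCH`) for the driver mismatch reported above. The function names are hypothetical, and the library name `libcuda.so.1` is an assumption about how the driver is exposed on Linux.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803  # the "error 803" from the log above


def try_cu_init(library_name="libcuda.so.1"):
    """Call cuInit(0) and return its result code, or None if the
    driver library cannot be loaded at all."""
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def describe(code):
    """Turn a cuInit result code into a short human-readable message."""
    if code is None:
        return "CUDA driver library could not be loaded"
    if code == CUDA_SUCCESS:
        return "CUDA initialized successfully"
    if code == CUDA_ERROR_SYSTEM_DRIVER_MISMATCH:
        return "error 803: display driver / CUDA driver mismatch"
    return f"cuInit failed with error {code}"


if __name__ == "__main__":
    print(describe(try_cu_init()))
```

Because the dynamic loader honors LD_LIBRARY_PATH, running this with and without the override shows which `libcuda.so.1` is picked up and whether it initializes.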