docker / genai-stack

Langchain + Docker + Neo4j + Ollama
Creative Commons Zero v1.0 Universal

Unable to Find libnvidia-ml.so.1 When Using "docker compose linux-gpu up" #95

Open medined opened 1 year ago

medined commented 1 year ago

Here is the result of my command. Is this error inside the container or outside? The weird part to me is:

genai-stack-pull-model-1 | pulling ollama model llama2 using http://llm-gpu:11434

The docs told me to add that URL to the .env file. However, I certainly don't have a server running there.
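For reference, the entry in question is just a compose-internal URL, not an external server; a sketch of what the .env line looks like (variable name per the project's README, worth double-checking against your copy):

```env
# Sketch of the relevant .env entry. "llm-gpu" is the compose service
# name, which Docker's embedded DNS resolves between containers; it is
# not reachable from outside the compose network by that name.
OLLAMA_BASE_URL=http://llm-gpu:11434
```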

$ docker compose --profile linux-gpu up
WARN[0000] The "LANGCHAIN_PROJECT" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_API_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_ACCESS_KEY_ID" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_SECRET_ACCESS_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_DEFAULT_REGION" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_PROJECT" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_API_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_ACCESS_KEY_ID" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_SECRET_ACCESS_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_DEFAULT_REGION" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_PROJECT" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_API_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_ACCESS_KEY_ID" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_SECRET_ACCESS_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_DEFAULT_REGION" variable is not set. Defaulting to a blank string. 
WARN[0000] The "OPENAI_API_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_PROJECT" variable is not set. Defaulting to a blank string. 
WARN[0000] The "LANGCHAIN_API_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_ACCESS_KEY_ID" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_SECRET_ACCESS_KEY" variable is not set. Defaulting to a blank string. 
WARN[0000] The "AWS_DEFAULT_REGION" variable is not set. Defaulting to a blank string. 
[+] Running 4/4
 ✔ llm-gpu 3 layers [⣿⣿⣿]      0B/0B      Pulled                                                                                                        1.3s 
   ✔ aece8493d397 Already exists                                                                                                                        0.0s 
   ✔ 3b9196308e0f Already exists                                                                                                                        0.0s 
   ✔ e75cbce7870b Already exists                                                                                                                        0.0s 
[+] Building 0.0s (0/0)                                                                                                                 docker:desktop-linux
[+] Running 8/8
 ✔ Container genai-stack-llm-gpu-1     Created                                                                                                          0.0s 
 ✔ Container genai-stack-database-1    Running                                                                                                          0.0s 
 ✔ Container genai-stack-pull-model-1  Recreated                                                                                                        0.1s 
 ✔ Container genai-stack-api-1         Recreated                                                                                                        0.1s 
 ✔ Container genai-stack-bot-1         Recreated                                                                                                        0.1s 
 ✔ Container genai-stack-pdf_bot-1     Recreated                                                                                                        0.1s 
 ✔ Container genai-stack-loader-1      Recreated                                                                                                        0.1s 
 ✔ Container genai-stack-front-end-1   Recreated                                                                                                        0.1s 
Attaching to genai-stack-api-1, genai-stack-bot-1, genai-stack-database-1, genai-stack-front-end-1, genai-stack-llm-gpu-1, genai-stack-loader-1, genai-stack-pdf_bot-1, genai-stack-pull-model-1
genai-stack-pull-model-1  | pulling ollama model llama2 using http://llm-gpu:11434
genai-stack-pull-model-1  | Error: Head "http://llm-gpu:11434/": dial tcp 172.18.0.4:11434: connect: no route to host
genai-stack-pull-model-1 exited with code 1
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
matthieuml commented 1 year ago

The docs told me to add that URL to the .env file. However, I certainly don't have a server running there.

If the container genai-stack-llm-gpu-1 is running, then you have a server running at http://llm-gpu:11434/ internally to Docker.
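Compose service names resolve between containers via Docker's embedded DNS, so "llm-gpu" is a valid hostname inside the stack. A quick, environment-dependent way to check this (a sketch, assuming the stack is up, the service names from the log above, and that curl is available inside the container):

```shell
# From inside another service's container, the hostname "llm-gpu"
# resolves on the shared compose network and hits the Ollama server:
docker compose exec api curl -sf http://llm-gpu:11434/
# Ollama's root endpoint typically replies "Ollama is running".
```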

What seems to be the issue here is your NVIDIA runtime integration with Docker.

Are you able to run this command successfully?

docker run -it --rm --gpus all ubuntu nvidia-smi

If not, try reinstalling Docker.
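If that test container cannot see the GPU, the usual fix on apt-based Linux hosts is to (re)install the NVIDIA Container Toolkit and register its runtime with Docker. A sketch following NVIDIA's documented steps (assumes the NVIDIA apt repository is already configured and requires root; details vary by distro):

```shell
# Install the container toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with the Docker daemon and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access from inside a container
docker run --rm --gpus all ubuntu nvidia-smi
```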

Toparvion commented 10 months ago

@matthieuml, I've faced the same issue and tried the command you proposed. The resulting error is the same as the one I see when running the GenAI stack with --profile linux-gpu, namely:

#…
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown. 

I have also followed your advice from #62 (installed nvidia-container-toolkit) but nothing has changed.

The main hint here seems to be that I run the stack in Docker Desktop 4.26.1 (on Ubuntu 23.10). nvidia-smi displays the following about the GPU:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   47C    P8              11W /  55W |    628MiB /  6144MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Some reported issues that I've found on the net so far suggest using Docker CE instead of Docker Desktop. But that seems contrary to what the GenAI stack promotes: an easy and developer-friendly way to build LLM-powered applications.

Is there another way to resolve the issue?

matthieuml commented 9 months ago

After looking around a bit, it seems that nvidia-container-toolkit needs docker-ce installed as root to work (which isn't the case with Docker Desktop?).
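That would fit the error: the NVIDIA hook needs the driver's management library in the environment where the Docker daemon runs, and Docker Desktop runs the daemon inside a utility VM that has no host driver libraries. On a Linux host with the driver installed, the library the hook is looking for should be visible in the linker cache (a host-dependent diagnostic sketch):

```shell
# On a host with the NVIDIA driver installed, the linker cache should
# list the library the container hook fails to load:
ldconfig -p | grep libnvidia-ml
# Expected on a working host: an entry pointing at libnvidia-ml.so.1;
# no output here means the daemon's environment cannot see the driver.
```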

The obvious way to resolve this issue would be to use docker-ce installed as root or even podman as an alternative. The Docker CLI is well documented and in combination with docker-compose you can deploy the stack quite easily.

However, if you want to keep a developer-friendly UI, maybe you could use portainer-ce in combination with docker-ce as root?
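For reference, Portainer CE itself runs as a container on top of docker-ce; its documented quick-start is roughly the following (image tag and published port per Portainer's docs, worth re-checking for the current release):

```shell
# Create a volume for Portainer's data, then run the UI container,
# giving it access to the local Docker socket:
docker volume create portainer_data
docker run -d -p 9443:9443 --name portainer --restart=always \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v portainer_data:/data \
  portainer/portainer-ce:latest
```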

Toparvion commented 9 months ago

@matthieuml ,

which isn't the case with Docker Desktop?

Yes, this seems to be the root cause.

Ok, I'll switch to Docker CE.

Perhaps, it's worth adding a note about Docker Desktop incompatibility with linux-gpu profile to the README.md as well as mentioning the necessity to install nvidia-container-toolkit.

Thank you!

AnerGcorp commented 3 months ago

The issue seems to be in a system-level package.

I am using a cog.yaml file to install system dependencies. I have tried different versions of the NVIDIA drivers and CUDA, and I am getting the same error. Is there any way to install that package at the system level?

suveerudayashankara commented 1 week ago

@Toparvion, were you able to run the Docker stack? I'm having the same issue, so could you help me?

AnerGcorp commented 1 week ago

@suveerudayashankara Can you please share your host machine's operating system and your Docker and Docker Engine versions?

AnerGcorp commented 1 week ago

@suveerudayashankara If you are using Linux, please also install the Docker support packages for the NVIDIA drivers; I hope that solves your issue.

Toparvion commented 1 week ago

@Toparvion, were you able to run the Docker stack? I'm having the same issue, so could you help me?

@suveerudayashankara , yes, I followed the above advice to switch from Docker Desktop to Docker CE (+Portainer) and it worked for me.