Closed arealclimber closed 1 month ago
Cannot run NVIDIA CUDA on host 146
sudo docker compose up -d
[+] Running 2/3
✔ Container isunfa-postgres-1 Running 0.0s
✔ Container isunfa-qdrant-1 Running 0.0s
⠙ Container isunfa-ollama-1 Starting 0.2s
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
nvidia-docker --version
nvidia-docker: command not found
shirley@cafeca-dev-003:~/workspace/ServerSwarm/isunfa$ docker compose up -d
[+] Running 8/9
✔ Network isunfa_default Created 0.1s
✔ Container isunfa-qdrant-1 Started 0.3s
⠼ Container isunfa-ollama-1 Starting 0.3s
✔ Container isunfa-postgres-1 Started 0.3s
✔ Container isunfa-aich-1 Created 0.0s
✔ Container isunfa-faith-1 Created 0.0s
✔ Container isunfa-isunfa-1 Created 0.0s
✔ Container isunfa-nginx-1 Created 0.0s
✔ Container ofelia Created 0.0s
Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]
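The two failures above point at different layers: the first (`nvml error: driver not loaded`) suggests the host kernel driver is not loaded, while the second (`could not select device driver "nvidia"`) usually means Docker cannot find the NVIDIA container tooling. A minimal sketch of telling them apart from the error text follows; `classify_gpu_error` and its labels are hypothetical names introduced for illustration, and the match patterns are copied from the messages quoted above.

```shell
# Hypothetical helper: map a Docker GPU startup error to a likely cause.
# The patterns are taken verbatim from the two error messages in this issue.
classify_gpu_error() {
  local msg="$1"
  case "$msg" in
    *"nvml error: driver not loaded"*)
      # nvidia-container-cli could not reach the host driver.
      echo "host-driver-not-loaded" ;;
    *"could not select device driver \"nvidia\""*)
      # Docker has no usable NVIDIA GPU device driver registered.
      echo "container-toolkit-missing" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage (assumes the error text was captured into a variable):
#   classify_gpu_error "$err"
```

The mapping is a likely diagnosis, not a guarantee; other misconfigurations may produce similar text.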
[x] Installation steps to verify before starting Docker containers that use a GPU
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
docker compose down
docker compose up -d
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
# Check the driver status
nvidia-smi
# Pull the latest CUDA image: https://hub.docker.com/r/nvidia/cuda/tags
docker pull nvidia/cuda:12.6.2-cudnn-devel-ubi9
# Confirm the CUDA image tag
docker run --rm --gpus all nvidia/cuda:12.6.2-cudnn-devel-ubi9 nvidia-smi
# Check the kernel modules; no output means the driver is not loaded
lsmod | grep nvidia
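The module check can be turned into a yes/no test for scripts. This is a sketch, with `nvidia_module_loaded` as a hypothetical name: it reads `lsmod` output on stdin and only accepts module names that start with `nvidia`, since a plain `grep nvidia` may also match unrelated modules such as `i2c_nvidia_gpu`.

```shell
# Hypothetical helper: succeed if `lsmod` output (on stdin) lists a module
# whose name starts with "nvidia" (nvidia, nvidia_uvm, nvidia_drm, ...).
nvidia_module_loaded() {
  awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

# Usage:
#   lsmod | nvidia_module_loaded && echo "driver loaded" || echo "driver NOT loaded"
```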
# Look up the GPU model
lspci | grep -i nvidia
# 01:00.0 VGA compatible controller: NVIDIA Corporation Device 2803 (rev a1)
# 01:00.1 Audio device: NVIDIA Corporation Device 22bd (rev a1)
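When `lspci` prints only `Device 2803` instead of a product name, the numeric PCI ID is the reliable thing to search for. A small sketch, assuming `lspci -nn` output (which appends `[vendor:device]` pairs) and using NVIDIA's PCI vendor ID `10de`; `nvidia_pci_ids` is a hypothetical name.

```shell
# Hypothetical helper: read `lspci -nn` output on stdin and print the PCI
# device ID of every NVIDIA function (vendor ID 10de), one per line.
nvidia_pci_ids() {
  sed -n 's/.*\[10de:\([0-9a-fA-F]\{4\}\)\].*/\1/p'
}

# Usage:
#   lspci -nn | nvidia_pci_ids
```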
# Locate the NVIDIA driver files
whereis nvidia
# Remove the existing NVIDIA driver
sudo apt-get purge 'nvidia-*'
sudo apt-get autoremove
# Add the NVIDIA graphics drivers PPA and update the package list
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
# Install a suitable GPU driver version
sudo ubuntu-drivers autoinstall
# Or install a specific version
sudo apt install nvidia-driver-530
# Reboot the system after installation
sudo reboot
# Check the driver status
nvidia-smi
###
Mon Oct 21 16:48:37 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 34C P8 6W / 165W | 31MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2284 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 3926 G /usr/bin/gnome-shell 3MiB |
+-----------------------------------------------------------------------------------------+
###
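The header row of this output carries the two values worth recording: the driver version, and the CUDA version, which is the highest CUDA release the driver supports rather than an installed toolkit. A sketch of extracting both from the banner; `smi_versions` is a hypothetical name and the parsing assumes the header layout shown above.

```shell
# Hypothetical helper: read `nvidia-smi` output on stdin and print the driver
# and CUDA versions found in the banner line.
smi_versions() {
  sed -n 's/.*Driver Version: *\([0-9.]*\).*CUDA Version: *\([0-9.]*\).*/driver=\1 cuda=\2/p'
}

# Usage:
#   nvidia-smi | smi_versions
```

For scripting, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` avoids parsing the banner for the driver version.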
# Add the NVIDIA Container Toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the NVIDIA Container Toolkit
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# Restart the Docker service
sudo systemctl restart docker
# Confirm that Docker can see the NVIDIA GPU
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
###
Mon Oct 21 08:42:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 34C P8 7W / 165W | 31MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
###
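The same container check can be made script-friendly by counting GPUs instead of reading the table. A sketch, assuming `nvidia-smi -L` output (one `GPU <n>: ...` line per device); `count_gpus` is a hypothetical name.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print how many
# GPUs it lists. Prints 0 (and still succeeds) when none are found.
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Usage (same image as the check above):
#   docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi -L | count_gpus
```

A result of 0, or a Docker error before any output, would suggest the container still cannot see the GPU.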
Took 4 hrs. Done.
[Feature] Run Server swarm isunfa on another host, and parameterize and layer the environment