cpnr / computing

`Driver/library version mismatch` on `mewtwo` #49

Closed slowmoyang closed 4 months ago

slowmoyang commented 4 months ago
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.183
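
This error generally means that the NVIDIA kernel module currently loaded and the user-space NVML library come from different driver versions, which typically happens when the driver packages are upgraded without a reboot. A minimal Python sketch for confirming that, assuming the loaded module reports itself in `/proc/driver/nvidia/version` in the usual `NVRM version: ... Kernel Module  <version>` form. The helper names are hypothetical, and the library version is simply the one printed in the message above:

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the loaded kernel module exposes its version in this file.
PROC_VERSION = Path("/proc/driver/nvidia/version")


def kernel_module_version(text: str) -> Optional[str]:
    """Extract the driver version from the contents of /proc/driver/nvidia/version."""
    match = re.search(r"Kernel Module(?: for \S+)?\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def versions_match(module_version: str, library_version: str) -> bool:
    """Compare component-wise over the shorter version string.

    The NVML message above prints a shortened version (535.183), so only
    the components present in both strings are compared.
    """
    a = module_version.split(".")
    b = library_version.split(".")
    n = min(len(a), len(b))
    return a[:n] == b[:n]


if __name__ == "__main__":
    if PROC_VERSION.exists():
        loaded = kernel_module_version(PROC_VERSION.read_text())
        print("loaded kernel module:", loaded)
        if loaded is not None:
            print("matches NVML library 535.183:", versions_match(loaded, "535.183"))
    else:
        print("nvidia kernel module is not loaded")
```

If the two versions differ, rebooting (or reloading the kernel module) so that the module matches the installed library is the usual way out.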
jhgoh commented 4 months ago

Updating packages

jhgoh commented 4 months ago

The problem keeps recurring because the NVIDIA driver gets updated automatically, so unattended-upgrade is now configured to leave the NVIDIA packages alone.

Edited the relevant section of `/etc/apt/apt.conf.d/50unattended-upgrades` as follows.

Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
};
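
To see which packages these two entries actually hold back, the matching can be reproduced in a few lines. This is a sketch under the assumption that unattended-upgrades treats each blacklist entry as a Python regular expression matched from the start of the package name, as the comments in the stock config file describe. `is_blacklisted` is a hypothetical helper, and the package names are illustrative, not a listing from `mewtwo`:

```python
import re

# The two entries from Unattended-Upgrade::Package-Blacklist above.
BLACKLIST = ["nvidia-", "libnvidia-"]


def is_blacklisted(pkgname: str, blacklist=BLACKLIST) -> bool:
    """Return True if any blacklist regex matches at the start of pkgname."""
    return any(re.match(pattern, pkgname) for pattern in blacklist)


if __name__ == "__main__":
    # Illustrative package names only.
    for name in [
        "nvidia-driver-535",
        "libnvidia-compute-535",
        "xserver-xorg-video-nvidia-535",
        "linux-modules-nvidia-535-generic",
    ]:
        state = "held back" if is_blacklisted(name) else "still auto-upgraded"
        print(f"{name}: {state}")
```

Under that assumption, a package whose name only contains `nvidia` further in, such as `xserver-xorg-video-nvidia-535`, is not covered by these two entries; a broader pattern would be needed if such packages should be held back as well.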
jhgoh commented 4 months ago

Starting the reboot.

jhgoh commented 4 months ago
Fri Jun 28 14:35:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:02:00.0 Off |                  N/A |
| 23%   30C    P8               8W / 250W |      2MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
jhgoh commented 4 months ago
root@lugia:~# scontrol update nodename=mewtwo state=resume
root@lugia:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1    mix entei
normal*      up   infinite      3   idle ho-oh,raikou,suicune
gpu1         up   infinite      1  down* lapras
gpu2         up   infinite      1   idle mewtwo