cpnr / computing


mewtwo nvidia driver mismatch #41

Closed jhgoh closed 6 months ago

jhgoh commented 6 months ago

An NVIDIA driver mismatch occurred on the mewtwo server.

root@mewtwo:~# nvidia-smi 
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.171
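
This error usually means the NVIDIA kernel module that is currently loaded is older or newer than the user-space library that `nvidia-smi` links against. A minimal sketch for confirming that follows. It assumes a Linux host where the loaded module exposes its version at `/sys/module/nvidia/version`. The function names are hypothetical helpers, not part of any NVIDIA tool, and the prefix comparison is an assumption based on the shortened version string (`535.171`) in the error above.

```shell
#!/bin/sh
# Hypothetical helpers for diagnosing "Driver/library version mismatch".

# Print the version of the NVIDIA kernel module that is loaded right now.
# Prints nothing if the module is not loaded (assumed sysfs path).
loaded_nvidia_module_version() {
    if [ -r /sys/module/nvidia/version ]; then
        cat /sys/module/nvidia/version
    fi
}

# Return 0 if the kernel module version equals the library version or
# extends it with a further dotted component (535.171.04 vs 535.171).
nvidia_versions_match() {
    kernel_ver=$1
    library_ver=$2
    case "$kernel_ver" in
        "$library_ver"|"$library_ver".*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Possible usage, with the library version taken from the error message:

    nvidia_versions_match "$(loaded_nvidia_module_version)" 535.171 || echo "kernel module and library differ"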

Updated the packages and rebooted. Confirmed that the nvidia driver is working.

jhgoh@mewtwo:~$ nvidia-smi 
Tue Apr 16 22:45:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:02:00.0 Off |                  N/A |
| 23%   31C    P8               9W / 250W |      2MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
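
For scripted checks after a reboot, the driver version can be pulled out of the banner line shown above. This is only a sketch: `driver_version_from_banner` is a hypothetical helper, and it assumes the banner keeps the `Driver Version: X.Y.Z` layout seen in this output.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi` output on stdin and print the
# value that follows "Driver Version:" in the banner line.
driver_version_from_banner() {
    sed -n 's/.*Driver Version: \([0-9][0-9.]*\).*/\1/p'
}
```

Possible usage:

    nvidia-smi | driver_version_from_banner

`nvidia-smi --query-gpu=driver_version --format=csv,noheader` reports the same value without any text parsing, which may be the more robust choice.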

After the reboot, updated the slurm node information.

[root@hep:~]# scontrol update nodename=mewtwo state=resume
[root@hep:~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      4  alloc entei,ho-oh,raikou,suicune
gpu1         up   infinite      1   idle lapras
gpu2         up   infinite      1   idle mewtwo
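
To check the result of the resume from a script rather than by eye, the default `sinfo` table can be parsed. The sketch below uses a hypothetical helper, `node_state_from_sinfo`. It assumes the default column layout shown above (STATE second to last, NODELIST last) and plain comma-separated node names. Compressed ranges such as `node[1-3]` are not handled.

```shell
#!/bin/sh
# Hypothetical helper: read default `sinfo` output on stdin and print the
# STATE of the first partition row whose NODELIST contains the given node.
node_state_from_sinfo() {
    node=$1
    awk -v node="$node" 'NR > 1 {
        n = split($NF, names, ",")
        for (i = 1; i <= n; i++) {
            if (names[i] == node) { print $(NF - 1); exit }
        }
    }'
}
```

Possible usage:

    sinfo | node_state_from_sinfo mewtwo

Alternatively, `sinfo -h -n mewtwo -o %T` asks Slurm for the node state directly.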