Closed jhgoh closed 6 months ago
The NVIDIA driver on the mewtwo server reports a version mismatch.
root@mewtwo:~# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.171
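This error usually means the NVIDIA kernel module still loaded in memory is older than the driver files now installed on disk, for example after a package update without a reboot. The following is a minimal sketch of how that could be checked, not the procedure used in this issue. It assumes a Linux host where `/sys/module/nvidia/version` and `modinfo` are available, and it uses the on-disk module version as a stand-in for the userspace library version. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: compare the loaded nvidia kernel module with the one installed on disk.
# Assumption: the on-disk module version equals the userspace library version.

# Version of the nvidia kernel module currently loaded (empty if not loaded).
loaded_module_version() {
    cat /sys/module/nvidia/version 2>/dev/null
}

# Version of the nvidia module installed on disk for the running kernel
# (empty if modinfo or the module is missing).
installed_module_version() {
    modinfo -F version nvidia 2>/dev/null
}

# Succeeds only when both version strings are non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

loaded=$(loaded_module_version)
installed=$(installed_module_version)

if versions_match "$loaded" "$installed"; then
    echo "OK: loaded module ${loaded} matches the installed module"
else
    echo "MISMATCH: loaded='${loaded}' installed='${installed}'"
fi
```

If the two versions differ, rebooting loads the newly installed module.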
Updated the packages and rebooted. Confirmed that the NVIDIA driver is working.
jhgoh@mewtwo:~$ nvidia-smi
Tue Apr 16 22:45:43 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:02:00.0 Off |                  N/A |
| 23%   31C    P8               9W / 250W |      2MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
After the reboot, updated the Slurm node state.
[root@hep:~]# scontrol update nodename=mewtwo state=resume
[root@hep:~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      4  alloc entei,ho-oh,raikou,suicune
gpu1         up   infinite      1   idle lapras
gpu2         up   infinite      1   idle mewtwo