Closed homework36 closed 1 month ago
Fixed, but this might happen again in the future. Commands used:
sudo apt-get purge "*nvidia*"
reboot
sudo dpkg --configure -a
sudo apt-get install linux-headers-$(uname -r)
Followed by the steps in #1140.
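In case this does happen again, here is a rough sketch of a health check that could be run on the host after the reinstall and reboot. The `probe` and `check_gpu_host` helpers are hypothetical names made up for this example, not part of Rodan or the steps in #1140. It only assumes a Debian/Ubuntu host (`dpkg`) and that the driver's kernel module is named `nvidia`.

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and report whether it succeeded.
# $1 is a label; the remaining arguments are the command to run.
probe() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED"
        return 1
    fi
}

# Hypothetical helper: run all three probes, return non-zero if any failed.
check_gpu_host() {
    rc=0
    # Headers for the running kernel (installed in the fix above).
    probe "kernel headers" dpkg -s "linux-headers-$(uname -r)" || rc=1
    # Is the nvidia kernel module actually loaded?
    probe "nvidia module" grep -q '^nvidia ' /proc/modules || rc=1
    # Can userspace talk to the driver?
    probe "nvidia-smi" nvidia-smi || rc=1
    return $rc
}

check_gpu_host || echo "GPU host is not healthy; gpu-celery will likely fail to start"
```

If the headers probe fails after a kernel upgrade, that might explain why the driver stopped loading, since the module has to be built against the running kernel.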
Avoid using rodan2.simssa.ca for now. See issue #1162
After seeing some strange error messages, I SSHed into the current rodan2 server and found this weird issue:
The gpu-celery container failed and cannot restart because it cannot call the nvidia driver. I verified that all related nvidia packages are installed properly, but
nvidia-smi
returns an error, saying that it cannot communicate with the nvidia driver. I purged everything related to nvidia and tried to reinstall it, but now I get this error message
which did not appear at all before.
I'm fixing this, but hopefully it's not a sign that the vGPU instances are unreliable...