DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

[rodan2.simssa.ca back now] vGPU driver not stable on Rodan production server #1161

Closed by homework36 1 month ago

homework36 commented 1 month ago

After seeing some strange error messages, I SSHed into the current rodan2 server and found this issue:

The gpu-celery container failed and cannot restart because it cannot call the NVIDIA driver. I verified that all related NVIDIA packages are installed properly, but nvidia-smi returns an error saying that it cannot communicate with the NVIDIA driver.

I purged everything related to NVIDIA and tried to reinstall it, but now I get this error message:

Error! Your kernel headers for kernel 5.10.0-30-cloud-amd64 cannot be found.

This error did not appear at all before.

I'm fixing this, but hopefully it's not a sign that the vGPU instances are unreliable...
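
For anyone hitting the same "kernel headers ... cannot be found" error, a quick pre-flight check along these lines may help confirm whether headers for the running kernel are present before reinstalling the driver. This is only a sketch: it assumes the headers are exposed through the `build` or `source` entry under /lib/modules/<release>, which is where DKMS normally looks, and the function name `have_kernel_headers` and its second argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel headers for a release look installed.
#   $1  kernel release (defaults to the running kernel, via uname -r)
#   $2  modules root   (defaults to /lib/modules; overridable for testing)
# Assumption: headers show up as <root>/<release>/build or .../source.
have_kernel_headers() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    # -e follows symlinks, so a dangling "build" link counts as missing.
    [ -e "$root/$release/build" ] || [ -e "$root/$release/source" ]
}

if have_kernel_headers; then
    echo "headers found for $(uname -r)"
else
    echo "headers missing for $(uname -r); try: sudo apt-get install linux-headers-$(uname -r)"
fi
```

If the check reports missing headers right after a reboot, the kernel may have been upgraded without the matching headers package being installed.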

homework36 commented 1 month ago

Fixed, but this might happen again in the future. Commands used:

sudo apt-get purge "*nvidia*"
reboot
sudo dpkg --configure -a
sudo apt-get install linux-headers-$(uname -r)

Then followed the steps in #1140.
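
Since this might recur, a small health check could be run after the reinstall (or periodically) to catch the driver going away before the gpu-celery container fails. The following is a minimal sketch, not something currently in the Rodan repo: the function name `check_gpu_driver` and its optional probe argument are hypothetical, and it relies only on nvidia-smi exiting non-zero when it cannot communicate with the driver.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the probe command exits 0.
#   $1  probe command (defaults to nvidia-smi); the parameter exists so the
#       check can be exercised on machines without a GPU.
check_gpu_driver() {
    probe="${1:-nvidia-smi}"
    if "$probe" >/dev/null 2>&1; then
        echo "ok: $probe can reach the driver"
        return 0
    fi
    # A missing binary (exit 127) is reported the same way as a driver error.
    echo "fail: $probe could not reach the driver" >&2
    return 1
}
```

Running `check_gpu_driver` with no arguments on the server probes nvidia-smi directly; a non-zero exit status would be the cue to check the kernel headers and repeat the steps above.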

homework36 commented 1 month ago

Avoid using rodan2.simssa.ca for now. See issue #1162