New instances work with other GPU jobs but not PACO training (#1181). The GPU is accessible.
With the updated hypervisors, this issue has come back with our vGPU instances...
fixed
This issue came back on staging...
Not related to local or staging environments.
Same issue as in #1161. In short, we hit the same error despite successfully installing everything related. What makes this disturbing is that it does not happen immediately, but some time later (potentially in the middle of a job), after we thought things were running properly.
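Since the failure only shows up some time into a job, one thing that might help pin down when the GPU disappears is a small watcher that polls `nvidia-smi` from inside the container. A rough sketch (the 60-second interval and the log file name are arbitrary choices for this sketch, not part of our setup):

```python
import subprocess
import time
from datetime import datetime

LOG_PATH = "gpu_watch.log"  # arbitrary log location for this sketch

def gpu_visible() -> bool:
    """Return True if nvidia-smi can still see the GPU from inside the container."""
    try:
        subprocess.run(
            ["nvidia-smi", "--query-gpu=name,utilization.gpu", "--format=csv,noheader"],
            check=True, capture_output=True, timeout=10,
        )
        return True
    except (subprocess.CalledProcessError, FileNotFoundError, subprocess.TimeoutExpired):
        return False

if __name__ == "__main__":
    # Append a timestamped visibility check once a minute so we can see
    # roughly when (mid-job) the GPU stops responding.
    while True:
        with open(LOG_PATH, "a") as log:
            log.write(f"{datetime.now().isoformat()} gpu_visible={gpu_visible()}\n")
        time.sleep(60)
```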
I just found out this has happened to all 5 instances I have tested for Rodan prod so far, from Ubuntu 20 and 22 to Debian 11, with both vGPU flavors, including instances not yet ported to rodan2.simssa.ca (and thus not used by anyone). I suspect it is an issue with Arbutus, and I'll send them an email once I have collected all the information about it. Since they keep their vGPU drivers private, there is not much we can do from our end.
Things tested:
Testing the current prod server with a training job that needs the GPU: used the pixel zip that works on staging and got this error message
However, we are able to access the GPU within this container.
I did some simple computation using TensorFlow and monitored the GPU usage; it seems to work fine.
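A minimal sketch of this kind of check (not the exact commands I ran, just standard TensorFlow calls):

```python
import tensorflow as tf

# List the GPUs TensorFlow can see from inside the container.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Run a small matrix multiplication pinned to the first GPU while watching
# `nvidia-smi` in another shell to confirm the utilization actually rises.
if gpus:
    with tf.device("/GPU:0"):
        a = tf.random.normal((4096, 4096))
        b = tf.random.normal((4096, 4096))
        c = tf.matmul(a, b)
    print("Result checksum:", float(tf.reduce_sum(c)))
else:
    print("No GPU visible to TensorFlow")
```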
There might be a different issue.
Update: I just saw this line on the Compute Canada vGPU page, which I missed earlier.
Not sure if it is related, because our documentation does not ask for the CUDA toolkit, but the GPU-celery container Dockerfile has a requirement for CUDA.
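If the CUDA toolkit requirement matters here, one quick sanity check would be comparing the CUDA version the installed TensorFlow was built against with what the driver reports. A rough sketch (not something we currently run):

```python
import subprocess
import tensorflow as tf

# CUDA/cuDNN versions the installed TensorFlow wheel was built against.
build = tf.sysconfig.get_build_info()
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("CUDA version (build):", build.get("cuda_version"))
print("cuDNN version (build):", build.get("cudnn_version"))

# Driver-side view from nvidia-smi, to compare against the build-time versions.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
)
print("Driver version:", out.stdout.strip())
```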
Unrelated fun fact: the GPU used on staging (Tesla K80) came out in 2014 and is now worth $83.