StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

GPU Server Loses GPU #1952

Open StanHatko opened 4 months ago

StanHatko commented 4 months ago

In the past couple of days I've encountered GPU servers suddenly losing their GPU. This has very rarely happened before, but yesterday and today it is occurring very frequently and is making GPU servers close to unusable.

It occurs in the following situation: if a process using the GPU exits (either normally at the end of the program or via Ctrl-C) and a new task that uses the GPU starts, there is a good chance the GPU will no longer be available to the new task. An existing nvidia-smi -l 1 process that is already running will continue to run and report 0 GPU usage, but if it is terminated and restarted, nvidia-smi no longer works and produces the error shown in the screenshot below.

[Screenshot: nvidia-smi error output]
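A quick way to check from Python whether the runtime still sees the GPU (a minimal sketch, not part of the original report; it assumes PyTorch is available on the server):

import torch

# If the GPU has detached, these calls typically report False / 0,
# or the allocation below raises a RuntimeError.
print(torch.cuda.is_available())
print(torch.cuda.device_count())

try:
    x = torch.randn([4, 4]).to(torch.device('cuda:0'))
    print('GPU allocation succeeded on', x.device)
except RuntimeError as e:
    print('GPU allocation failed:', e)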

StanHatko commented 3 months ago

Possible workaround I am testing today: open an ipython session, run the following, and leave it open in a separate terminal. The idea is to keep the GPU device in use (with a small tensor on the GPU) and prevent the GPU from detaching.

Code to run in ipython:

import torch

# Allocate a small tensor on the GPU so the device stays in use
# for as long as this ipython session remains open.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
x.device  # should report device(type='cuda', index=0)

StanHatko commented 3 months ago

That workaround seems to have been working for me so far today.

chuckbelisle commented 3 months ago

Thanks for the update @StanHatko! I've added it to the AAW issue backlog and will assess it at a later date.

StanHatko commented 3 months ago

This workaround usually works (it had worked for me every time until today), but on one server today it failed and the GPU still detached. Hopefully failures with the workaround remain rare, but they can occur.

StanHatko commented 3 months ago

This workaround failed on another GPU server. It seems the workaround basically no longer works, at least today.

StanHatko commented 3 months ago

But after restarting those servers and not using the workaround, they worked. So today the situation was inverted: problems occurred with the workaround but not without it (just using the server normally).

StanHatko commented 3 months ago

The problem occurred for me just now without the workaround active (so it can happen in both cases), though it seems less frequent today when the workaround is not running.

StanHatko commented 3 months ago

I'm currently trying the following modification to the workaround to keep the GPU device active and stop it from detaching. So far it seems to be working, but that could be a coincidence.

import time
import torch

# Allocate a small tensor on the GPU and touch it every half second,
# so the device stays active instead of detaching.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
print(x.device)

with torch.no_grad():
    while True:
        x = x + 0.01  # trivial GPU operation to keep the device busy
        time.sleep(0.5)
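
A possible refinement (a sketch of mine, not something tested in this thread) is to have the loop itself report when the device detaches, since the trivial GPU operation should then raise a RuntimeError:

import time
import torch

# Keep-alive loop that also logs if the GPU detaches.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)

with torch.no_grad():
    while True:
        try:
            x = x + 0.01
        except RuntimeError as e:
            print('GPU appears to have detached:', e)
            break
        time.sleep(0.5)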