facebookarchive / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai
Apache License 2.0
8.42k stars 1.94k forks

Using single GPU other than GPU 0 makes system unreachable #699

Open akshay-raj-dhamija opened 7 years ago

akshay-raj-dhamija commented 7 years ago

Hello, I used the following code to run a simple MNIST example on a GPU on a remote machine:

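from caffe2.python import cnn, workspace

# AddLeNetModel and AddTrainingOperators are helper functions not shown in this post.
gpu_no = 1
training_model = cnn.CNNModelHelper(order="NCHW", name="training_net", use_cudnn=True)
# Assign every operator in both nets to the chosen GPU.
training_model.net.RunAllOnGPU(gpu_id=gpu_no, use_cudnn=True)
training_model.param_init_net.RunAllOnGPU(gpu_id=gpu_no, use_cudnn=True)
workspace.ResetWorkspace()
soft = AddLeNetModel(training_model)
AddTrainingOperators(training_model, soft)
# Initialize the parameters once, then create the training net.
workspace.RunNetOnce(training_model.param_init_net)
workspace.CreateNet(training_model.net, overwrite=True, input_blobs=['data', 'label'])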

To run the network for each iteration, I use:

# device_option is not defined in this post; it presumably targets the same GPU as gpu_no.
workspace.FeedBlob("data", data, device_option)
workspace.FeedBlob("label", label, device_option)
workspace.RunNet(training_model.net, num_iter=1)

The above code initially works fine with gpu_no set to 1, but then the GPUs hang, ultimately making the system unreachable. Please note that the same code runs without any issues with gpu_no set to 0.

Yangqing commented 7 years ago

This seems to be an issue with the GPU installation. What does nvidia-smi show? Are there any other processes running on GPU 1 when it freezes?
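One way to answer the question about other processes is to ask nvidia-smi for its per-process list in CSV form and group the rows by GPU. The following is only a minimal sketch, not part of the original report: the helper names `parse_compute_apps` and `query_compute_apps` are hypothetical, and it assumes an nvidia-smi build that supports the `--query-compute-apps` option with the `gpu_uuid`, `pid`, `process_name`, and `used_memory` fields.

```python
import csv
import io
import subprocess
from collections import defaultdict


def parse_compute_apps(csv_text):
    """Group compute processes by GPU UUID.

    Expects rows of "gpu_uuid, pid, process_name, used_memory", as produced
    by nvidia-smi with --format=csv,noheader (assumed field order).
    """
    by_gpu = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        gpu_uuid, pid, name, used_memory = (field.strip() for field in row[:4])
        by_gpu[gpu_uuid].append(
            {"pid": int(pid), "name": name, "used_memory": used_memory}
        )
    return dict(by_gpu)


def query_compute_apps():
    """Run nvidia-smi and return its compute processes grouped by GPU UUID.

    Hypothetical helper: requires nvidia-smi on PATH and a working driver.
    """
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        universal_newlines=True,
    )
    return parse_compute_apps(output)
```

Calling `query_compute_apps()` shortly before the freeze would show whether anything besides the training script holds a context on the second GPU. The UUIDs can be matched to device indices with `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`.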

akshay-raj-dhamija commented 7 years ago

Thanks for your response. The only processes running on either GPU were from the Caffe2 code. Below is a screenshot of nvidia-smi taken just before the GPUs froze.

[Image: screenshot of nvidia-smi output, 2017-05-31 at 8:45:44 pm]