BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Multi-GPU detection problems with illegal memory access error #5498

Open alfredox10 opened 7 years ago

alfredox10 commented 7 years ago

Full error:

F0407 22:35:23.664752 27364 syncedmem.hpp:22] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered

Issue summary

This happens when I run image detection with trained R-CNN models from a Python script that splits a stream of images across multiple Python subprocesses and loads the models for each GPU in its own child process. Memory usage always goes up on GPU 0, but not on any of the other GPUs in the system. I am trying to implement parallel GPU detection by splitting the task across the 8 GPUs of a p2.8xlarge AWS EC2 instance.

Has anyone seen this? I know Caffe isn't optimized for multi-GPU training, but I did not think there would be any issues if I split the work into independent processes and just ran detections on each GPU.

I am using these commands in Python to set the GPU in each subprocess:

caffe.set_mode_gpu()
caffe.set_device(gpu_slot)
rcnn_net = caffe.Net(prototxt, caffemodel, caffe.TEST)

Is there something else I should set? Is the Caffe library hard-coded to use shared memory only on GPU 0 for all GPUs? Any information would be helpful.

Steps to reproduce

Run Caffe in multiple terminal windows (easier than writing a multiprocess Python application), each assigned to a different GPU, and then attempt to perform detections in parallel, as is normally done with the Caffe API, using these commands:

net.set_input_arrays(data4D.astype(np.float32), data4DLabels.astype(np.float32))
prediction = net.forward()

System configuration

Operating system: Ubuntu 14.04 (headless)
Compiler: gcc 4.7
CUDA version (if applicable): 7.5
CUDNN version (if applicable): 4
Python or MATLAB version (for pycaffe and matcaffe respectively): 2.7

cypof commented 7 years ago

What happens when you try to use only GPU 1? The GPU selection is per-thread. Do you have other threads running in your app?
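To illustrate the per-thread point, here is a minimal sketch in which every worker thread performs its own GPU setup before doing any work. The helper names `init_caffe_gpu` and `run_in_thread` are hypothetical and not part of pycaffe; only `caffe.set_device()` and `caffe.set_mode_gpu()` come from the thread. It assumes that a selection made in one thread does not carry over to threads started later.

```python
import threading


def init_caffe_gpu(gpu_slot):
    """Select the GPU for the calling thread (requires pycaffe)."""
    import caffe  # imported lazily so this module loads without pycaffe

    caffe.set_device(gpu_slot)
    caffe.set_mode_gpu()


def run_in_thread(gpu_slot, work, init=init_caffe_gpu):
    """Start a thread that does its own GPU setup, then runs work(gpu_slot)."""

    def target():
        init(gpu_slot)  # setup happens inside the worker thread itself
        work(gpu_slot)

    thread = threading.Thread(target=target)
    thread.start()
    return thread
```

The `init` parameter exists only so the setup step can be swapped out, for example to try the threading logic on a machine without a GPU.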

shelhamer commented 7 years ago

It's important to call set_device() before set_mode_gpu(), like so:

caffe.set_device(gpu_slot)
caffe.set_mode_gpu()
rcnn_net = caffe.Net(prototxt, caffemodel, caffe.TEST)
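
For the one-process-per-GPU setup described in the issue, that ordering could be applied inside each child process roughly as follows. This is only a sketch under assumptions: `split_round_robin`, `detect_worker`, and `run_parallel` are hypothetical names, the per-image loop is left as a placeholder, and it is written for Python 3 (the `spawn` start method is not available in the Python 2.7 mentioned above). The parent process deliberately never imports caffe, so CUDA is first initialized in each child.

```python
import multiprocessing as mp


def split_round_robin(items, num_gpus):
    """Deal items out to num_gpus lists so each GPU gets a similar share."""
    return [items[slot::num_gpus] for slot in range(num_gpus)]


def detect_worker(gpu_slot, image_paths, prototxt, caffemodel):
    # Import inside the child so CUDA is first initialized in this process.
    import caffe

    caffe.set_device(gpu_slot)  # select the device first
    caffe.set_mode_gpu()        # then switch to GPU mode
    net = caffe.Net(prototxt, caffemodel, caffe.TEST)
    for path in image_paths:
        # Placeholder: load `path`, fill the input blob, call net.forward().
        pass


def run_parallel(image_paths, prototxt, caffemodel, num_gpus):
    # "spawn" starts fresh interpreters instead of forking the parent.
    ctx = mp.get_context("spawn")
    procs = []
    for gpu_slot, chunk in enumerate(split_round_robin(image_paths, num_gpus)):
        proc = ctx.Process(target=detect_worker,
                           args=(gpu_slot, chunk, prototxt, caffemodel))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
```

With the `spawn` start method, `run_parallel(...)` should be called from under an `if __name__ == "__main__":` guard so that child processes can import the module safely.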

nnop commented 7 years ago

Why does the order of the two setting statements matter? @shelhamer @cypof