ROCm / hipCaffe

(Deprecated) hipCaffe: the HIP port of Caffe

error: 'hipErrorMemoryAllocation'(1002) at src/caffe/syncedmem.cpp:56 #15

Open xxgtxx opened 7 years ago

xxgtxx commented 7 years ago

Issue summary

Hello everyone, I get a memory error when benchmarking execution time with hipCaffe. This only occurs with large input data: `shape: { dim: 10 dim: 3 dim: 1024 dim: 2048 }`. There is no error with the default input data size: `shape: { dim: 10 dim: 3 dim: 224 dim: 224 }`.

Error

```
I0914 12:29:28.475584 218425 net.cpp:228] conv2/3x3 does not need backward computation.
I0914 12:29:28.475591 218425 net.cpp:228] conv2/relu_3x3_reduce does not need backward computation.
I0914 12:29:28.475597 218425 net.cpp:228] conv2/3x3_reduce does not need backward computation.
I0914 12:29:28.475603 218425 net.cpp:228] pool1/norm1 does not need backward computation.
I0914 12:29:28.475610 218425 net.cpp:228] pool1/3x3_s2 does not need backward computation.
I0914 12:29:28.475616 218425 net.cpp:228] conv1/relu_7x7 does not need backward computation.
I0914 12:29:28.475622 218425 net.cpp:228] conv1/7x7_s2 does not need backward computation.
I0914 12:29:28.475628 218425 net.cpp:228] data does not need backward computation.
I0914 12:29:28.475632 218425 net.cpp:270] This network produces output prob
I0914 12:29:28.475728 218425 net.cpp:283] Network initialization done.
I0914 12:29:28.476884 218425 caffe.cpp:355] Performing Forward
I0914 12:30:55.425948 218425 caffe.cpp:360] Initial loss: 0
I0914 12:30:55.426497 218425 caffe.cpp:361] Performing Backward
I0914 12:30:55.426555 218425 caffe.cpp:369] Benchmark begins
I0914 12:30:55.426565 218425 caffe.cpp:370] Testing for 2 iterations.
error: 'hipErrorMemoryAllocation'(1002) at src/caffe/syncedmem.cpp:56
```
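For scale, here is a rough float32 blob-size calculation for the two input shapes (a sketch; `blob_mb` is just an illustrative helper, and the last line assumes GoogLeNet's stride-2 64-channel `conv1/7x7_s2` first layer):

```python
# Back-of-envelope float32 blob sizes. This only counts single blobs;
# the network's many intermediate feature maps multiply the total footprint.

def blob_mb(n, c, h, w, bytes_per_elem=4):
    """Size of an N x C x H x W blob in MiB (float32 by default)."""
    return n * c * h * w * bytes_per_elem / (1024 ** 2)

print(blob_mb(10, 3, 224, 224))    # default input:  ~5.7 MiB
print(blob_mb(10, 3, 1024, 2048))  # enlarged input: 240.0 MiB
print(blob_mb(10, 64, 512, 1024))  # conv1 output at this input size: 1280.0 MiB
```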

Steps to reproduce

hipCaffe built with these `Makefile.config` parameters:

```
USE_MIOPEN := 1
USE_ROCBLAS := 1
```

Change the data input size to:

```
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 10 dim: 3 dim: 1024 dim: 2048 } }
}
```

Then execute the network:

```
/home/intel/hipCaffe/build/tools/caffe time -gpu 0 -iterations 2 -model /home/intel/hipCaffe/models/bvlc_googlenet/deploy.prototxt
```

Your system configuration

- Operating system: Ubuntu 16.04
- Kernel: 4.11.0-kfd-compute-rocm-rel-1.6-148
- CPU: Intel Skylake
- GPU: AMD Radeon Vega Frontier Edition @ 16 GB

parallelo commented 7 years ago

@xxgtxx - Thanks for the error report. It looks like the run is exhausting GPU memory due to the chosen larger data size. Can you please try reducing the batch size (the first dimension, `dim: 10`) and see whether that helps?

xxgtxx commented 7 years ago

Hey, I'm also seeing this error with batch size 1: `{ dim: 1 dim: 3 dim: 1024 dim: 2048 }`.

parallelo commented 7 years ago

Based on a standard bvlc/caffe GoogLeNet test I ran today on another HW vendor's platform, it didn't look like your specific configuration fit into that device's (quite large) memory either. Have you observed something different?
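One rough way to see why even batch size 1 can fail here: activation memory scales with the input's spatial area, and the enlarged input is roughly 42x the area of the default one (a quick sanity-check calculation, not a precise memory model):

```python
# Spatial-area ratio between the enlarged and default input shapes.
# Every feature map in the network grows by roughly this factor.
area_ratio = (1024 * 2048) / (224 * 224)
print(round(area_ratio, 1))  # 41.8
```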

dhzhd1 commented 6 years ago

I can't agree more with parallelo. When the training image is too large to fit in GPU memory, there are several things you can do: 1) reduce the batch size (which you've already tried); 2) resize the image to a smaller dimension; 3) if the object features would vanish after resizing, crop the image into smaller tiles instead. I think this is a data pre-processing issue, not an issue in Caffe itself.
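Option 3 above can be sketched as a tiling pass (a sketch only; a NumPy array stands in for the image, and real code would load it with Pillow or OpenCV):

```python
# Crop a large image into fixed-size 224x224 tiles instead of resizing,
# so fine object detail is preserved for the network.
import numpy as np

img = np.zeros((1024, 2048, 3), dtype=np.uint8)  # placeholder 1024x2048 RGB image

tile = 224
tiles = [
    img[y:y + tile, x:x + tile]
    for y in range(0, img.shape[0] - tile + 1, tile)
    for x in range(0, img.shape[1] - tile + 1, tile)
]
print(len(tiles), tiles[0].shape)  # 36 tiles, each (224, 224, 3)
```

Each tile can then be fed to the network with the default `dim: 224` input shape, avoiding the large allocation entirely.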