Shikherneo2 opened 5 years ago
Did you solve it? I got the same issue when I ran compute_bn_statistics.py with the iteration-40000 model on the PASCAL VOC dataset.
I've been facing the same issue when changing the number of classes and the class_weights. The weird part is that when label 0's weight is set to zero, it works perfectly fine (see the check sketched below).
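I can't confirm this is what's happening in this repo, but a common cause of this exact symptom is ground-truth labels falling outside [0, num_classes), which makes the loss layer index past the end of its GPU buffers. Here's a minimal sketch to check for that (it assumes single-channel PNG label images; `NUM_CLASSES` and `LABEL_DIR` are placeholders for your own values):

```python
import glob

import numpy as np
from PIL import Image

NUM_CLASSES = 21       # hypothetical: set to the num_output of your final layer
LABEL_DIR = "labels"   # hypothetical: directory containing ground-truth label PNGs

# Scan every label image and report pixel values outside [0, NUM_CLASSES - 1].
# Out-of-range label values are a common trigger for
# "an illegal memory access was encountered" in Caffe's GPU loss kernels.
for path in sorted(glob.glob(LABEL_DIR + "/*.png")):
    labels = np.asarray(Image.open(path))
    bad = np.unique(labels[labels >= NUM_CLASSES])
    if bad.size:
        print(path, "has out-of-range labels:", bad.tolist())
```

For PASCAL VOC in particular, the void border pixels are labeled 255, so they need to be either remapped or excluded with `ignore_label: 255` in the loss layer's `loss_param`.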
Hi
I have been facing this issue for a while now, where the training suddenly stops with the error "Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered". Weirdly, it runs fine until about 150K iterations, after which this error pops up. I tried reducing the batch size, but it didn't help. It does not seem to be an out-of-memory issue, as I checked the GPU memory usage when the error occurred. Thank you for any help.
```
I0524 19:24:23.882612 13813 solver.cpp:486] Iteration 150000, lr = 0.01
F0524 19:24:39.746389 13813 math_functions.cu:81] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
Check failure stack trace:
    @     0x7efd2b3690cd  google::LogMessage::Fail()
    @     0x7efd2b36af33  google::LogMessage::SendToLog()
    @     0x7efd2b368c28  google::LogMessage::Flush()
    @     0x7efd2b36b999  google::LogMessageFatal::~LogMessageFatal()
    @     0x7efd2b81f8ba  caffe::caffe_gpu_memcpy()
    @     0x7efd2b7a1ac0  caffe::SyncedMemory::gpu_data()
    @     0x7efd2b675562  caffe::Blob<>::gpu_data()
    @     0x7efd2b6ad189  caffe::BaseConvolutionLayer<>::forward_gpu_bias()
    @     0x7efd2b7e18d8  caffe::ConvolutionLayer<>::Forward_gpu()
    @     0x7efd2b785b9a  caffe::Net<>::ForwardFromTo()
    @     0x7efd2b785cc7  caffe::Net<>::ForwardPrefilled()
    @     0x7efd2b79f556  caffe::Solver<>::Step()
    @     0x7efd2b79fea2  caffe::Solver<>::Solve()
    @     0x55ee0359557c  train()
    @     0x55ee03592487  main
    @     0x7efd2a7c7b97  __libc_start_main
    @     0x55ee03592c2a  _start
```
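One general CUDA debugging step that might help narrow this down (not specific to this repo): CUDA errors are "sticky", so the `caffe_gpu_memcpy()` at the top of the trace is likely just the first call to notice a fault from an earlier kernel. Rerunning the trainer under `cuda-memcheck` with the environment variable `CUDA_LAUNCH_BLOCKING=1` set forces kernels to be checked synchronously, so the report points at the kernel that actually performed the illegal access rather than a later memcpy.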