ronghanghu opened this issue 9 years ago
I have encountered similar issues while using a self-coded LSTM layer to train a translation model. I am confident I did not introduce any random factors in my own code or in the .prototxt config -- the random seeds remained the same each time; yet each time I ran the training procedure it yielded different loss values, except for the initial one.
Moreover, I had turned off the cuDNN flag while compiling the environment, so I guess there might still be some random factors in the training-related parts of Caffe. FYI, I ran the experiments on a single GPU.
I think the cuDNN non-deterministic behavior is caused by resetting the diffs every time in CuDNNConvolutionLayer::Backward_gpu(). Actually, the diffs are already cleared in Net::ClearParamDiffs(). It seems this is not a bug for multi-GPU, but it is when iter_size > 1.
```cpp
template <typename Dtype>
void CuDNNConvolutionLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = NULL;
  Dtype* weight_diff = NULL;
  if (this->param_propagate_down_[0]) {
    weight = this->blobs_[0]->gpu_data();
    weight_diff = this->blobs_[0]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[0]->count(), Dtype(0), weight_diff);
  }
  Dtype* bias_diff = NULL;
  if (this->bias_term_ && this->param_propagate_down_[1]) {
    bias_diff = this->blobs_[1]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[1]->count(), Dtype(0), bias_diff);
  }
  // ...
```
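To make the conflict with gradient accumulation concrete, here is a minimal, self-contained sketch (plain C++, not Caffe code) of why an unconditional reset of the diff inside Backward clashes with iter_size > 1: the solver clears the param diffs once per iteration and then calls Backward iter_size times, expecting the per-pass gradients to accumulate.

```cpp
// Minimal sketch (not Caffe code): Net::ClearParamDiffs() zeroes the diffs
// once per solver iteration; each of the iter_size backward passes is then
// supposed to ADD its gradient contribution into the same diff buffer.
#include <cstdio>

int main() {
  const int iter_size = 2;
  float weight_diff = 0.0f;      // cleared once, like Net::ClearParamDiffs()
  for (int i = 0; i < iter_size; ++i) {
    // weight_diff = 0.0f;       // what the extra caffe_gpu_set() amounts to:
                                 // it would discard the previous pass's work
    weight_diff += 1.0f;         // this pass's gradient contribution
  }
  std::printf("accumulated diff = %.1f (expected %d.0)\n",
              weight_diff, iter_size);
  return 0;
}
```

With the commented line enabled, only the last pass survives, so the effective gradient is silently divided by iter_size.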
@FicusRong These two lines seem to have been introduced in #3160. I'll take a look. Thanks for reporting!
I encountered problem 3 in multi-GPU training. I use two data input layers (one for images and the other for multi-dimensional labels), and the program crashed with the following error (training is fine on a single GPU). Is there currently any workaround for this problem?
```
*** Aborted at 1453395416 (unix time) try "date -d @1453395416" if you are using GNU date ***
PC: @ 0x7f17e7c7da5f (unknown)
*** SIGSEGV (@0xaa1c000) received by PID 22896 (TID 0x7f17e9587780) from PID 178372608; stack trace: ***
    @ 0x7f17e7b62d40 (unknown)
    @ 0x7f17e7c7da5f (unknown)
    @ 0x7f17e8c43a9c std::vector<>::erase()
    @ 0x7f17e8c42807 caffe::DevicePair::compute()
    @ 0x7f17e8c47c4c caffe::P2PSync<>::run()
    @ 0x407dc1 train()
    @ 0x405bc1 main
    @ 0x7f17e7b4dec5 (unknown)
    @ 0x4062d1 (unknown)
    @ 0x0 (unknown)
Segmentation fault (core dumped)
```
@ronghanghu To fix 1, would it be okay if the default algorithms (bwd_filter_algo_ and bwd_data_algo_) were changed to 1 (deterministic, according to the cuDNN docs) whenever a random_seed is given by the user? I mean, if the user sets a random_seed, he must be expecting deterministic behavior. I couldn't find any information on the impact this would have on performance. Should we expect any side effects besides performance if we manually set those algorithms to 1 and rebuild Caffe as a temporary fix (instead of disabling cuDNN)? NVIDIA's fork already has this change.
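For reference, here is roughly what that temporary fix could look like. This is a hedged sketch of a local patch, not a tested change: the member names and loop follow the upstream CuDNNConvolutionLayer in src/caffe/layers/cudnn_conv_layer.cpp, and the override should go after the existing cudnnGet*Algorithm() calls but before the workspace-size queries so the allocated workspace matches the forced algorithms.

```cpp
// Sketch of a local patch (not a tested change) inside
// CuDNNConvolutionLayer<Dtype>::Reshape(): override the automatically chosen
// backward algorithms with the ALGO_1 variants, which cuDNN documents as
// deterministic. Do this before querying workspace sizes.
for (int i = 0; i < bottom.size(); ++i) {
  bwd_filter_algo_[i] = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;
  bwd_data_algo_[i]   = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
}
```

Whether this costs speed depends on the layer shapes, so it is worth benchmarking before adopting it permanently.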
- Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net are non-deterministic (based on race conditions). This is my fault in #2903.
This is fixed with the switch to the new parallelism in #4563. The non-determinism of cuDNN can be addressed by setting engine: CAFFE instead, and for CPU one can pick a BLAS other than MKL. I think these are acceptable workarounds. A FORCE_DETERMINISM mode that trades performance for determinism could be incorporated, however, for a more fatalistic Caffe.
I have run into the same problem.
I am faced with the same issue.
Tried rodrigoberriel's solution, but still got non-deterministic results when training. Is there a way to get consistent results for each run without disabling cuDNN since doing so will slow down the training?
Although there have been a lot of efforts in Caffe (such as the unified RNG) to ensure reproducible and deterministic results, Caffe is currently still non-deterministic in several ways. In particular, the default cuDNN backward algorithms, CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3, are numerically non-deterministic, and data fetching across multiple GPUs is subject to race conditions. Issues 1 & 2 (numerical non-determinism) can cause tests that rely on deterministic behavior (such as TestSnapshot in test_gradient_based_solver.cpp) to fail, while issue 3 can result in bugs like #2977. This thread is opened to discuss how to cope with these issues (and possibly ensure determinism in Caffe?).
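To make the "numerically non-deterministic" part concrete: these backward algorithms accumulate partial sums in an order that can vary between runs (e.g. via atomic adds), and floating-point addition is not associative, so the low-order bits of the gradients can differ from run to run even with identical inputs and seeds. A minimal, Caffe-independent illustration:

```cpp
// Floating-point addition is not associative, so a reduction whose order
// varies between runs can produce slightly different results each time.
#include <cstdio>

int main() {
  float a = 1e8f, b = -1e8f, c = 1.0f;
  float left  = (a + b) + c;   // 1.0: c survives because a and b cancel first
  float right = a + (b + c);   // 0.0: c is absorbed by b's large magnitude
  std::printf("(a+b)+c = %g,  a+(b+c) = %g\n", left, right);
  return 0;
}
```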