BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Deal with Non-Deterministic Behavior (Ensure Determinism?) #3168

Open ronghanghu opened 9 years ago

ronghanghu commented 9 years ago

Although there has been a lot of effort in Caffe (such as the unified RNG) to ensure reproducible and deterministic results, Caffe is currently still non-deterministic in several ways, as described below:

  1. GPU mode: cuDNN can be numerically non-deterministic with CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3.
  2. CPU mode: Intel MKL can be numerically non-deterministic. Details: https://github.com/BVLC/caffe/issues/3109#issuecomment-146280655
  3. Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net is non-deterministic (based on race conditions). This is my fault in #2903.

1 & 2 (numerical non-determinism) can cause tests that rely on deterministic behavior (such as TestSnapshot in test_gradient_based_solver.cpp) to fail, while 3 can result in bugs like #2977.

This thread is open to discuss how to cope with these issues (and possibly ensure determinism in Caffe?).
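For reference, the part we do control is the unified RNG, seeded from the solver prototxt. A minimal sketch (the net path and hyperparameters below are placeholders):

  # solver.prototxt (sketch)
  net: "examples/mnist/lenet_train_test.prototxt"
  random_seed: 1701        # seeds Caffe's unified RNG (weight init, dropout, shuffling)
  base_lr: 0.01
  lr_policy: "fixed"
  max_iter: 1000
  snapshot: 500
  snapshot_prefix: "examples/mnist/lenet"
  solver_mode: GPU

Even with random_seed fixed, points 1-3 above can still make two runs diverge.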

RolanChen commented 9 years ago

I have encountered similar issues while using a self-coded LSTM layer to train a translation model. I am sure I did not introduce any random factors in my own code or in the .prototxt config -- the random seed was kept the same each time; yet each time I ran the training procedure it yielded different loss values, except for the initial one.

Moreover, I had turned off the cuDNN flag when compiling, so I guess there might still be some random factors in the training-related parts of Caffe. FYI, I ran the experiments on a single GPU.

FicusRong commented 9 years ago

I think the cuDNN non-deterministic behavior is caused by resetting the diffs every time in CuDNNConvolutionLayer::Backward_gpu(). The diffs are already cleared in Net::ClearParamDiffs(). It seems this is not a bug in the multi-GPU case but in the "iter_size > 1" case.

template <typename Dtype>
void CuDNNConvolutionLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = NULL;
  Dtype* weight_diff = NULL;
  if (this->param_propagate_down_[0]) {
    weight = this->blobs_[0]->gpu_data();
    weight_diff = this->blobs_[0]->mutable_gpu_diff();
    // This zeroing discards any gradient already accumulated in the weight
    // diff (e.g. from earlier passes when iter_size > 1).
    caffe_gpu_set(this->blobs_[0]->count(), Dtype(0), weight_diff);
  }
  Dtype* bias_diff = NULL;
  if (this->bias_term_ && this->param_propagate_down_[1]) {
    bias_diff = this->blobs_[1]->mutable_gpu_diff();
    // Same problem for the bias diff.
    caffe_gpu_set(this->blobs_[1]->count(), Dtype(0), bias_diff);
  }
ronghanghu commented 9 years ago

@FicusRong These two lines seem to have been introduced in #3160. I'll take a look. Thanks for reporting!

ylongqi commented 8 years ago

I encountered problem 3 in multi-GPU training. I use two data input layers (one for images, the other for multi-dimensional labels), and the program crashed with the following error (training is fine on a single GPU). Is there currently any workaround for this problem?

*** Aborted at 1453395416 (unix time) try "date -d @1453395416" if you are using GNU date ***
PC: @ 0x7f17e7c7da5f (unknown)
*** SIGSEGV (@0xaa1c000) received by PID 22896 (TID 0x7f17e9587780) from PID 178372608; stack trace: ***
    @ 0x7f17e7b62d40 (unknown)
    @ 0x7f17e7c7da5f (unknown)
    @ 0x7f17e8c43a9c std::vector<>::erase()
    @ 0x7f17e8c42807 caffe::DevicePair::compute()
    @ 0x7f17e8c47c4c caffe::P2PSync<>::run()
    @ 0x407dc1 train()
    @ 0x405bc1 main
    @ 0x7f17e7b4dec5 (unknown)
    @ 0x4062d1 (unknown)
    @ 0x0 (unknown)
Segmentation fault (core dumped)

rodrigoberriel commented 7 years ago

@ronghanghu to fix 1., would it be okay if the default algorithms (bwd_filter_algo_ and bwd_data_algo_) were changed to 1 (deterministic according to the cuDNN docs) whenever a random_seed is given by the user? I mean, if the user sets the random_seed, they must be expecting deterministic behavior.

I couldn't find any information on the impact this would have on performance. Should we expect any side effects besides a performance hit if we manually set those algorithms to 1 and rebuild Caffe as a temporary fix (instead of disabling cuDNN)? NVIDIA's fork already has this change. A sketch of what I mean is below.
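Roughly (assuming the per-blob algorithm arrays are still named bwd_filter_algo_ and bwd_data_algo_ in cudnn_conv_layer.cpp, as on current master):

  // Hypothetical patch inside CuDNNConvolutionLayer<Dtype>::Reshape(), right
  // after cudnnGetConvolutionBackwardFilterAlgorithm() /
  // cudnnGetConvolutionBackwardDataAlgorithm() have picked the defaults:
  for (int i = 0; i < bottom.size(); ++i) {
    // ALGO_1 is documented by NVIDIA as deterministic (reproducible), unlike
    // BWD_FILTER_ALGO_0/3 and BWD_DATA_ALGO_0, at some cost in speed.
    bwd_filter_algo_[i] = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;
    bwd_data_algo_[i]   = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
  }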

shelhamer commented 7 years ago
  3. Multi-GPU training: the order of data fetching and coordination between multiple data layers in a net is non-deterministic (based on race conditions). This is my fault in #2903.

This is fixed with the switch to new parallelism in #4563. The non-determinism of cuDNN can be addressed by setting engine: CAFFE instead, and for CPU one can pick a BLAS other than MKL. I think these are acceptable workarounds. A FORCE_DETERMINISM mode that trades performance for determinism could however be incorporated for a more fatalistic Caffe.
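For example, a convolution layer can opt out of cuDNN per layer in the net prototxt (a sketch; the layer names and shapes are placeholders):

  layer {
    name: "conv1"
    type: "Convolution"
    bottom: "data"
    top: "conv1"
    convolution_param {
      num_output: 64
      kernel_size: 3
      engine: CAFFE   # use the native Caffe GPU kernels instead of cuDNN
    }
  }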

gaobb commented 6 years ago

I have met the same issue.

Himeshi commented 5 years ago

I am faced with the same issue.

I tried rodrigoberriel's solution, but still got non-deterministic results when training. Is there a way to get consistent results on each run without disabling cuDNN, since doing so will slow down training?