NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Is nvcaffe's cudnn_conv_layer (.cu, .hpp, .cpp) safe to use in two separate Net objects inferencing in separate threads? #554

Closed · excubiteur closed 5 years ago

excubiteur commented 5 years ago

This is in relation to:

https://devtalk.nvidia.com/default/topic/1046795/jetson-tx2/nvcaffe-0-17-used-in-two-plugins-in-the-same-pipe-crashes/

I did more digging and found that test_mem_req_all_grps_ is a static member of CuDNNConvolutionLayer.

So my question is: is nvcaffe's cudnn_conv_layer (.cu, .hpp, .cpp) safe to use in two separate Net objects inferencing in separate threads?

excubiteur commented 5 years ago

I checked the mainline (BVLC) version of caffe. That version of CuDNNConvolutionLayer has no static data members.

drnikolaev commented 5 years ago

Hi @excubiteur, there are actually more:

  static std::atomic<size_t> train_mem_req_all_grps_;
  static std::atomic<size_t> test_mem_req_all_grps_;
  static std::atomic<size_t> train_tmp_weights_mem_;

They are atomic counters that maintain the maximum memory requirement across all groups; operations on them are thread-safe by definition.
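
For reference, a minimal generic sketch (an assumed pattern, not nvcaffe's actual code) of how such an atomic maximum can be updated safely from multiple threads with a compare-exchange loop:

    #include <atomic>
    #include <cstddef>

    static std::atomic<std::size_t> test_mem_req_all_grps_{0};

    // Raise the stored maximum to at least `candidate`, safely from any thread.
    void update_max_mem(std::size_t candidate) {
      std::size_t current = test_mem_req_all_grps_.load();
      while (candidate > current &&
             !test_mem_req_all_grps_.compare_exchange_weak(current, candidate)) {
        // On failure compare_exchange_weak reloads `current`; the loop exits
        // once another thread has stored a larger value or our store succeeds.
      }
    }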

The crash must be caused by something else. This seems to be the culprit:

"I am also getting this: W0130 10:43:05.240928 23861 gpu_memory.cpp:129] Lazily initializing GPU Memory Manager Scope on device 0. Note: it's recommended to do this explicitly in your main() function. Not sure if it is related to the crash, but how do I initialize the 'GPU Memory Manager Scope'?"

Here is how:

   std::vector<int> gpus;
   // ... fill gpus with the device IDs to use ...
   caffe::GPUMemory::Scope gpu_memory_scope(gpus);
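
For context, a hedged sketch (header paths, file names, and the Net API calls are assumptions, not verified against nvcaffe) of initializing the scope once in main() before any per-thread Net is constructed:

    #include <string>
    #include <thread>
    #include <vector>

    #include "caffe/caffe.hpp"            // assumed umbrella header
    #include "caffe/util/gpu_memory.hpp"  // assumed header for GPUMemory::Scope

    int main() {
      std::vector<int> gpus;
      gpus.push_back(0);  // device IDs to use; here, GPU 0 only

      // Construct the scope before any Net objects exist, so the memory
      // manager is initialized explicitly rather than lazily on first use.
      caffe::GPUMemory::Scope gpu_memory_scope(gpus);

      // Placeholder model/weight file names; the Net constructor and calls
      // below follow the common caffe pattern and may need adjusting to the
      // exact nvcaffe signatures.
      auto worker = [](const std::string& model, const std::string& weights) {
        caffe::Caffe::set_mode(caffe::Caffe::GPU);
        caffe::Net net(model, caffe::TEST);
        net.CopyTrainedLayersFrom(weights);
        net.Forward();
      };

      std::thread t1(worker, "net1.prototxt", "net1.caffemodel");
      std::thread t2(worker, "net2.prototxt", "net2.caffemodel");
      t1.join();
      t2.join();
      return 0;
    }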
excubiteur commented 5 years ago

test_mem_req_all_grps_ appears twice in CuDNNConvolutionLayer<Ftype, Btype>::FindExConvAlgo.

Even though each access is atomic, could a change made by another thread between the two accesses affect the correctness of the member function? The second occurrence is passed as an argument to mem_fmt, which sounds like it could depend on test_mem_req_all_grps_ not changing between the two reads, but I can't be sure.
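
To illustrate the concern with a generic sketch (not nvcaffe's actual code): each individual load of the atomic is safe, but a function that reads the counter twice can observe two different values unless it takes a single snapshot.

    #include <atomic>
    #include <cstddef>

    extern std::atomic<std::size_t> test_mem_req_all_grps_;  // the shared counter

    void two_reads_sketch() {
      // Each load is individually atomic, but another thread may store a new
      // value between them, so first_read == second_read is NOT guaranteed.
      std::size_t first_read  = test_mem_req_all_grps_.load();
      std::size_t second_read = test_mem_req_all_grps_.load();  // e.g. the mem_fmt argument

      // If correctness requires both uses to see the same value, read once
      // and reuse the snapshot:
      std::size_t snapshot = test_mem_req_all_grps_.load();
      (void)first_read; (void)second_read; (void)snapshot;
    }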

Thanks for any help you can provide.

excubiteur commented 5 years ago

Thanks for taking the time to look at the problem. Really appreciate it.

I tried caffe::GPUMemory::Scope. I get the exact same error at the same line of cudnn_conv_layer.cu.