NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Diagnosing a problem in cudnn_conv_layer.cu (continuation of previous issue) #555

Closed excubiteur closed 5 years ago

excubiteur commented 5 years ago

To be absolutely clear, this is not a solution, only a way to test a suspicion.

I added a mutex to CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu and the problem went away.

If I really need a fix for this problem that can be deployed to production, what would be the best approach?

#ifdef USE_CUDNN
#include <mutex>  // added for std::mutex / std::lock_guard

#include "caffe/filler.hpp"
#include "caffe/layers/cudnn_conv_layer.hpp"
#include "caffe/net.hpp"
#include "caffe/solver.hpp"

// Global mutex used only for this experiment: serialize all Forward_gpu calls.
std::mutex test_mutex;

namespace caffe {

template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
    const vector<Blob*>& top) {
  std::lock_guard<std::mutex> guard(test_mutex);
  // ... rest of Forward_gpu unchanged ...

drnikolaev commented 5 years ago

@excubiteur I see now what you mean. Thank you. I'll need to re-evaluate this approach. I'll let you know soon.

excubiteur commented 5 years ago

I am trying a temporary fix to tide me over until the next release. Please let me know what you think.

For each of the statics, such as test_mem_req_allgrps, there is a corresponding non-static data member, e.g. temp_test_mem_req_allgrps.

The non-static member temp_test_mem_req_allgrps is initialized from test_mem_req_allgrps at the beginning of CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu; this is the only place where the static is read. From that point on, every read and write goes through the non-static member. Finally, just before CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu returns, the static test_mem_req_allgrps is written with the final value of temp_test_mem_req_allgrps.
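In code, the shape of the change is roughly this (a sketch only, with the rest of the Forward_gpu body elided; temp_test_mem_req_allgrps is the member I added, not something from upstream):

template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
    const vector<Blob*>& top) {
  // Read the static exactly once, at the top of the call.
  temp_test_mem_req_allgrps = test_mem_req_allgrps;

  // ... original body, with every later read and write of test_mem_req_allgrps
  // replaced by the non-static member temp_test_mem_req_allgrps ...

  // Publish the final value back to the static just before returning.
  test_mem_req_allgrps = temp_test_mem_req_allgrps;
}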

drnikolaev commented 5 years ago

@excubiteur here is the fix candidate, would you mind giving it a try? https://github.com/drnikolaev/caffe/tree/caffe-0.17

excubiteur commented 5 years ago

I am still getting the same error.

To make sure I was doing it right, I checked that the changes were there - especially the removal of the statics.

The error occurs in the exact same line of code in cudnn_conv_layer.cu (which was not changed by the fix).

The same experiment of crudely putting a mutex in CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu also made the problem go away.

For now I will use the crude mutex "fix" to tide me over.

I will try to give you code to reproduce the problem but this will take time (I will need to remove commercial code and unnecessary dependencies).

excubiteur commented 5 years ago

Looking at line 18 of cudnn_conv_layer.cu:

GPUMemory::Workspace* ws = GPUMemory::workspace[Caffe::current_device()].get();

Both of my caffe::Net instances are using device 0, and each caffe::Net instance calls Forward from its own separate, independent thread.
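A minimal sketch of the setup that triggers it (model paths are placeholders, weight loading is omitted, and the Net constructor arguments are simplified): two independent caffe::Net instances on device 0, each driven by its own thread, so both end up sharing GPUMemory::workspace[0].

#include <thread>

#include "caffe/caffe.hpp"

// Each worker thread drives its own net; both nets sit on device 0.
void run(caffe::Net* net) {
  caffe::GPUMemory::Scope gpu_memory_scope({0});  // GPU memory scope for this thread
  for (int i = 0; i < 1000; ++i) {
    net->Forward();  // both threads reach CuDNNConvolutionLayer::Forward_gpu concurrently
  }
}

int main() {
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
  caffe::Caffe::SetDevice(0);

  caffe::Net net_a("model_a.prototxt", caffe::TEST);  // placeholder model
  caffe::Net net_b("model_b.prototxt", caffe::TEST);  // placeholder model

  std::thread t_a(run, &net_a);
  std::thread t_b(run, &net_b);
  t_a.join();
  t_b.join();
  return 0;
}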

drnikolaev commented 5 years ago

@excubiteur this is what I meant by the commented-out function here: https://github.com/drnikolaev/caffe/blob/caffe-0.17/src/caffe/layers/cudnn_conv_layer.cpp#L29 But a better solution is coming soon.

drnikolaev commented 5 years ago

@excubiteur please pull again and verify

excubiteur commented 5 years ago

It's looking good. If you don't hear from me for more than a week, you can consider it completely fixed :)

And many thanks for the incredible effort put into fixing this.

excubiteur commented 5 years ago

Painful discovery:

caffe::GPUMemory::Scope gpu_memory_scope({0});

needs to appear once on every thread that calls Net::Forward. I was using a third-party framework that spawned and then killed a thread for each task. The thread ID was the same each time because I was running one task at a time, so it took a while to realize that threads were being spawned and killed and were getting the same ID every time.
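For anyone hitting the same thing, here is a sketch of what each per-task worker has to look like (the elided comments stand in for whatever feeds your particular net):

#include "caffe/caffe.hpp"

// Body of each short-lived worker thread the framework spawns.
// The GPU memory scope must be set up on this thread, before any Forward call.
void worker_thread_body(caffe::Net* net) {
  caffe::GPUMemory::Scope gpu_memory_scope({0});  // once per thread that calls Forward

  // ... copy this task's inputs into net's input blobs ...
  net->Forward();
  // ... read results from net's output blobs ...
}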

drnikolaev commented 5 years ago

Looks like it’s time to clean up all the thread ID dependencies. I’ll check it out.

drnikolaev commented 5 years ago

Hi @excubiteur please pull again.

excubiteur commented 5 years ago

So far so good. Will be reporting here if we observe any abnormal behavior. Many thanks for all the work put in :)