@excubiteur I see now what you mean. Thank you. I'll need to re-evaluate this approach. I'll let you know soon.
I am trying a temporary fix to tide me over until the next release. Please let me know what you think.
For each of the statics, such as test_mem_req_allgrps, there is a corresponding non-static data member temp_test_mem_req_allgrps.
The non-static member temp_test_mem_req_allgrps is initialized from test_mem_req_allgrps at the beginning of CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu; this is the only place where the static is read. From that point on, every read and write goes through temp_test_mem_req_allgrps. Finally, just before CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu returns, the static test_mem_req_allgrps is written back with the final value of temp_test_mem_req_allgrps.
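A minimal sketch of what I mean, with the real body of Forward_gpu elided; aside from the two names test_mem_req_allgrps and temp_test_mem_req_allgrps, everything here is a placeholder rather than the actual NVCaffe code:

```cpp
template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
                                                      const vector<Blob*>& top) {
  // Read the shared static exactly once, into this instance's own copy.
  temp_test_mem_req_allgrps = test_mem_req_allgrps;

  // ... original body: every former read or write of test_mem_req_allgrps
  //     now goes through temp_test_mem_req_allgrps instead ...

  // Write the final value back to the static just before returning.
  test_mem_req_allgrps = temp_test_mem_req_allgrps;
}
```

The initial read and the final write of the static can of course still race, which is why this is only meant to tide me over.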
@excubiteur here is the fix candidate, would you mind giving it a try? https://github.com/drnikolaev/caffe/tree/caffe-0.17
I am still getting the same error.
To make sure I was doing it right, I checked that the changes were there - especially the removal of the statics.
The error occurs at the exact same line of code in cudnn_conv_layer.cu (which was not changed by the fix).
The same experiment of crudely putting a mutex in CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu also made the problem go away.
For now I will use the crude mutex "fix" to tide me over.
I will try to give you code to reproduce the problem but this will take time (I will need to remove commercial code and unnecessary dependencies).
Looking at line 18 of cudnn_conv_layer.cu: `GPUMemory::Workspace* ws = GPUMemory::workspace[Caffe::current_device()].get();`
The workspace is looked up by device, and both of my caffe::Net instances are using device 0. Each caffe::Net instance in my case runs Forward in its own independent thread.
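For context, the shape of what I am doing is roughly the following hypothetical sketch; how net_a and net_b are built is omitted and the exact Forward overload is assumed, but the point is simply two independent nets on device 0 forwarding concurrently:

```cpp
#include <functional>
#include <thread>
#include "caffe/net.hpp"

// Each net runs Forward on its own thread; both nets target GPU 0.
void forward_worker(caffe::Net& net) {
  net.Forward();
}

void run_both(caffe::Net& net_a, caffe::Net& net_b) {
  std::thread t1(forward_worker, std::ref(net_a));
  std::thread t2(forward_worker, std::ref(net_b));
  t1.join();
  t2.join();
}
```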
@excubiteur this is what I meant by the commented function here: https://github.com/drnikolaev/caffe/blob/caffe-0.17/src/caffe/layers/cudnn_conv_layer.cpp#L29 But a better solution is coming soon.
@excubiteur please pull again and verify
It's looking good. If you don't hear from me for more than a week, you can consider it fixed perfectly :)
And many thanks for the incredible effort put into fixing this.
Painful discovery.
`caffe::GPUMemory::Scope gpu_memory_scope({0});` needs to appear once on every thread that calls Net::Forward.
I was using a 3rd-party framework that spawned and then killed a thread for each task. Because I was running one task at a time, each new thread got the same thread ID, so it took a while to realize that threads were actually being spawned and killed between tasks. A rough sketch of the per-thread fix is below.
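This is a hypothetical sketch: the task body and how the framework invokes it are placeholders, the gpu_memory.hpp include path is my assumption, and only the scope line is the essential part:

```cpp
#include "caffe/net.hpp"
#include "caffe/util/gpu_memory.hpp"

// Worker body the framework runs for every task. The framework spawns a
// fresh thread per task, so the scope cannot be set up once on the main
// thread; it has to be created here, on the thread that calls Forward.
void run_task(caffe::Net& net) {
  caffe::GPUMemory::Scope gpu_memory_scope({0});  // once per forwarding thread
  net.Forward();
}
```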
Looks like it's time to clean up all thread-id dependencies. I'll check it out.
Hi @excubiteur please pull again.
So far so good. Will be reporting here if we observe any abnormal behavior. Many thanks for all the work put in :)
To be absolutely clear, this is not a solution, only to test a suspicion.
I added a mutex to CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu and the problem went away.
If I really need a fix to this problem that can be deployed to production, what would be the best approach?
```cpp
#ifdef USE_CUDNN
#include <mutex>
#include <vector>
#include "caffe/filler.hpp"
#include "caffe/layers/cudnn_conv_layer.hpp"
#include "caffe/net.hpp"
#include "caffe/solver.hpp"

std::mutex test_mutex;

namespace caffe {

template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
                                                      const vector<Blob*>& top) {
  // Serialize concurrent Forward_gpu calls across threads (experiment only).
  std::lock_guard<std::mutex> guard(test_mutex);
  // ... rest of Forward_gpu unchanged ...
```