davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

[Bug]: Unhandled exception thrown in ~cudnn_context() #2934

Closed: kSkip closed this issue 7 months ago

kSkip commented 7 months ago

What Operating System(s) are you seeing this problem on?

Windows

dlib version

14ba5572e7b1ce4c5eb5ca690484583a4bc4717f

Python version

N/A

Compiler

MSVC 14 (Visual Studio 16.11.34)

Expected Behavior

Programs that correctly train neural networks and reach the end of main() without error should terminate successfully.

Current Behavior

Either during the exit of main() or afterwards, an access violation occurs in ~cudnn_context(). The exact error is

Exception thrown at <address> (nvcuda64.dll) in test.exe:
0xC0000005: Access violation reading location <address>

and it appears to be a problem with the call to cudnnDestroy here: https://github.com/davisking/dlib/blob/14ba5572e7b1ce4c5eb5ca690484583a4bc4717f/dlib/cuda/cudnn_dlibapi.cpp#L91

Steps to Reproduce

Building and running dnn_introduction_ex.cpp demonstrates the error; I believe any of the dnn training examples will.

Here is a minimal example, though:

#include <dlib/dnn.h>

using namespace dlib;

int main(void)
{
    using net_type = loss_binary_log<fc<1, input<matrix<float>>>>;

    std::vector<matrix<float>> mini_batch_samples = { {1.0f, 1.0f, 1.0f, 1.0f} };
    std::vector<float> mini_batch_labels = { 1.0f };

    net_type net;
    dnn_trainer<net_type> trainer(net);
    trainer.train_one_step(mini_batch_samples, mini_batch_labels);

    return 0;
}

Anything else?

I tested this with different versions of CUDA and cuDNN. This error occurs with CUDA/cuDNN versions 12.4/9.0 and also 11.8/8.9.7. I have two GTX 1080 Ti GPUs with driver version 551.86.

It seems similar to https://github.com/davisking/dlib/issues/2186.

Additionally, I have written some small programs against the cuDNN API directly (no dlib), and I have not encountered the error when calling cudnnDestroy in those cases.

This occurs both with Debug and Release builds.

davisking commented 7 months ago

Is it somehow related to using the trainer and not just exercising the network object in a way that calls cudnn? This really ought to work fine, unless there is some new change in visual studio+cudnn that makes thread_local not work here.

kSkip commented 7 months ago

Exercising the network does not reproduce the error, but using the trainer does.

When I created this issue, I did not realize that training is performed in a separate thread even with a single device. So, I tested the effect of the thread_local specifier, and found that this code

#include <iostream>
#include <thread>
#include <cudnn.h>

class cudnn_context
{
public:
    cudnn_context(const cudnn_context&) = delete;
    cudnn_context& operator=(const cudnn_context&) = delete;

    cudnn_context() : handle(nullptr) {}
    ~cudnn_context()
    {
        if (handle)
            cudnnDestroy(handle);
    }

    cudnnHandle_t get_handle()
    {
        if (!handle)
            cudnnCreate(&handle);
        return handle;
    }

private:
    cudnnHandle_t handle;
};

static cudnnHandle_t context()
{
    thread_local cudnn_context c;
    return c.get_handle();
}

void fn()
{
    std::cout << context() << "\n";
    return;
}

int main(void)
{
    std::thread t(fn);
    t.join();
    return 0;
}

prints the handle address value and then throws the access violation.

However, this example

#include <iostream>
#include <thread>
#include <cudnn.h>

void fn()
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);
    std::cout << handle << "\n";
    cudnnDestroy(handle);
    return;
}

int main(void)
{
    std::thread t(fn);
    t.join();
    return 0;
}

prints the address and terminates successfully.

I would think that thread_local basically does the same thing as the second example, but maybe there is something else happening to the CUDA context.

Thanks for the help!

davisking commented 7 months ago

Yeah, there seems to be a bug in CUDA, or maybe in that version of Visual Studio. Try this kind of thing:

Take the little crashing program you have there, but replace thread_local with static. Does that crash? Maybe it's purely about cudnnDestroy() being called after main() has terminated, which really ought to be fine.

kSkip commented 7 months ago

Changing the specifier to static eliminates the exception. Also, calling cudnnDestroy from main() after the thread is finished works successfully.

Maybe it's purely about cudnnDestroy() being called after main() has terminated

That does not appear to be the issue though. I verified that the thread_local destructor is called before all the global/static object destructors. Additionally, putting the code in a scope block like this

int main(void)
{
    {
        std::thread t(fn);
        t.join();
    } // exception thrown
    return 0; // program never arrives here
}

still results in an exception even though main() has not returned.

I did a little more investigating, and the same issue appears elsewhere: in PyTorch https://github.com/pytorch/pytorch/issues/17658, TVM https://github.com/apache/tvm/pull/8267, and Caffe2 https://github.com/pytorch/pytorch/pull/95382. It is mostly (but not exclusively) happening on Windows, and appears to be a problem with cuDNN/CUDA.

Anyways, I made some local changes to cudnn_context to implement a static handle pool that reuses handles when a thread is done and cleans them up on program exit (PyTorch does something similar). This works because cuDNN handles are tied to a device context, not a thread; the only requirement is that multiple threads do not use the same handle simultaneously. It fixed the problem for me, and it has the added benefit of not having to re-initialize a handle for every new thread.

Let me know if you are interested in this kind of change. I am happy to submit a PR.

davisking commented 7 months ago

Yeah, a PR that works around the problem would be cool. It's got to work in all contexts though. Not sure if you were saying it wasn't totally thread safe.

So what's the exact bug? Just that you can't call cudnnDestroy in code executed by thread_local? It might also be that you can't call cudnnCreate from one thread and then cudnnDestroy from a different thread. I've no idea how visual studio executes thread_local. Maybe it doesn't execute the thread_local destruction on the same thread that created the thread local object? I'm just suspicious that the problem isn't purely about thread_local, since that seems like something surprising to have cudnn somehow mess up. I wouldn't be surprised if there was a "oh but cudnnDestroy can't run on a different thread from the one that created the handle bug" though.

davisking commented 7 months ago

Well, in any case, post your change and we can look at it :)

kSkip commented 7 months ago

Not sure if you were saying it wasn't totally thread safe.

I believe a reusable handle pool should be thread safe, at least based on this statement from the cuDNN documentation on thread safety:

The cuDNN library is thread-safe. Its functions can be called from multiple host threads, so long as the threads do not share the same cuDNN handle simultaneously.

which supports the idea that transferring ownership of a handle between threads is acceptable.

So what's the exact bug?

Honestly, I don't know. It makes no sense that the thread_local destructor could be the issue; I verified that thread_local destruction itself works properly with several MSVC versions.

However, after testing several different drivers, CUDA/cuDNN versions, and MSVC toolsets, I finally found that the 537.83 NVIDIA driver works. So, it seems this is somehow a driver bug that is present in possibly all of the 54x.xx and 55x.xx Windows drivers. I filed a bug report with NVIDIA.

Thanks for your help @davisking!

davisking commented 7 months ago

Sweet, yeah, glad it's working now and we don't have to hack around it :D