cogciprocate / ocl

OpenCL for Rust

error: getting `STATUS_ACCESS_VIOLATION` in `Device::list()` #219

Closed poszu closed 1 year ago

poszu commented 1 year ago

The following simple program crashes on Windows 11 (Dell XPS 15 with Nvidia 3050 Ti):

fn main() {
    let platforms = ocl::Platform::list();
    for platform in platforms {
        _ = ocl::Device::list(platform, None).unwrap();
    }
}

Error:

error: process didn't exit successfully: `target\debug\crash_repro.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

I'm unsure how to debug this. I suspect something is wrong with my environment or drivers. Could you please help?

c0gent commented 1 year ago

I don't have any Nvidia machines set up right now and am unable to reproduce.

One thing I like to do is write an equivalent C++ program to isolate the problem and determine whether it's a driver bug or not. It almost always is. By digging through the three abstraction layers (ocl -> ocl-core -> cl_sys) you can figure out exactly which parameters each OpenCL function is called with, and in what order. You could also just copy most of the boilerplate from an online tutorial (here's one I randomly googled: https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/).

poszu commented 1 year ago

@c0gent, I finally got some time to dig deeper. I haven't tried writing a C++ equivalent yet; I will try that next. So far I have noticed that the crash happens only if the Intel integrated GPU is enabled (I have Intel Iris Xe Graphics on an i7-12700H with the most recent driver, 5/12/2023, ver. 31.0.101.4338). It doesn't crash if I simply disable the device in Device Manager. Disabling the Nvidia GPU doesn't help.

I also simplified the program causing the crash to:

fn main() {
    let platforms = ocl::Platform::list();

    for platform in platforms.iter() {
        println!("Platform: {platform:?}");
        println!("Platform name: {:?}", platform.name());
    }
}

output:

Platform: Platform(PlatformId(0x131296445c0))
Platform name: Ok("NVIDIA CUDA")
Platform: Platform(PlatformId(0x13130056d10))
error: process didn't exit successfully: `target\debug\crash_repro.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

The crash happens on line: https://github.com/cogciprocate/ocl/blob/0c87db692a9eb8c03e9ccfb2248c92d4b13b4cb7/ocl-core/src/functions.rs#L610-L618

TBH, it looks like the problem is somewhere in the dll.

c0gent commented 1 year ago

Very interesting. Yeah, if you get time let's try the C++ equivalent just to isolate the problem for sure.

poszu commented 1 year ago

Reproduced it with C++. I followed the guide here: https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/getting_started_windows.md.

It seems to crash because platform->dispatch is not initialized.

The program:

#include <stdio.h>
#include <vector>

#include <CL/cl.h>

int main()
{
    cl_uint numPlatforms = 0;
    cl_int CL_err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_err == CL_SUCCESS) {
        printf_s("%u platform(s) found\n", numPlatforms);
    } else {
        printf_s("clGetPlatformIDs failed: %i\n", CL_err);
        return 1;
    }

    // Get all platforms
    std::vector<cl_platform_id> platform(numPlatforms);
    CL_err = clGetPlatformIDs( numPlatforms, platform.data(), NULL );
    if (CL_err != CL_SUCCESS) {
        printf_s("clGetPlatformIDs failed: %i\n", CL_err);
        return 1;
    }

    // list names of all platforms
    for (cl_uint id = 0; id < numPlatforms; id++ ) {
        size_t result_size = 0;
        CL_err = clGetPlatformInfo(platform[id], CL_PLATFORM_NAME, 0, NULL, &result_size);
        if (CL_err != CL_SUCCESS) {
            printf_s("clGetPlatformInfo failed: %i\n", CL_err);
            return 1;
        }

        std::vector<char> result(result_size);
        CL_err = clGetPlatformInfo(platform[id], CL_PLATFORM_NAME, result_size, result.data(), NULL);
        if (CL_err != CL_SUCCESS) {
            printf_s("clGetPlatformInfo failed: %i\n", CL_err);
            return 1;
        }

        printf_s("Platform %u: %s\n", id, result.data());
    }
    return 0;
}
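
To illustrate why a missing platform->dispatch ends in an access violation: under the cl_khr_icd extension, the object behind each platform handle begins with a pointer to the vendor's dispatch table, and calls such as clGetPlatformInfo are forwarded through that table. The Rust sketch below is a toy model of that idea, not the loader's actual code. `Dispatch`, `PlatformObject`, and `forward_get_name` are hypothetical names, and it covers only the simplest case, a null pointer.

```rust
use std::ptr;

/// Toy stand-in for a vendor's table of entry points.
struct Dispatch {
    get_platform_name: fn() -> &'static str,
}

/// Toy stand-in for the object behind a cl_platform_id:
/// its first field points at the vendor's dispatch table.
#[repr(C)]
struct PlatformObject {
    dispatch: *const Dispatch,
}

fn nvidia_name() -> &'static str {
    "NVIDIA CUDA"
}

static NVIDIA_DISPATCH: Dispatch = Dispatch {
    get_platform_name: nvidia_name,
};

/// Forwards the call through the dispatch table, refusing a null table
/// instead of dereferencing it.
fn forward_get_name(platform: &PlatformObject) -> Result<&'static str, &'static str> {
    // SAFETY: in this sketch the pointer is either null or points at a
    // `static` Dispatch, so `as_ref` is sound.
    match unsafe { platform.dispatch.as_ref() } {
        Some(table) => Ok((table.get_platform_name)()),
        None => Err("platform has no dispatch table"),
    }
}

fn main() {
    let good = PlatformObject { dispatch: &NVIDIA_DISPATCH };
    let broken = PlatformObject { dispatch: ptr::null() };
    println!("{:?}", forward_get_name(&good));
    println!("{:?}", forward_get_name(&broken));
}
```

Code that follows the pointer without such a check reads from an invalid address, which Windows reports as STATUS_ACCESS_VIOLATION.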

c0gent commented 1 year ago

Presumably this is an issue with the ICD loader, the component that chooses between and loads the actual drivers. I would first try a clean install of your Nvidia drivers. If that doesn't work, try installing the latest Intel OpenCL CPU Runtime (Windows: win-oclcpuexp-2022.14.8.0.04_rel.zip). Perhaps the Intel ICD will work.

Beyond that, you'll have to do some more digging into troubleshooting the ICD. That is something I've never done on Windows, so I wouldn't be of any help. Good luck!

c0gent commented 1 year ago

Here's the Khronos ICD loader page you could try too: https://github.com/KhronosGroup/OpenCL-ICD-Loader. Should work with any driver.

poszu commented 1 year ago

I found out what was causing the crash. I had a weird Windows registry entry in HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors: an entry named intelopencl64.dll with the value 0x00000000.

Removing it fixed the crash.
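
For anyone checking the same key, its values can be listed with `reg query "HKLM\SOFTWARE\Khronos\OpenCL\Vendors"`. The Rust sketch below parses that command's text output so each registered library can be reviewed. `VendorEntry` and `parse_vendor_entries` are hypothetical helpers, not part of ocl, and the NVIDIA path in the sample is made up for illustration. As far as I understand the cl_khr_icd extension, a DWORD value of 0 marks a library the loader will open, so the value alone does not identify a stale entry; the library it names is what needs checking.

```rust
/// One value under HKLM\SOFTWARE\Khronos\OpenCL\Vendors.
#[derive(Debug, PartialEq)]
struct VendorEntry {
    /// Value name: the ICD library the loader is asked to open.
    library: String,
    /// Value data (REG_DWORD).
    value: u32,
}

impl VendorEntry {
    /// Assumption based on cl_khr_icd: the loader opens entries whose data is 0.
    fn is_enabled(&self) -> bool {
        self.value == 0
    }
}

/// Parses the text printed by `reg query` for the Vendors key.
/// Lines that are not REG_DWORD values (the key header, blanks) are skipped.
fn parse_vendor_entries(reg_output: &str) -> Vec<VendorEntry> {
    const TYPE_TAG: &str = "REG_DWORD";
    reg_output
        .lines()
        .filter_map(|line| {
            let line = line.trim();
            // Search from the right so paths containing spaces stay intact.
            let idx = line.rfind(TYPE_TAG)?;
            let library = line[..idx].trim_end();
            let data = line[idx + TYPE_TAG.len()..].trim();
            let hex = data.strip_prefix("0x").or_else(|| data.strip_prefix("0X"))?;
            let value = u32::from_str_radix(hex, 16).ok()?;
            if library.is_empty() {
                return None;
            }
            Some(VendorEntry { library: library.to_string(), value })
        })
        .collect()
}

fn main() {
    // Sample text in the shape `reg query` prints.
    let sample = r"HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors
    C:\Windows\System32\nvopencl64.dll    REG_DWORD    0x0
    intelopencl64.dll    REG_DWORD    0x0";

    for entry in parse_vendor_entries(sample) {
        println!("{} enabled={}", entry.library, entry.is_enabled());
    }
}
```

Any listed library that belongs to an uninstalled runtime, or that no longer exists on disk, is a candidate for the kind of leftover entry described above.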

c0gent commented 1 year ago

Great!

I'm curious to know how you found your way to this solution. If you have time, would you mind posting how you troubleshot this, for the benefit of anyone else who comes across a similar issue?

No obligation. Grats again :)

poszu commented 1 year ago

Sure, but it was nothing more than luck :)

I ran out of options for debugging the C++ program and figured that the problem must be in the DLL. So I searched the Internet to see how the DLL is picked up and found a forum post mentioning that registry entry.

It is the Chocolatey opencl-intel-cpu-runtime package that installs the DLL and adds this registry entry. Unfortunately, uninstalling the package doesn't remove either of them; they must be removed manually. My guess is that a DLL meant for the CPU is loaded for the Intel GPU.