Closed by poszu 1 year ago
I don't have any Nvidia machines set up right now and am unable to reproduce.
One thing I like to do is to write a C++ equivalent program to isolate the problem and determine whether it's a driver bug or not. It almost always is. By digging through the three abstraction layers (ocl -> ocl-core -> cl_sys) you can figure out exactly which parameters each OpenCL function is called with, and in what order. You could also just copy most of the boilerplate from a tutorial online (here's one I randomly googled: https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/).
@c0gent, I finally got some time to dig deeper. I haven't tried writing a C++ equivalent yet; I will try that next. So far I have noticed that the crash happens only if the Intel integrated GPU is enabled (I have Intel Iris Xe Graphics on an i7-12700H with the most recent driver, 5/12/2023, ver. 31.0.101.4338). It doesn't crash if I simply disable the device in Device Manager. Disabling the Nvidia GPU doesn't help.
I also simplified the program causing the crash to:
    fn main() {
        let platforms = ocl::Platform::list();
        for platform in platforms.iter() {
            println!("Platform: {platform:?}");
            println!("Platform name: {:?}", platform.name());
        }
    }
output:
    Platform: Platform(PlatformId(0x131296445c0))
    Platform name: Ok("NVIDIA CUDA")
    Platform: Platform(PlatformId(0x13130056d10))
    error: process didn't exit successfully: `target\debug\crash_repro.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)
The crash happens on line: https://github.com/cogciprocate/ocl/blob/0c87db692a9eb8c03e9ccfb2248c92d4b13b4cb7/ocl-core/src/functions.rs#L610-L618
TBH, it looks like the problem is somewhere in the dll.
Very interesting. Yeah, if you get time let's try the C++ equivalent just to isolate the problem for sure.
Reproduced it with C++. I followed the guide here: https://github.com/KhronosGroup/OpenCL-Guide/blob/main/chapters/getting_started_windows.md.
It seems to crash because platform->dispatch is not initialized.
The program:
    #include <stdio.h>
    #include <vector>
    #include <CL/cl.h>

    int main()
    {
        // First call: ask only for the number of platforms
        cl_uint numPlatforms = 0;
        cl_int CL_err = clGetPlatformIDs(0, NULL, &numPlatforms);
        if (CL_err == CL_SUCCESS) {
            printf_s("%u platform(s) found\n", numPlatforms);
        } else {
            printf_s("clGetPlatformIDs(%i)\n", CL_err);
            return 1;
        }
        // Get all platforms
        std::vector<cl_platform_id> platform(numPlatforms);
        CL_err = clGetPlatformIDs(numPlatforms, platform.data(), NULL);
        if (CL_err != CL_SUCCESS) {
            printf_s("clGetPlatformIDs failed: %i\n", CL_err);
            return 1;
        }
        // List names of all platforms
        for (cl_uint id = 0; id < numPlatforms; id++) {
            // Query the size of the name first, then the name itself
            size_t result_size = 0;
            CL_err = clGetPlatformInfo(platform[id], CL_PLATFORM_NAME, 0, NULL, &result_size);
            if (CL_err != CL_SUCCESS) {
                printf_s("clGetPlatformInfo (size query) failed: %i\n", CL_err);
                return 1;
            }
            std::vector<char> result(result_size);
            CL_err = clGetPlatformInfo(platform[id], CL_PLATFORM_NAME, result_size, result.data(), NULL);
            if (CL_err != CL_SUCCESS) {
                printf_s("clGetPlatformInfo failed: %i\n", CL_err);
                return 1;
            }
            printf_s("Platform %u: %s\n", id, result.data());
        }
        return 0;
    }
Presumably this is some issue with the ICD (Installable Client Driver) loader, the component that chooses between and loads the actual drivers. I would first try doing a clean install of your Nvidia drivers. If that doesn't work, try installing the latest Intel OpenCL CPU Runtime (Windows: win-oclcpuexp-2022.14.8.0.04_rel.zip). Perhaps the Intel ICD will work.
Beyond that you'll have to do some more digging on troubleshooting the ICD which is something I've never done on Windows and wouldn't be of any help. Good luck!
Here's the Khronos ICD loader page you could try too: https://github.com/KhronosGroup/OpenCL-ICD-Loader. Should work with any driver.
I found out what was causing the crash. I had a weird Windows registry entry in HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors. There was an entry intelopencl64.dll with a 0x00000000 value.
Removing it fixed the crash.
Great!
I'm curious to know how you found your way to this solution. If you have time, would you mind posting how you troubleshot this, for the benefit of anyone else who comes across a similar issue?
No obligation. Grats again :)
Sure, but there was nothing more than luck :)
I ran out of options for debugging the C++ program and figured that the problem must be in the DLL. So I searched the Internet to see how the DLL is picked up and found a forum post mentioning that registry entry.
It is the Chocolatey opencl-intel-cpu-runtime package that installs the DLL and adds this registry entry. Unfortunately, removing the package doesn't remove them; they must be removed manually. My guess is that a DLL meant for the CPU is loaded for the Intel GPU.
The following simple program crashes on Windows 11 (Dell XPS 15 with Nvidia 3050 Ti):
Error:
I'm unsure how to debug this. I suspect something is wrong with my environment, drivers, etc. Could you please help?