LWJGL / lwjgl3

LWJGL is a Java library that enables cross-platform access to popular native APIs useful in the development of graphics (OpenGL, Vulkan, bgfx), audio (OpenAL, Opus), parallel computing (OpenCL, CUDA) and XR (OpenVR, LibOVR, OpenXR) applications.
https://www.lwjgl.org
BSD 3-Clause "New" or "Revised" License

OpenCL support for CPU: AMD CPUs not showing up as OpenCL devices #1010

Open goofyseeker311 opened 3 days ago

goofyseeker311 commented 3 days ago

Question

Why do AMD Ryzen 5000 series CPUs not show up as OpenCL devices on Windows 11? NVIDIA GPUs and AMD iGPUs show up just fine in the CLDemo Java program. Where is the issue?

Self-answer: downloading and installing the Intel OpenCL runtime for CPU works for AMD CPUs too. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-cpu-runtime-for-opencl-applications-with-sycl-support.html

goofyseeker311 commented 2 days ago

A second question: why are simple math calculations with Java LWJGL OpenCL much slower on NVIDIA discrete GPUs than even on AMD CPUs and iGPUs, by about 2-4x (multiplication and float4[] matrix multiplication)? NVIDIA OpenCL/CUDA running the same OpenCL program is almost as fast as Java auto-vectorized code on the CPU. I am only counting the time taken to run clEnqueueNDRangeKernel() and clFinish(). All data is pre-uploaded, and clFinish() is called before starting the benchmark run.

goofyseeker311 commented 1 day ago

What is wrong with OpenCL/CUDA? It gets about 1/1000 of the floating point throughput it should be getting: say 2 GFLOPS instead of 0.7-2 TFLOPS for a CPU, and 20 GFLOPS instead of 20 TFLOPS for a GPU. I am doing plain C=A*B float multiplications on arrays, or float4 array multiplications with a matrix-shaped array.
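
A back-of-the-envelope check may help put these numbers in context. An elementwise C=A*B on float arrays performs one floating point operation per element while moving 12 bytes (two 4-byte reads and one 4-byte write), so such a kernel is likely limited by memory bandwidth rather than by peak compute. The sketch below is illustrative only: the class and method names are hypothetical, and the array length, the 1 ms kernel time, and the 500 GB/s bandwidth are assumed example values, not measurements from this thread.

```java
public class ThroughputEstimate {
    /** Achieved GFLOPS given total floating point operations and elapsed nanoseconds. */
    public static double gflops(long flops, long elapsedNanos) {
        // FLOP per nanosecond is numerically equal to GFLOP per second.
        return (double) flops / (double) elapsedNanos;
    }

    /** Upper bound in GFLOPS for a kernel limited by memory bandwidth. */
    public static double bandwidthBoundGflops(double bandwidthGBps, double bytesPerFlop) {
        // GB/s divided by bytes moved per FLOP gives GFLOP/s.
        return bandwidthGBps / bytesPerFlop;
    }

    public static void main(String[] args) {
        long elements = 1L << 24;        // assumed array length, one FLOP per element
        long elapsedNanos = 1_000_000L;  // assumed kernel time of 1 ms
        System.out.println("achieved GFLOPS: " + gflops(elements, elapsedNanos));
        // Assumed 500 GB/s device bandwidth, 12 bytes of traffic per FLOP.
        System.out.println("bandwidth bound GFLOPS: " + bandwidthBoundGflops(500.0, 12.0));
    }
}
```

Under these assumptions the bandwidth bound comes out at a few tens of GFLOPS, far below a multi-TFLOPS compute peak, so a low figure for this kind of kernel does not necessarily indicate a problem in the bindings.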

goofyseeker311 commented 1 day ago

How can you get a long-type event from a PointerBuffer, to use for profiling a kernel enqueued with clEnqueueNDRangeKernel? There is no overload of clGetEventProfilingInfo that takes a PointerBuffer, only ones that take a long event. Also, clEnqueueNDRangeKernel only accepts PointerBuffer events, not long events.

In other words, how can you profile kernel start and end times from LWJGL?

Spasi commented 20 hours ago

Hey @goofyseeker311,

The cl_event * event parameter of clEnqueueNDRangeKernel is an output parameter. If you pass a PointerBuffer there, when the call returns a cl_event value will have been written to it. Example code:

PointerBuffer pe = ...; // cl_event *
clEnqueueNDRangeKernel(..., pe);

long e = pe.get(0); // cl_event
clGetEventProfilingInfo(e, ...);

goofyseeker311 commented 19 hours ago

Yes. (So how do I get the profiling start/end times out of the event, instead of using the code below?)

Never mind. Somehow I was not able to get the pe.get(0) approach working before; I must have done something wrong.

Previous code looked like this:

PointerBuffer event = clStack.mallocPointer(1);
if (CL12.clEnqueueNDRangeKernel(clQueue, clKernel, dimensions, null, globalWorkSize, null, null, event)==CL12.CL_SUCCESS) {
    long ctimestart = System.nanoTime();
    CL12.clWaitForEvents(event);
    long ctimeend = System.nanoTime();
    float ctimedif = (ctimeend-ctimestart)/1000000.0f;
}
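
For comparison, the wall-clock measurement above times the host-side wait, whereas OpenCL event profiling reports device timestamps as 64-bit nanosecond counters. A minimal sketch of the conversion step, assuming the start and end values have already been read from the event; the class and method names are hypothetical, the timestamps are made-up example values, and no LWJGL calls are made:

```java
public class ProfilingTime {
    /** Converts two nanosecond timestamps to elapsed milliseconds. */
    public static double elapsedMillis(long startNanos, long endNanos) {
        // Subtract as long first so large counter values keep full precision,
        // then convert the (small) difference to double.
        return (endNanos - startNanos) / 1_000_000.0;
    }

    public static void main(String[] args) {
        long start = 1_000_000_000_000L; // assumed device timestamp in ns
        long end = start + 2_500_000L;   // assumed: 2.5 ms later
        System.out.println(elapsedMillis(start, end)); // prints 2.5
    }
}
```

Using double rather than float for the result keeps sub-microsecond resolution on longer runs.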

Edit: new code looks like this:

PointerBuffer event = clStack.mallocPointer(1);
if (CL12.clEnqueueNDRangeKernel(clQueue, clKernel, dimensions, null, globalWorkSize, null, null, event)==CL12.CL_SUCCESS) {
    CL12.clWaitForEvents(event);
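    long eventLong = event.get(0); // cl_event handle written by clEnqueueNDRangeKernel
    long[] ctimestart = {0};
    long[] ctimeend = {0};
    // Profiling queries only succeed if the queue was created with CL_QUEUE_PROFILING_ENABLE.
    CL12.clGetEventProfilingInfo(eventLong, CL12.CL_PROFILING_COMMAND_START, ctimestart, (PointerBuffer)null);
    CL12.clGetEventProfilingInfo(eventLong, CL12.CL_PROFILING_COMMAND_END, ctimeend, (PointerBuffer)null);
    // Device timestamps are in nanoseconds; convert the difference to milliseconds.
    float ctimedif = (ctimeend[0]-ctimestart[0])/1000000.0f;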