Closed rayman245 closed 2 weeks ago
Hi @rayman245, thanks for the report. Can you please provide the command you are running? I am curious about the length of the arrays that cause the `CL_OUT_OF_RESOURCES` error. I will try to reproduce it on my machine. Also, I guess you are using the `master` branch?
Hi @rayman245. Let's check the thread blocks. We have seen similar errors depending on the driver used.
Can you run with `--threadInfo` for both configurations (CUDA and OpenCL)? What I want to see is the thread-block sizes that the TornadoVM runtime selected.
An alternative is to try a different thread-block size:
git diff
diff --git a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
index 08cdb7918..aab913079 100644
--- a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
+++ b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
@@ -23,9 +23,12 @@ import java.util.ArrayList;
import java.util.Random;
import java.util.stream.IntStream;
+import uk.ac.manchester.tornado.api.GridScheduler;
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
+import uk.ac.manchester.tornado.api.WorkerGrid;
+import uk.ac.manchester.tornado.api.WorkerGrid2D;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.enums.TornadoDeviceType;
@@ -101,6 +104,10 @@ public class MatrixMultiplication2D {
.task("t0", MatrixMultiplication2D::matrixMultiplication, matrixA, matrixB, matrixC, size) //
.transferToHost(DataTransferMode.EVERY_EXECUTION, matrixC);
+ WorkerGrid workerGrid = new WorkerGrid2D(size, size);
+ workerGrid.setLocalWork(16, 16, 1);
+ GridScheduler grid = new GridScheduler("s0.t0", workerGrid);
+
ArrayList<Long> tornadoElapsedTime = new ArrayList<>();
ArrayList<Long> javaElapsedTime = new ArrayList<>();
ArrayList<Long> streamsElapsedTime = new ArrayList<>();
@@ -110,7 +117,7 @@ public class MatrixMultiplication2D {
long end;
TornadoDeviceType deviceType;
try (TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph)) {
- executor.withWarmUp();
+ executor.withGridScheduler(grid).withWarmUp();
// 1. Warm up Tornado
And then run with `--threadInfo`:
$ tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D
Computing MxM of 512x512
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [512, 512, 1]
Local work size : [16, 16, 1] << Changed here
Number of workgroups : [32, 32, 1]
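For reference, the `Number of workgroups` line in this output follows directly from the global and local work sizes. Below is a minimal Java sketch of that arithmetic. The class and method names (`WorkGroupCount`, `workGroups`, `workItemsPerGroup`) are made up for illustration and are not part of the TornadoVM API, and the sketch assumes every global size is an exact multiple of the matching local size.

```java
import java.util.Arrays;

// Illustrative helper only; not part of the TornadoVM API.
final class WorkGroupCount {

    /** Work-groups per dimension: global[i] / local[i]. */
    static long[] workGroups(long[] global, long[] local) {
        if (global.length != local.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        long[] groups = new long[global.length];
        for (int i = 0; i < global.length; i++) {
            // Assumption: uniform work-groups, so local must divide global.
            if (local[i] <= 0 || global[i] % local[i] != 0) {
                throw new IllegalArgumentException("local size must divide global size");
            }
            groups[i] = global[i] / local[i];
        }
        return groups;
    }

    /** Work-items (threads) inside one work-group: product of the local sizes. */
    static long workItemsPerGroup(long[] local) {
        long items = 1;
        for (long l : local) {
            items *= l;
        }
        return items;
    }

    public static void main(String[] args) {
        long[] global = {512, 512, 1};
        System.out.println(Arrays.toString(workGroups(global, new long[] {16, 16, 1}))); // [32, 32, 1]
        System.out.println(workItemsPerGroup(new long[] {16, 16, 1})); // 256
        System.out.println(workItemsPerGroup(new long[] {32, 32, 1})); // 1024
    }
}
```

A 16x16 block holds 256 work-items per group, while a 32x32 block holds 1024, so the smaller block asks the device for fewer resources per work-group.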
Hi @stratika, sure. I am using the command provided in the example: `tornado --threadInfo --jvm="-Ds0.t0.device=0:0" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D`, and yes, I am using the master branch.
Hi @jjfumero, please find the results below
OpenCL
$ tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D
-> Returned: -5
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [512, 512]
Local work size : [32, 32, 1]
Global work offset: [0, 0]
Global work size : [512, 512]
Local work size : [32, 32, 1]
Number of workgroups : [16, 16]
[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 4080 SUPER (Device 0).
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [512, 512]
Local work size : [32, 32, 1]
Number of workgroups : [16, 16]
[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 4080 SUPER (Device 0).
-> Returned: -5
Single Threaded CPU Execution: 0.99 GFlops, Total time = 271104200 ns
Streams Execution: 8.22 GFlops, Total time = 32653400 ns
TornadoVM Execution on GPU (Accelerated): 330.34 GFlops, Total Time = 812600 ns
Speedup: 333.6256460743293x
Verification false
CUDA
Task info: s0.t0
Backend : PTX
Device : NVIDIA GeForce RTX 4080 SUPER GPU
Dims : 2
Thread dimensions : [512, 512]
Blocks dimensions : [16, 16, 1]
Grids dimensions : [32, 32, 1]
Task info: s0.t0
Backend : PTX
Device : NVIDIA GeForce RTX 4080 SUPER GPU
Dims : 2
Thread dimensions : [512, 512]
Blocks dimensions : [16, 16, 1]
Grids dimensions : [32, 32, 1]
Single Threaded CPU Execution: 1.07 GFlops, Total time = 251657000 ns
Streams Execution: 7.27 GFlops, Total time = 36937500 ns
TornadoVM Execution on GPU (Accelerated): 290.83 GFlops, Total Time = 923000 ns
Speedup: 272.6511375947996x
Verification true
Different Thread Block
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [512, 512, 1]
Local work size : [16, 16, 1]
Number of workgroups : [32, 32, 1]
Task info: s0.t0
Backend : OPENCL
Device : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [512, 512, 1]
Local work size : [16, 16, 1]
Number of workgroups : [32, 32, 1]
Single Threaded CPU Execution: 1.05 GFlops, Total time = 255384600 ns
Streams Execution: 8.35 GFlops, Total time = 32154200 ns
TornadoVM Execution on GPU (Accelerated): 376.49 GFlops, Total Time = 713000 ns
Speedup: 358.1831697054698x
Verification true
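As a sanity check on the figures reported above, the GFlops and speedup values are consistent with counting 2·N³ floating-point operations for an N×N matrix multiplication and dividing by the elapsed time. The Java sketch below reproduces the numbers from the last run (N = 512). The class and method names are illustrative, and the 2·N³ operation count is inferred from the reported values rather than quoted from the example's source.

```java
import java.util.Locale;

// Illustrative helper only; names are not taken from the TornadoVM example.
final class MxMThroughput {

    /** GFlop/s for an n x n matrix multiplication, assuming 2 * n^3 operations. */
    static double gflops(int n, long elapsedNanos) {
        double flops = 2.0 * n * n * n;
        // Operations per nanosecond equals 10^9 operations per second.
        return flops / elapsedNanos;
    }

    /** Ratio of baseline time to accelerated time. */
    static double speedup(long baselineNanos, long acceleratedNanos) {
        return (double) baselineNanos / acceleratedNanos;
    }

    public static void main(String[] args) {
        long cpuNanos = 255_384_600L; // single-threaded CPU time from the last run
        long gpuNanos = 713_000L;     // TornadoVM GPU time from the last run
        System.out.println(String.format(Locale.ROOT, "%.2f GFlops", gflops(512, gpuNanos))); // 376.49 GFlops
        System.out.println(String.format(Locale.ROOT, "%.2fx", speedup(cpuNanos, gpuNanos))); // 358.18x
    }
}
```

Note that the failing OpenCL run also reports a high GFlops figure alongside `Verification false`. Since the kernel launch returned an error there, that timing probably does not reflect a completed computation.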
OK, thanks. We will fix this. The thread block selected for the OpenCL backend seems to be wrong when using the 566.03 driver.
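To illustrate the kind of adjustment involved, here is a rough Java sketch that shrinks a preferred local size until the number of work-items per group fits under a given limit, while keeping each local size a divisor of the global size. This is not the logic TornadoVM uses, nor the actual fix. `LocalSizeCap`, `capLocalSize`, and the limit of 256 are hypothetical. In a real OpenCL program, a per-kernel limit of this kind can be queried with `clGetKernelWorkGroupInfo` and `CL_KERNEL_WORK_GROUP_SIZE`.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not TornadoVM's scheduling code.
final class LocalSizeCap {

    static long product(long[] values) {
        long p = 1;
        for (long v : values) {
            p *= v;
        }
        return p;
    }

    /**
     * Shrinks the preferred local size until its product is at most
     * maxWorkItemsPerGroup. Each step lowers the widest dimension to the
     * next smaller divisor of the matching global size.
     */
    static long[] capLocalSize(long[] global, long[] preferred, long maxWorkItemsPerGroup) {
        long[] local = preferred.clone();
        while (product(local) > maxWorkItemsPerGroup) {
            int widest = 0;
            for (int i = 1; i < local.length; i++) {
                if (local[i] > local[widest]) {
                    widest = i;
                }
            }
            if (local[widest] == 1) {
                break; // nothing left to shrink
            }
            long next = local[widest] - 1;
            while (global[widest] % next != 0) {
                next--; // terminates at 1, which divides everything
            }
            local[widest] = next;
        }
        return local;
    }

    public static void main(String[] args) {
        long[] global = {512, 512, 1};
        long[] preferred = {32, 32, 1};
        // 256 is an assumed limit chosen only for this example.
        System.out.println(Arrays.toString(capLocalSize(global, preferred, 256))); // [16, 16, 1]
    }
}
```

With a limit of 1024 or more, the preferred 32x32 block is returned unchanged. With the assumed limit of 256, the sketch settles on the same 16x16 block that worked in the manual test above.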
This is now fixed: https://github.com/beehive-lab/TornadoVM/pull/587
Describe the bug
I get the following error when running the provided Matrix Multiplication 2D example with the OpenCL backend.
The PTX backend works fine.
How To Reproduce
Expected behavior
Computing system setup:
GPU (RTX 4080 Super)