beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0

Error running CL_OUT_OF_RESOURCES when backend is openCL with Nvidia RTX 4080 Super #586

Closed: rayman245 closed this 2 weeks ago

rayman245 commented 2 weeks ago

Describe the bug

Getting the following error when trying to run the provided Matrix Multiplication 2D example with OpenCL (screenshot attached).

PTX works fine (screenshot attached).

How To Reproduce

  1. Use the provided setup
  2. Install TornadoVM with only the OpenCL backend and JDK 21 (or GraalVM JDK 21)
  3. Run setvars.cmd
  4. Run the Matrix Multiplication 2D example

Expected behavior

Computing system setup:

stratika commented 2 weeks ago

hi @rayman245, thanks for the report. Can you please provide the command you are running? I am curious about the length of the arrays that cause the CL_OUT_OF_RESOURCES error. I will try to reproduce on my machine. Also, I guess you are using the master branch?

jjfumero commented 2 weeks ago

Hi @rayman245. Let's check the thread-blocks. We have seen similar errors depending on the driver used.

Can you run with --threadInfo for both configurations (CUDA and OpenCL)? What I want to see is the thread-block sizes that the TornadoVM runtime selected.

An alternative is to try a different thread-block:

git diff
diff --git a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
index 08cdb7918..aab913079 100644
--- a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
+++ b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
@@ -23,9 +23,12 @@ import java.util.ArrayList;
 import java.util.Random;
 import java.util.stream.IntStream;

+import uk.ac.manchester.tornado.api.GridScheduler;
 import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
 import uk.ac.manchester.tornado.api.TaskGraph;
 import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
+import uk.ac.manchester.tornado.api.WorkerGrid;
+import uk.ac.manchester.tornado.api.WorkerGrid2D;
 import uk.ac.manchester.tornado.api.annotations.Parallel;
 import uk.ac.manchester.tornado.api.enums.DataTransferMode;
 import uk.ac.manchester.tornado.api.enums.TornadoDeviceType;
@@ -101,6 +104,10 @@ public class MatrixMultiplication2D {
                 .task("t0", MatrixMultiplication2D::matrixMultiplication, matrixA, matrixB, matrixC, size) //
                 .transferToHost(DataTransferMode.EVERY_EXECUTION, matrixC);

+        WorkerGrid workerGrid = new WorkerGrid2D(size, size);
+        workerGrid.setLocalWork(16, 16, 1);
+        GridScheduler grid = new GridScheduler("s0.t0", workerGrid);
+
         ArrayList<Long> tornadoElapsedTime = new ArrayList<>();
         ArrayList<Long> javaElapsedTime = new ArrayList<>();
         ArrayList<Long> streamsElapsedTime = new ArrayList<>();
@@ -110,7 +117,7 @@ public class MatrixMultiplication2D {
         long end;
         TornadoDeviceType deviceType;
         try (TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph)) {
-            executor.withWarmUp();
+            executor.withGridScheduler(grid).withWarmUp();

             // 1. Warm up Tornado

And then run with --threadInfo:

$ tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

Computing MxM of 512x512

Task info: s0.t0
    Backend           : OPENCL
    Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [512, 512, 1]
    Local  work size  : [16, 16, 1]                     << Changed here
    Number of workgroups  : [32, 32, 1]
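For reference, the arithmetic behind the two local-work-size choices can be checked with a small standalone sketch (plain Java, no TornadoVM dependency). The 1024-thread figure as the trigger is an assumption; the actual per-kernel cap should be queried via CL_KERNEL_WORK_GROUP_SIZE on the device:

```java
// Sketch: work-group arithmetic for a 512x512 global range.
// A 32x32 local work size packs 1024 threads per work-group, which can
// exceed the per-kernel work-group size limit reported by some NVIDIA
// OpenCL drivers and trigger CL_OUT_OF_RESOURCES (-5).
public class WorkGroupMath {

    // Total threads in one work-group.
    static long threadsPerGroup(long x, long y, long z) {
        return x * y * z;
    }

    // Work-groups per dimension (ceiling division).
    static long groupsPerDim(long global, long local) {
        return (global + local - 1) / local;
    }

    public static void main(String[] args) {
        // Default pick on the failing driver: 32x32x1
        System.out.println(threadsPerGroup(32, 32, 1)); // 1024 threads/group
        System.out.println(groupsPerDim(512, 32));      // 16 groups per dimension

        // Suggested override: 16x16x1
        System.out.println(threadsPerGroup(16, 16, 1)); // 256 threads/group
        System.out.println(groupsPerDim(512, 16));      // 32 groups per dimension
    }
}
```

The numbers line up with the two --threadInfo dumps: [16, 16] work-groups for the failing 32x32 case and [32, 32, 1] for the working 16x16 case.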
rayman245 commented 2 weeks ago

> hi @rayman245, thanks for the report. Can you please provide the command you are running? I am curious about the length of the arrays that cause the CL_OUT_OF_RESOURCES error. I will try to reproduce on my machine. Also, I guess you are using the master branch?

Hi @stratika, sure, I am using the command provided in the example: tornado --threadInfo --jvm="-Ds0.t0.device=0:0" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

and yes I am using the master branch.

rayman245 commented 2 weeks ago

Hi @jjfumero, please find the results below

OpenCL

$ tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D 

 -> Returned: -5
Task info: s0.t0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
        Dims              : 2
        Global work offset: [0, 0]
        Global work size  : [512, 512]
        Local  work size  : [32, 32, 1]
        Number of workgroups  : [16, 16]

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 4080 SUPER (Device 0).

Task info: s0.t0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
        Dims              : 2
        Global work offset: [0, 0]
        Global work size  : [512, 512]
        Local  work size  : [32, 32, 1]
        Number of workgroups  : [16, 16]

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 4080 SUPER (Device 0).

 -> Returned: -5
        Single Threaded CPU Execution: 0.99 GFlops, Total time = 271104200 ns
        Streams Execution: 8.22 GFlops, Total time = 32653400 ns
        TornadoVM Execution on GPU (Accelerated): 330.34 GFlops, Total Time = 812600 ns
        Speedup: 333.6256460743293x
        Verification false

CUDA

Task info: s0.t0
        Backend           : PTX
        Device            : NVIDIA GeForce RTX 4080 SUPER GPU
        Dims              : 2
        Thread dimensions : [512, 512]
        Blocks dimensions : [16, 16, 1]
        Grids dimensions  : [32, 32, 1]

Task info: s0.t0
        Backend           : PTX
        Device            : NVIDIA GeForce RTX 4080 SUPER GPU
        Dims              : 2
        Thread dimensions : [512, 512]
        Blocks dimensions : [16, 16, 1]
        Grids dimensions  : [32, 32, 1]

        Single Threaded CPU Execution: 1.07 GFlops, Total time = 251657000 ns
        Streams Execution: 7.27 GFlops, Total time = 36937500 ns
        TornadoVM Execution on GPU (Accelerated): 290.83 GFlops, Total Time = 923000 ns
        Speedup: 272.6511375947996x
        Verification true

Different Thread Block

Task info: s0.t0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
        Dims              : 2
        Global work offset: [0, 0, 0]
        Global work size  : [512, 512, 1]
        Local  work size  : [16, 16, 1]
        Number of workgroups  : [32, 32, 1]

Task info: s0.t0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 4080 SUPER CL_DEVICE_TYPE_GPU (available)
        Dims              : 2
        Global work offset: [0, 0, 0]
        Global work size  : [512, 512, 1]
        Local  work size  : [16, 16, 1]
        Number of workgroups  : [32, 32, 1]

        Single Threaded CPU Execution: 1.05 GFlops, Total time = 255384600 ns
        Streams Execution: 8.35 GFlops, Total time = 32154200 ns
        TornadoVM Execution on GPU (Accelerated): 376.49 GFlops, Total Time = 713000 ns
        Speedup: 358.1831697054698x
        Verification true
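As a sanity check on the reported throughput, the GFlops figures above follow from the usual 2·N³ operation count for a dense N×N matrix multiply (assuming that is the formula the example uses); a quick sketch:

```java
// Sketch: GFlops from the 2*N^3 operation count for an NxN matrix multiply.
// Since the times are in nanoseconds, ops per nanosecond equals GFlops.
public class GFlopsCheck {

    static double gflops(long n, long timeNs) {
        return (2.0 * n * n * n) / timeNs;
    }

    public static void main(String[] args) {
        // OpenCL with the 16x16 thread-block: 713000 ns
        System.out.printf("%.2f%n", gflops(512, 713_000)); // ~376.49
        // PTX backend: 923000 ns
        System.out.printf("%.2f%n", gflops(512, 923_000)); // ~290.83
    }
}
```

Both values match the logs above, so the 16x16 OpenCL configuration is not only correct (Verification true) but also slightly faster than the PTX run here.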
jjfumero commented 2 weeks ago

OK, thanks. We will fix this. The thread-block selected for the OpenCL backend seems to be wrong when using the 566.03 driver.

jjfumero commented 2 weeks ago

This is now fixed: https://github.com/beehive-lab/TornadoVM/pull/587