beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0
1.16k stars 109 forks

MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

Open SirYwell opened 2 months ago

SirYwell commented 2 months ago

Describe the bug


I'm getting the following error:

[TornadoVM-OCL-JNI] ERROR : clBuildProgram -> Returned: -11
uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL. 

when trying to run the MatrixMultiplication example using Docker.

My current assumption is that this is caused by my GPU not supporting FP16 (based on my reading of the clinfo output here: https://gist.github.com/SirYwell/bdc347db5c4b5f66e2c664666fb0313f), while TornadoVM unconditionally emits the cl_khr_fp16 pragma: https://github.com/beehive-lab/TornadoVM/blob/e8faf81c570fc8be3a025361cdb5e96984267105/tornado-drivers/opencl/src/main/java/uk/ac/manchester/tornado/drivers/opencl/graal/asm/OCLAssembler.java#L77
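
For anyone triaging this hypothesis locally, the question is whether the device's extension string actually lists cl_khr_fp16. A minimal sketch of a whole-token lookup over the string that clinfo (or clGetDeviceInfo with CL_DEVICE_EXTENSIONS) reports; the helper name is illustrative, not TornadoVM API:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Check whether an OpenCL extensions string (space-separated, as reported
 * by clinfo or clGetDeviceInfo(CL_DEVICE_EXTENSIONS)) contains a given
 * extension as a whole token, so that e.g. "cl_khr_fp16" does not falsely
 * match inside a longer extension name. */
static bool has_extension(const char *extensions, const char *name) {
    size_t len = strlen(name);
    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == extensions) || (p[-1] == ' ');
        bool ends = (p[len] == '\0') || (p[len] == ' ');
        if (starts && ends)
            return true;
        p += len;
    }
    return false;
}
```

If cl_khr_fp16 is absent from the device's extension string, an unconditionally emitted `#pragma OPENCL EXTENSION cl_khr_fp16 : enable` is a plausible cause of a build failure, though the driver's build log would confirm it.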

As I haven't yet figured out how to build and run TornadoVM on NixOS directly, I can't test whether changing that line already fixes the problem.

How To Reproduce

After cloning the docker-tornado repository, I'm running ./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk

The output can be found here: https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820

Note: I made a few adjustments to make things work:

- pom.xml: I changed the version of tornado-api and tornado-matrices from 1.0.3-dev to 1.0.3.
- MatrixMultiplication.java: I changed WARMING_UP_ITERATIONS from 100 to 1 to reduce output.
- run_nvidia_openjdk.sh: I changed #!/bin/bash to #!/usr/bin/env bash to make it work on NixOS. I also added the --gpus all flag to the docker command.

Expected behavior

Expecting the example to run.

Computing system setup (please complete the following information):

Output of ./run_nvidia_openjdk.sh tornado --devices:

./run_nvidia_openjdk.sh tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 1
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 2070 SUPER
                Global Memory Size: 7.8 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

Output of ./run_nvidia_openjdk.sh tornado --version:

version=1.0.3
branch=master
commit=02cb3f7

Backends installed: 
         - opencl

Additional context

I'm using the nvidia-container-toolkit as described in https://github.com/beehive-lab/docker-tornadovm/issues/8, but I don't think it is related to this issue.


jjfumero commented 2 months ago

Ok, sorry, I jumped first to the docker-tornado repo. I will take a look and let you know.

jjfumero commented 2 months ago

My take here is that it is related to the thread-block size being used.

Task info: s0.t0
    Backend           : OPENCL
    Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [512, 512]
    Local  work size  : [32, 32, 1]
    Number of workgroups  : [16, 16]

[TornadoVM-OCL-JNI] ERROR : [JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
clEnqueueNDRangeKernel -> Returned: [JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

Tornado does not fail during compilation, but during the kernel dispatch. See related issue (thread conversation): https://github.com/beehive-lab/TornadoVM/pull/356
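
The arithmetic behind this can be sketched as a plain check of a requested local work size against the limits that `tornado --devices` printed above (max work-item sizes [1024, 1024, 64], max work-group size 1024). This is an illustrative helper, not a driver call or TornadoVM code; note that 32x32x1 = 1024 threads is legal by these limits, yet still fails inside the container:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Check a requested local work size against device limits: every
 * dimension must fit within the per-dimension maximum, and the product
 * of all dimensions must not exceed the max work-group size. */
static bool local_size_fits(const size_t local[3],
                            const size_t max_items[3],
                            size_t max_group_size) {
    size_t total = 1;
    for (int d = 0; d < 3; d++) {
        if (local[d] > max_items[d])
            return false;
        total *= local[d];
    }
    return total <= max_group_size;
}
```

Since 32x32x1 passes this static check, the CL_OUT_OF_RESOURCES at dispatch time points at a per-kernel resource limit (e.g. registers) or a driver issue rather than an outright illegal work-group size.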

If I select blocks of 16x16, then I get correct executions:

Task info: s0.t0
    Backend           : OPENCL
    Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [512, 512, 1]
    Local  work size  : [16, 16, 1]
    Number of workgroups  : [32, 32, 1]

The diff:

diff --git a/example/pom.xml b/example/pom.xml
index 30c2487..785dc2d 100644
--- a/example/pom.xml
+++ b/example/pom.xml
@@ -29,12 +29,12 @@
     <dependency>
       <groupId>tornado</groupId>
       <artifactId>tornado-api</artifactId>
-      <version>1.0.3-dev</version>
+      <version>1.0.4-dev</version>
     </dependency>
     <dependency>
       <groupId>tornado</groupId>
       <artifactId>tornado-matrices</artifactId>
-      <version>1.0.3-dev</version>
+      <version>1.0.4-dev</version>
     </dependency>
     <dependency>
       <groupId>junit</groupId>
diff --git a/example/src/main/java/example/MatrixMultiplication.java b/example/src/main/java/example/MatrixMultiplication.java
index a37b0d6..d97439e 100644
--- a/example/src/main/java/example/MatrixMultiplication.java
+++ b/example/src/main/java/example/MatrixMultiplication.java
@@ -19,6 +19,7 @@ package example;
 import java.util.Random;
 import java.util.stream.IntStream;

+import uk.ac.manchester.tornado.api.*;
 import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
 import uk.ac.manchester.tornado.api.TaskGraph;
 import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
@@ -88,7 +89,12 @@ public class MatrixMultiplication {

         ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
         TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
-        executor.withWarmUp();
+
+        WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
+        GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
+        workerGrid.setLocalWork(16, 16, 1);
+
+        executor.withGridScheduler(gridScheduler).withWarmUp();

         // 1. Warm up Tornado
         for (int i = 0; i < WARMING_UP_ITERATIONS; i++) {
diff --git a/example/target/example-1.0-SNAPSHOT.jar b/example/target/example-1.0-SNAPSHOT.jar
index df83b80..9fff7be 100644
Binary files a/example/target/example-1.0-SNAPSHOT.jar and b/example/target/example-1.0-SNAPSHOT.jar differ

Funny, because on my native system I can run with blocks of 32x32, so this is a driver issue within the Docker image.

jjfumero commented 2 months ago

I pushed the "fix" in the tornado-docker repo: https://github.com/beehive-lab/docker-tornadovm/commit/cb5f48c84c0fad345cd1fbaedfb4f68b637824f8

jjfumero commented 2 months ago

For reference:

./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication 1024
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 1024x1024
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
    Single Threaded CPU Execution: 0.92 GFlops, Total time = 2326 ms
    Streams Execution: 10.28 GFlops, Total time = 209 ms
    TornadoVM Execution on GPU (Accelerated): 1073.74 GFlops, Total Time = 2 ms
    Speedup: 1163.0x
    Verification true

I am preparing a new docker image for a new TornadoVM version coming up tomorrow ;-)

SirYwell commented 2 months ago

Thanks for looking into it so quickly. I'll try it out tomorrow with the new version and report back!

jjfumero commented 2 months ago

New docker images: https://hub.docker.com/r/beehivelab/tornadovm-nvidia-openjdk

SirYwell commented 2 months ago

Hi @jjfumero, I tried with the new docker image, but I'm still getting the same error. Running ./run_nvidia_openjdk.sh tornado --version now outputs

version=1.0.4
branch=master
commit=585574e

Backends installed: 
         - opencl

so I assume it's running the correct version.

Is there any way I can get more information out of OpenCL? Or is there any information I can provide that would help you? It's also not completely unrealistic that something is wrong due to NixOS, but my understanding of how these pieces interact isn't deep enough here. (I'm currently trying to build TornadoVM from source manually, but it fails because cmake doesn't find the jni/jawt header files, which I also don't understand, as they are present.)

jjfumero commented 2 months ago

The TornadoVM automatic installer should bring all necessary dependencies (Java, cmake, maven, etc).

I do not know if there is something specific for NixOS, but to install TornadoVM from source, can you try the following?

Assuming you want the OpenCL backend and you have the NVIDIA Driver installed for your NVIDIA GPU:

$ ./bin/tornadovm-installer --jdk jdk21 --backend=opencl

stratika commented 2 months ago

On my system (Ubuntu 23.10), I tested the script that failed for you and it works:

./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 512x512
[INFO] Loading Backend: uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider@52aa2946
TornadoGraph dependency matrix...
+----+---------------+
|  5 [data]| <none>
|----+---------------+
|  6 [data]| <none>
|----+---------------+
|  7 [data]| <none>
|----+---------------+
|  8 [data]| <none>
|----+---------------+
|  9 [data]| 10
|----+---------------+
| 10 [task]| 6 7 8
|----+---------------+
| 11 [data]| 10
|----+---------------+
| 12 [data]| 11
|----+---------------+
| 13 [data]| 11
|----+---------------+
| 14 [data]| 11
|----+---------------+

-----------------------------------
Device Table:
[0]:  [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
Constant Table:
[0]: 512
Object Table:
[0]: 0x16f7c8c1 MatrixFloat <512 x 512>
[1]: 0x573906eb MatrixFloat <512 x 512>
[2]: 0x4ebff610 MatrixFloat <512 x 512>
Task Table:
[0]: task s0.t0 - matrixMultiplication
-----------------------------------
-----------------------------------
TaskGraph:
[0]: constant 0
[1]: object 0
[2]: object 1
[3]: object 2
[4]: context device=0, [ 5 6 7 8 10 11 12 13 14 ]
[5]: persist node
[6]: copy in object 0
[7]: copy in object 1
[8]: copy in object 2
[9]: dependent write on object 2 by task 10
[10]: task=0, args=[ 6 7 8 0 ]
[11]: copy out object 2 after task 10
[12]: deallocate object 1 after 11
[13]: deallocate object 2 after 11
[14]: deallocate object 3 after 11
-----------------------------------
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable  
__kernel void matrixMultiplication(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *A, __global uchar *B, __global uchar *C, __private int size)
{
  int i_3, i_34, i_28, i_9, i_41, i_8, i_11, i_10, i_42, i_37, i_4, i_17, i_13, i_12, i_15, i_14, i_27, i_20; 
  long l_39, l_29, l_30, l_21, l_22, l_38; 
  float f_24, f_16, f_32, f_33; 
  ulong ul_19, ul_18, ul_23, ul_25, ul_26, ul_31, ul_1, ul_0, ul_35, ul_2, ul_5, ul_36, ul_7, ul_6, ul_40; 

  // BLOCK 0
  ul_0  =  (ulong) A;
  ul_1  =  (ulong) B;
  ul_2  =  (ulong) C;
  i_3  =  get_global_size(0);
  i_4  =  get_global_size(1);
  ul_5  =  ul_2 + 32L;
  ul_6  =  ul_1 + 32L;
  ul_7  =  ul_0 + 32L;
  i_8  =  get_global_id(0);
  i_9  =  get_global_id(1);
  i_10  =  _kernel_context[0];
  // BLOCK 1 MERGES [0 8 ]
  i_11  =  i_9;
  for(;i_11 < i_10;)
  {
    // BLOCK 2
    i_12  =  i_11 << 9;
    i_13  =  i_12 + 6;
    // BLOCK 3 MERGES [2 7 ]
    i_14  =  i_8;
    for(;i_14 < i_10;)
    {
      // BLOCK 4
      i_15  =  i_14 + 6;
      // BLOCK 5 MERGES [4 6 ]
      f_16  =  0.0F;
      i_17  =  0;
      for(;i_17 < i_10;)
      {
        // BLOCK 6
        ul_18  =  *((__global ulong *) ul_7);
        ul_19  =  ul_0 + ul_18;
        i_20  =  i_13 + i_17;
        l_21  =  (long) i_20;
        l_22  =  l_21 << 2;
        ul_23  =  ul_19 + l_22;
        f_24  =  *((__global float *) ul_23);
        ul_25  =  *((__global ulong *) ul_6);
        ul_26  =  ul_1 + ul_25;
        i_27  =  i_17 << 9;
        i_28  =  i_27 + i_15;
        l_29  =  (long) i_28;
        l_30  =  l_29 << 2;
        ul_31  =  ul_26 + l_30;
        f_32  =  *((__global float *) ul_31);
        f_33  =  fma(f_24, f_32, f_16);
        i_34  =  i_17 + 1;
        f_16  =  f_33;
        i_17  =  i_34;
      }  // B6
      // BLOCK 7
      ul_35  =  *((__global ulong *) ul_5);
      ul_36  =  ul_2 + ul_35;
      i_37  =  i_14 + i_13;
      l_38  =  (long) i_37;
      l_39  =  l_38 << 2;
      ul_40  =  ul_36 + l_39;
      *((__global float *) ul_40)  =  f_16;
      i_41  =  i_3 + i_14;
      i_14  =  i_41;
    }  // B7
    // BLOCK 8
    i_42  =  i_4 + i_11;
    i_11  =  i_42;
  }  // B8
  // BLOCK 9
  return;
}  //  kernel

    Single Threaded CPU Execution: 1.64 GFlops, Total time = 164 ms
    Streams Execution: 14.91 GFlops, Total time = 18 ms
    TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
    Speedup: 164.0x
    Verification true
cleanup: programs  ..........0.000256351 s
cleanup: context   ..........0.000006982 s
cleanup: total     ..........0.000263333 s
./run_nvidia_openjdk.sh tornado --version
version=1.0.4
branch=master
commit=585574e

Backends installed: 
     - opencl

I share @jjfumero's opinion: it will be easier to understand the problem on your platform if you can install TornadoVM locally and run the example without Docker. Do you have any modified source code?

SirYwell commented 2 months ago

> I do not know if there is something specific for NixOS

Sadly there is: dynamically linked binaries typically cannot be executed on NixOS without patching. That's why I tried the manual installation; I guess I have to figure out what's up with it not finding the JNI headers... I'll let you know when I get it to run.

I'm just confused that it doesn't work in the docker container, especially as it finds the device.

stratika commented 2 months ago

> I do not know if there is something specific for NixOS
>
> Sadly there is: dynamically linked binaries typically cannot be executed on NixOS without patching. That's why I tried the manual installation; I guess I have to figure out what's up with it not finding the JNI headers... I'll let you know when I get it to run.
>
> I'm just confused that it doesn't work in the docker container, especially as it finds the device.

It seems that the problem is in the driver when it compiles the kernel generated by TornadoVM. CL_BUILD_PROGRAM_FAILURE (-11) is returned when clBuildProgram fails to build the program executable. My guess would be that the OpenCL driver in the container does not communicate correctly with the underlying driver on your local system. This container is built on the nvidia/opencl image.

I would suggest you try one of the polyglot images, if possible. Those images are bigger, but I installed the NVIDIA OpenCL driver in the container image manually. For example, the tornadovm-polyglot-graalpy image. Also, this container is built from a commit prior to the latest release, but it may help to see if the driver in the container fails.

SirYwell commented 2 months ago

> I would suggest you try one of the polyglot images, if possible. Those images are bigger, but I installed the NVIDIA OpenCL driver in the container image manually. For example, the tornadovm-polyglot-graalpy image. Also, this container is built from a commit prior to the latest release, but it may help to see if the driver in the container fails.

Thanks, I gave that a try and ran ./polyglotImages/polyglot-graalpy/tornadovm-polyglot-nvidia.sh tornado --printKernel --truffle python example/polyglot-examples/mxmWithTornadoVM.py. Sadly, this seems to result in the same error as before. (PS: it looks like the docs are outdated; they reference a tornadovm-polyglot.sh file instead.)

jjfumero commented 2 months ago

Going back to the initial problem:

[TornadoVM-OCL-JNI] ERROR : clBuildProgram -> Returned: -11

This is different from the error I encountered. To debug it, I suggest two things:

1. Run TornadoVM with --debug to obtain the clBuildProgram error messages.
2. Take the kernel that TornadoVM generates and compile it on your own system. You can build a small C++ program that just loads and builds the OpenCL kernel and reports the build info. You can take this program as inspiration: https://github.com/jjfumero/scripts/tree/master/opencl/compileKernel
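
For readers decoding the JNI log lines in this thread: the numeric codes come from the standard OpenCL error-code table in cl.h. A small lookup helper covering the codes seen here (values taken from the OpenCL 1.2 headers; the function name is illustrative):

```c
#include <assert.h>
#include <string.h>

/* Map the OpenCL error codes seen in this thread (and a few related ones)
 * to their symbolic names; values are from the OpenCL 1.2 cl.h header. */
static const char *cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        default:  return "UNKNOWN_CL_ERROR";
    }
}
```

So -11 is CL_BUILD_PROGRAM_FAILURE; once a standalone host program reproduces it, the compiler's actual diagnostic can be retrieved via clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG.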

SirYwell commented 2 months ago

Thanks for the pointers. https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820 already contains the output from running with --fullDebug (but only 1 iteration instead of 100).

The second suggestion seems to work: I can compile both the example kernel and the one from the output I got from the MatrixMultiplication example. The compileKernel program crashes, but from my debugging, that only happens after kernelBin.bin is already written (after the output of https://github.com/jjfumero/scripts/blob/c8e52c3e83bb7db529ab11f9ae1d61e738792d8d/opencl/compileKernel/compileKernel.cpp#L302 )