beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0

Windows variant of Linux installer without MSys2 #356

Closed otabuzzman closed 5 months ago

otabuzzman commented 5 months ago

Description

This PR adds an installer script to simplify installation on Windows. The script is meant to work similarly to the Linux one: it downloads and compiles all repos needed to build TornadoVM. It requires standard installations of the Windows tools (Visual Studio Community 2022, CMake, Maven, and Python) as well as GraalVM unpacked somewhere in the file system.

The script is stored in bin as tornadovm-installer.cmd and provides a help option (--help). Further information is in a new section on Windows installation in the TornadoVM documentation (readthedocs).

The script downloads the forked beehive-lab repos of the SPIR-V Toolkit and the LevelZero JNI, and checks out the winstall branch of each. Repo URLs and branch names are hard-coded into the script. Both need to be changed after merging, if you decide to merge.

Repo URLs and branch names have also been hard-coded into the bin/compile script used by the Linux installer. This was done for testing purposes on Linux. The compile script therefore needs the same changes after merging.

Problem description

n/a.

Backend/s tested

Mark the backends affected by this PR.

OS tested

Mark the OS where this PR is tested.

The unit tests provided with TornadoVM have been executed on Windows 11, Windows Server 2022 and Amazon Linux 2. Details are in this Google sheet. Some notes after a rough inspection:

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

How to test the new patch?

On a Windows box:


CLAassistant commented 5 months ago

CLA assistant check
All committers have signed the CLA.

jjfumero commented 5 months ago

Thank you @otabuzzman. This is awesome! I was planning to do something like this soon, so it is very timely. Give me a few days to check on my Windows PC and try all the instructions step by step.

otabuzzman commented 5 months ago

Take your time, I'm in no hurry ;-) but glad to hear you find it useful.

When I first started, I used the cmd.exe tool. Later I realized that using Python would have been better since it is necessary to run and test TornadoVM interactively anyway.

I now think that customizing the original installer should be possible with little effort and am considering giving it a try.

Juan Fumero @.***> wrote on Tue, 19 March 2024 at 22:02:

Thank you @otabuzzman https://github.com/otabuzzman . This is awesome! I was planning to do something like this soon, so it is very timely. Give me a few days to check on my Windows PC and try all the instructions step by step.


jjfumero commented 5 months ago

I will start with the dependencies and then switch to this main repo.

jjfumero commented 5 months ago

I could make it work. However, depending on the backend, I get errors.

OpenCL:

python %TORNADO_SDK%\bin\tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

 -> Returned: -5
        Single Threaded CPU Execution: 2.63 GFlops, Total time = 102 ms
        Streams Execution: 16.78 GFlops, Total time = 16 ms
        TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
        Speedup: 102.0x
        Verification false

But the same kernel, running with SPIR-V (Level Zero) and CUDA PTX works fine:

python %TORNADO_SDK%\bin\tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

Task info: s0.t0
        Backend           : PTX
        Device            : NVIDIA GeForce RTX 3070 GPU
        Dims              : 2
        Thread dimensions : [512, 512]
        Blocks dimensions : [16, 16, 1]
        Grids dimensions  : [32, 32, 1]

        Single Threaded CPU Execution: 2.63 GFlops, Total time = 102 ms
        Streams Execution: 16.78 GFlops, Total time = 16 ms
        TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
        Speedup: 102.0x
        Verification true

python %TORNADO_SDK%\bin\tornado --threadInfo -m tornado.examples/uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D

Task info: s0.t0
        Backend           : SPIRV
        Device            : SPIRV LevelZero - Intel(R) UHD Graphics 770 GPU
        Dims              : 2
        Global work offset: [0, 0]
        Global work size  : [512, 512]
        Local  work size  : [512, 1, 1]
        Number of workgroups  : [1, 512]

        Single Threaded CPU Execution: 2.40 GFlops, Total time = 112 ms
        Streams Execution: 17.90 GFlops, Total time = 15 ms
        TornadoVM Execution on GPU (Accelerated): 22.37 GFlops, Total Time = 12 ms
        Speedup: 9.333333333333334x
        Verification true

It looks to me like a driver issue, although this test passes on Linux and OSx.

OpenCL devices:

python %TORNADO_SDK%\bin\tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967295
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967266
[TornadoVM-OCL-JNI] ERROR : clCreateContext -> Returned: -30

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
                Global Memory Size: 8.0 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
                Global Memory Size: 12.7 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
                Global Memory Size: 31.7 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [8192]
                Max WorkGroup Configuration: [8192, 8192, 8192]
                Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
        OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
                Global Memory Size: 31.7 GB
                Local Memory Size: 256.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [67108864]
                Max WorkGroup Configuration: [67108864, 67108864, 67108864]
                Device OpenCL C version: OpenCL C 1.2

[TornadoVM-OCL-JNI] ERROR : clReleaseContext -> Returned: -34

The errors seem to be related to the FPGA, which we need to access in emulation mode.
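The large values in the log above (4294967295 and 4294967266) are easier to read as signed numbers. A minimal sketch, assuming the JNI layer printed a 32-bit cl_int as unsigned in those lines; the class and method names are hypothetical, and the name table is a small subset of the status codes defined in the OpenCL headers:

```java
import java.util.Map;

// Hypothetical helper, not part of TornadoVM: decodes status values seen in the logs.
class OpenClStatus {
    // Subset of the status codes defined in the OpenCL headers (CL/cl.h).
    private static final Map<Integer, String> NAMES = Map.of(
            0, "CL_SUCCESS",
            -1, "CL_DEVICE_NOT_FOUND",
            -5, "CL_OUT_OF_RESOURCES",
            -30, "CL_INVALID_VALUE",
            -34, "CL_INVALID_CONTEXT");

    // Reinterpret a value printed as unsigned 32-bit as a signed 32-bit integer.
    static int toSigned(long printed) {
        return (int) printed;
    }

    // Signed code plus its name, if the code is in the table above.
    static String describe(long printed) {
        int code = toSigned(printed);
        return code + " (" + NAMES.getOrDefault(code, "not in this table") + ")";
    }

    public static void main(String[] args) {
        // Values copied from the log output in this thread.
        long[] seen = {4294967295L, 4294967266L, -30L, -34L, -5L};
        for (long value : seen) {
            System.out.println(value + " -> " + describe(value));
        }
    }
}
```

Under that assumption, the two clGetDeviceIDs lines read as -1 (CL_DEVICE_NOT_FOUND) and -30 (CL_INVALID_VALUE), which would be consistent with one platform failing to enumerate its devices.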

jjfumero commented 5 months ago

Strange: with OpenCL on my setup, nothing works. It looks to me like a problem with my configuration:

python %TORNADO_SDK%\bin\tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967295
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967266
[TornadoVM-OCL-JNI] ERROR : clCreateContext -> Returned: -30

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
                Global Memory Size: 8.0 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
                Global Memory Size: 12.7 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
                Global Memory Size: 31.7 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [8192]
                Max WorkGroup Configuration: [8192, 8192, 8192]
                Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
        OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
                Global Memory Size: 31.7 GB
                Local Memory Size: 256.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [67108864]
                Max WorkGroup Configuration: [67108864, 67108864, 67108864]
                Device OpenCL C version: OpenCL C 1.2

[TornadoVM-OCL-JNI] ERROR : clReleaseContext -> Returned: -34

C:\Users\jjfum\source\repos\TornadoVM>python %TORNADO_SDK%\bin\tornado-test
python C:/Users/jjfum/source/repos/TornadoVM/bin/sdk/bin/tornado  --jvm "-Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.verbose=False "  -m  tornado.unittests/uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  --params "uk.ac.manchester.tornado.unittests.foundation.TestIntegers"
WARNING: Using incubator modules: jdk.incubator.vector

[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967295
[TornadoVM-OCL-JNI] ERROR : clGetDeviceIDs -> Returned: 4294967266
[TornadoVM-OCL-JNI] ERROR : clCreateContext -> Returned: -30
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

otabuzzman commented 5 months ago

Strange behavior, indeed. Which oneAPI components are installed in your setup? In mine there is only the Intel® CPU Runtime for OpenCL™ Applications with SYCL support. To make it work, I had to apply the steps given on the webpage in the section Known Issues and Limitations.

What is that FPGA emulator? Can you switch it off?

jjfumero commented 5 months ago

In my case I installed the oneAPI Base Toolkit, which includes the FPGA emulation and other tools. I have also installed the Intel ARC GPU drivers, since from time to time I swap my 3070 for the ARC 750 for experiments, and this might be causing the problem. The thing is:

I will dig in to investigate the problem, but good to know it works for you. I will also work with Thanos to try to reproduce this on a different machine.

jjfumero commented 5 months ago

Update:


> python %TORNADO_SDK%\bin\tornado --devices

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 3070
                Global Memory Size: 8.0 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
        OPENCL --  [Intel(R) OpenCL Graphics] -- Intel(R) UHD Graphics 770
                Global Memory Size: 12.7 GB
                Local Memory Size: 64.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [512]
                Max WorkGroup Configuration: [512, 512, 512]
                Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
        OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i7-12700K
                Global Memory Size: 31.7 GB
                Local Memory Size: 32.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [8192]
                Max WorkGroup Configuration: [8192, 8192, 8192]
                Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
        OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
                Global Memory Size: 31.7 GB
                Local Memory Size: 256.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [67108864]
                Max WorkGroup Configuration: [67108864, 67108864, 67108864]
                Device OpenCL C version: OpenCL C 1.2

> python %TORNADO_SDK%\bin\tornado-test -V

Test: class uk.ac.manchester.tornado.unittests.foundation.TestIntegers
        Running test: test01                     ................  [PASS]
        Running test: test03                     ................  [PASS]
        Running test: test04                     ................  [PASS]
        Running test: test05                     ................  [PASS]
        Running test: test06                     ................  [PASS]
        Running test: test07                     ................  [PASS]
        Running test: test02                     ................  [PASS]

Test: class uk.ac.manchester.tornado.unittests.foundation.TestFloats
        Running test: testFloatsCopy             ................  [PASS]
        Running test: testVectorFloatMul         ................  [PASS]
        Running test: testVectorFloatDiv         ................  [PASS]
        Running test: testVectorFloatAdd         ................  [PASS]
        Running test: testVectorFloatSub         ................  [PASS]

Test: class uk.ac.manchester.tornado.unittests.foundation.TestDoubles
        Running test: testDoublesMul             ................  [PASS]
        Running test: testDoublesCopy            ................  [PASS]
        Running test: testDoublesAdd             ................  [PASS]
        Running test: testDoublesDiv             ................  [PASS]
        Running test: testDoublesSub             ................  [PASS]

        ...
Test: class uk.ac.manchester.tornado.unittests.compute.ComputeTests
        Running test: testNBodyBigNoWorker       ................  [PASS]
        Running test: testBlackScholes           ................  [PASS]
        Running test: testHilbert                ................  [PASS]
        Running test: testNBodySmall             ................  [PASS]
        Running test: testDFTVectorTypes         ................  [PASS]
        Running test: matrixVector               ................  [PASS]
        Running test: testDFTFloat               ................  [PASS]
        Running test: testRenderTrack            ................  [PASS]
        Running test: testDFTDouble              ................  [PASS]
        Running test: testMandelbrot             ................  [FAILED]
                \_[REASON] expected:<8> but was:<9>
        Running test: testMontecarlo             ................  [PASS]
        Running test: matrixVectorFloat4         ................  [PASS]
        Running test: testJuliaSets              ................  [FAILED]
                \_[REASON] expected:<-1000.0> but was:<1.5197569>
        Running test: testNBody                  ................  [PASS]
        Running test: testEuler                  ................  [PASS]
        ...

==================================================
              Unit tests report
==================================================

{'[PASS]': 579, '[FAILED]': 16, '[UNSUPPORTED]': 22}
Coverage [PASS/(PASS+FAIL)]: 97.31%
Coverage [PASS/(PASS+FAIL+UNSUPPORTED)]: 93.84%

==================================================
....
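The two coverage figures in the report follow directly from the counts on the summary line. A short sketch of that arithmetic, using only the numbers printed above (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper that recomputes the coverage lines of the unit test report.
class TestCoverage {
    // Percentage of passed tests, formatted with two decimals.
    static String format(int passed, int total) {
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * passed / total);
    }

    public static void main(String[] args) {
        // Counts taken from the report: PASS, FAILED, UNSUPPORTED.
        int pass = 579;
        int failed = 16;
        int unsupported = 22;

        System.out.println("PASS/(PASS+FAIL): " + format(pass, pass + failed));
        System.out.println("PASS/(PASS+FAIL+UNSUPPORTED): " + format(pass, pass + failed + unsupported));
    }
}
```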

jjfumero commented 5 months ago

Based on the previous test, I am leaning towards a misconfiguration of OpenCL on my Windows 11 machine.

jjfumero commented 5 months ago

I used the cmd.exe tool. Later I realized that using Python would have been better since it is necessary to run and test TornadoVM interactively anyway. I now think that customizing the original installer should be possible with little effort and am considering giving it a try.

Ok. My only concern is that, as it is, it kind of branches away from the style we have for Linux and OSx. To simplify the process of merging and review, my suggestion is that, for this iteration of the code, we move on with this CMD tool, and you can open a second PR with the Python migration if you want. Is this something you would like to try?

jjfumero commented 5 months ago

More updates regarding NVIDIA OpenCL support on Windows 11:

I am running out of ideas, but at least we know it is not due to the installation of oneAPI + ARC Drivers.

jjfumero commented 5 months ago

Ok, I think I got it.

So the error is printed by the driver and captured in our JNI code that dispatches OpenCL kernels:

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

This mainly suggests an issue with the block size. Since I noticed that smaller block sizes are executed correctly with OpenCL, I modified the Matrix Multiplication example in TornadoVM as follows:

// Task graph: copy the inputs on the first execution, run the kernel, copy the result back every time.
TaskGraph taskGraph = new TaskGraph("s0") //
        .transferToDevice(DataTransferMode.FIRST_EXECUTION, matrixA, matrixB) //
        .task("t0", MatrixMultiplication2D::matrixMultiplication, matrixA, matrixB, matrixC, size) //
        .transferToHost(DataTransferMode.EVERY_EXECUTION, matrixC);

ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);

// 2D grid covering the whole matrix, attached to task "s0.t0".
WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
// Force a local work size of 16x16.
workerGrid.setLocalWork(16, 16, 1);

executor.withGridScheduler(gridScheduler).withWarmUp();

Diff:

diff --git a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
index 0426e2dbb..a28ed57c6 100644
--- a/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
+++ b/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/compute/MatrixMultiplication2D.java
@@ -20,9 +20,7 @@ package uk.ac.manchester.tornado.examples.compute;
 import java.util.Random;
 import java.util.stream.IntStream;

-import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
-import uk.ac.manchester.tornado.api.TaskGraph;
-import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
+import uk.ac.manchester.tornado.api.*;
 import uk.ac.manchester.tornado.api.annotations.Parallel;
 import uk.ac.manchester.tornado.api.enums.DataTransferMode;
 import uk.ac.manchester.tornado.api.enums.TornadoDeviceType;
@@ -97,7 +95,12 @@ public class MatrixMultiplication2D {

         ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
         TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
-        executor.withWarmUp();
+
+        WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
+        GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
+        workerGrid.setLocalWork(16, 16, 1);
+
+        executor.withGridScheduler(gridScheduler).withWarmUp();

         // 1. Warm up Tornado
         for (int i = 0; i < WARMING_UP_ITERATIONS; i++) {

So I forced execution in blocks of 16x16 instead of the default 32x32, and the result is now correct.

Task info: s0.t0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
        Dims              : 2
        Global work offset: [0, 0, 0]
        Global work size  : [512, 512, 1]
        Local  work size  : [16, 16, 1]
        Number of workgroups  : [32, 32, 1]

        Single Threaded CPU Execution: 2.58 GFlops, Total time = 104 ms
        Streams Execution: 15.79 GFlops, Total time = 17 ms
        TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
        Speedup: 104.0x
        Verification true
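The work sizes in the output above can be checked by hand. The sketch below is plain arithmetic and uses no TornadoVM API; the class and method names are hypothetical. The limit of 1024 is taken from the "Total Number of Block Threads" line of the device listing for the RTX 3070. That a 32x32 work-group sitting exactly at this limit can exhaust per-work-item resources such as registers is an assumption about the cause, not something the logs confirm.

```java
// Hypothetical helper: work-group arithmetic for an NDRange launch.
class WorkGroupMath {
    // Work-items in one work-group: product of the local sizes.
    static int itemsPerGroup(int[] local) {
        int items = 1;
        for (int size : local) {
            items *= size;
        }
        return items;
    }

    // Work-groups per dimension: global size divided by local size.
    static int[] groups(int[] global, int[] local) {
        int[] result = new int[global.length];
        for (int i = 0; i < global.length; i++) {
            if (global[i] % local[i] != 0) {
                throw new IllegalArgumentException("global size must be a multiple of local size");
            }
            result[i] = global[i] / local[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] global = {512, 512, 1};
        int maxItemsPerGroup = 1024; // from the device listing in this thread

        int[][] candidates = {{32, 32, 1}, {16, 16, 1}};
        for (int[] local : candidates) {
            int[] g = groups(global, local);
            System.out.println(local[0] + "x" + local[1]
                    + ": items per group = " + itemsPerGroup(local)
                    + " (device limit " + maxItemsPerGroup + ")"
                    + ", work-groups = [" + g[0] + ", " + g[1] + ", " + g[2] + "]");
        }
    }
}
```

With a local size of 16x16 each work-group holds 256 work-items and the launch uses [32, 32, 1] work-groups, which matches the "Number of workgroups" line in the output.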

Takeaways:

otabuzzman commented 5 months ago

Totally fine. I'll try it and come back with a PR if I succeed.

Juan Fumero @.***> wrote on Fri, 22 March 2024 at 07:40:

I used the cmd.exe tool. Later I realized that using Python would have been better since it is necessary to run and test TornadoVM interactively anyway. I now think that customizing the original installer should be possible with little effort and am considering giving it a try.

Ok. My only concern is that, as it is, it kind of branches away from the style we have for Linux and OSx. To simplify the process of merging and review, my suggestion is that, for this iteration of the code, we move on with this CMD tool, and you can open a second PR with the Python migration if you want. Is this something you would like to try?


jjfumero commented 5 months ago

I will merge this. Awesome work, @otabuzzman. Thank you!