beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0
1.16k stars 109 forks source link

[fix] Problem with FPGA execution for multiple tasks and the default scheduler #416

Closed stratika closed 2 months ago

stratika commented 2 months ago

Description

This PR provides a fix for the issue described in #401.

Note: This PR is tested on Intel Emulation mode. I do not have access to Xilinx FPGA to test it.

Problem description

There are two identified problems:

  1. In the OCLCodeCache class we have a method that checks if force compilation has been triggered, and the FPGA compilers for Intel are triggered to compile only if the check is true. This seems to have been an old check that we had from the time we had the lookupbuffer kernel, and we were waiting till all LAUNCH bytecodes that corresponds to all task indices (all tasks within a TaskGraph) are issued, in order to trigger the forceCompilation() method from the TornadoVM class. See here.
  2. The executor.withDefaultScheduler() configuration in the ExecutionPlan seems to break the execution and results in OpenCL error (CL_INVALID_WORK_GROUP_SIZE) when the clEnqueueNDRangeKernel function is invoked.

To fix the first problem, I removed the shouldCompile check that existed in OCLCodeCache. To my understanding this is an old check, and it is not required since we deprecated the lookupbuffer kernel.

To fix the second problem, I performed a short refactoring in the OCLKernelScheduler (i.e., an abstract class) and the OCLFPGAScheduler which extends the abstract class, to assess the default scheduling local work group for FPGAs when the executor.withDefaultScheduler() is enabled in a TornadoExecutionPlan.

This change made me think of testing also to run the BlurFilter example with a WorkerGrid, and applied a small update in the OCLGridInfo to check the default FPGA local work group.

Backend/s tested

Mark the backends affected by this PR.

OS tested

Mark the OS where this PR is tested.

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

How to test the new patch?

Then, you can run, as described also in the issue #401:

rm -rf fpga-source-comp
tornado --debug --threadInfo --jvm="-Dblur.red.device=0:3 -Dblur.green.device=0:3 -Dblur.blue.device=0:3 -Dtornado.recover.bailout=False" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.BlurFilter

Output:

WARNING: Using incubator modules: jdk.incubator.vector
[DEBUG] JIT compilation for the FPGA
Task info: blur.red
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [448, 640]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.green
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [448, 640]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.blue
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [448, 640]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Parallel Total time: 
    ns = 2340610196
    seconds = 2.340610196
[DEBUG] JIT compilation for the FPGA
Task info: blur.red
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [448, 640]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.green
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [448, 640]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.blue
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [448, 640]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : [7, 640]

[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Parallel Total time: 
    ns = 156405899
    seconds = 0.15640589900000001

You can download, apply the patch and build TornadoVM:

git apply fpga_gridscheduler.patch
make

and then run the same example:

rm -rf fpga-source-comp
tornado --debug --threadInfo --jvm="-Dblur.red.device=0:3 -Dblur.green.device=0:3 -Dblur.blue.device=0:3 -Dtornado.recover.bailout=False" -m tornado.examples/uk.ac.manchester.tornado.examples.compute.BlurFilter

Output:

WARNING: Using incubator modules: jdk.incubator.vector
[DEBUG] JIT compilation for the FPGA
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
[TornadoVM OCL] Warning: TornadoVM changed the user-defined local size to: [64, 1, 1].
Task info: blur.red
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [448, 640, 1]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : null

Task info: blur.green
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [448, 640, 1]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : null

Task info: blur.blue
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [448, 640, 1]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : null

Parallel Total time: 
    ns = 2347304227
    seconds = 2.347304227
[DEBUG] JIT compilation for the FPGA
[TornadoVM OCL] Warning: TornadoVM changed the user-defined local size to: [64, 1, 1].
Task info: blur.red
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [448, 640, 1]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : null

Task info: blur.green
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [448, 640, 1]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : null

Task info: blur.blue
    Backend           : OPENCL
    Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
    Dims              : 2
    Global work offset: [0, 0, 0]
    Global work size  : [448, 640, 1]
    Local  work size  : [64, 1, 1]
    Number of workgroups  : null

Parallel Total time: 
    ns = 161529220
    seconds = 0.16152922

You can download, apply the patch and build TornadoVM:

git apply fpga_multiple_tasks.patch
make

and then run the same example:

rm -rf fpga-source-comp
tornado --threadInfo --jvm="-Dexample.foo.device=0:3 -Dexample.bar.device=0:3" -m tornado.examples/uk.ac.manchester.tornado.examples.MultipleTasks

jjfumero commented 2 months ago

I could reproduce the fix with my configuration. Thanks @stratika.