This PR provides a fix for the issue described in #401.
Note: This PR is tested on Intel Emulation mode. I do not have access to Xilinx FPGA to test it.
Problem description
There are two identified problems:
In the OCLCodeCache class we have a method that checks if force compilation has been triggered, and the FPGA compilers for Intel are triggered to compile only if the check is true. This seems to have been an old check that we had from the time we had the lookupbuffer kernel, and we were waiting till all LAUNCH bytecodes that corresponds to all task indices (all tasks within a TaskGraph) are issued, in order to trigger the forceCompilation() method from the TornadoVM class. See here.
The executor.withDefaultScheduler() configuration in the ExecutionPlan seems to break the execution and results in OpenCL error (CL_INVALID_WORK_GROUP_SIZE) when the clEnqueueNDRangeKernel function is invoked.
To fix the first problem, I removed the shouldCompile check that existed in OCLCodeCache. To my understanding this is an old check, and it is not required since we deprecated the lookupbuffer kernel.
To fix the second problem, I performed a short refactoring in the OCLKernelScheduler (i.e., an abstract class) and the OCLFPGAScheduler which extends the abstract class, to assess the default scheduling local work group for FPGAs when the executor.withDefaultScheduler() is enabled in a TornadoExecutionPlan.
This change made me think of testing also to run the BlurFilter example with a WorkerGrid, and applied a small update in the OCLGridInfo to check the default FPGA local work group.
Backend/s tested
Mark the backends affected by this PR.
[x] OpenCL
[ ] PTX
[ ] SPIRV
OS tested
Mark the OS where this PR is tested.
[x] Linux
[ ] OSx
[ ] Windows
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
[x] Yes
[ ] No
How to test the new patch?
First you need to have an image in the /tmp directory named (image.jpg). You can use this one:
Then, you can run, as described also in the issue #401:
WARNING: Using incubator modules: jdk.incubator.vector
[DEBUG] JIT compilation for the FPGA
Task info: blur.red
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [448, 640]
Local work size : [64, 1, 1]
Number of workgroups : [7, 640]
[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.green
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [448, 640]
Local work size : [64, 1, 1]
Number of workgroups : [7, 640]
[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.blue
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [448, 640]
Local work size : [64, 1, 1]
Number of workgroups : [7, 640]
[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Parallel Total time:
ns = 2340610196
seconds = 2.340610196
[DEBUG] JIT compilation for the FPGA
Task info: blur.red
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [448, 640]
Local work size : [64, 1, 1]
Number of workgroups : [7, 640]
[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.green
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [448, 640]
Local work size : [64, 1, 1]
Number of workgroups : [7, 640]
[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Task info: blur.blue
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [448, 640]
Local work size : [64, 1, 1]
Number of workgroups : [7, 640]
[TornadoVM OCL] Warning: TornadoVM uses as default local work group size for FPGAs: [64, 1, 1].
Parallel Total time:
ns = 156405899
seconds = 0.15640589900000001
To test with GridScheduler and FPGAs, I have this patch that uses a grid with a local work group (128, 1, 1) that is different from the default one.
You can download, apply the patch and build TornadoVM:
WARNING: Using incubator modules: jdk.incubator.vector
[DEBUG] JIT compilation for the FPGA
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
[TornadoVM OCL] Warning: TornadoVM changed the user-defined local size to: [64, 1, 1].
Task info: blur.red
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [448, 640, 1]
Local work size : [64, 1, 1]
Number of workgroups : null
Task info: blur.green
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [448, 640, 1]
Local work size : [64, 1, 1]
Number of workgroups : null
Task info: blur.blue
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [448, 640, 1]
Local work size : [64, 1, 1]
Number of workgroups : null
Parallel Total time:
ns = 2347304227
seconds = 2.347304227
[DEBUG] JIT compilation for the FPGA
[TornadoVM OCL] Warning: TornadoVM changed the user-defined local size to: [64, 1, 1].
Task info: blur.red
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [448, 640, 1]
Local work size : [64, 1, 1]
Number of workgroups : null
Task info: blur.green
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [448, 640, 1]
Local work size : [64, 1, 1]
Number of workgroups : null
Task info: blur.blue
Backend : OPENCL
Device : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
Dims : 2
Global work offset: [0, 0, 0]
Global work size : [448, 640, 1]
Local work size : [64, 1, 1]
Number of workgroups : null
Parallel Total time:
ns = 161529220
seconds = 0.16152922
To test with multiple tasks that will run on the FPGA, I have this patch that updates the MultipleTasks example that runs two kernels on the FPGA.
You can download, apply the patch and build TornadoVM:
Description
This PR provides a fix for the issue described in #401.
Note: This PR is tested on Intel Emulation mode. I do not have access to Xilinx FPGA to test it.
Problem description
There are two identified problems:
OCLCodeCache
class we have a method that checks if force compilation has been triggered, and the FPGA compilers for Intel are triggered to compile only if the check is true. This seems to have been an old check that we had from the time we had thelookupbuffer
kernel, and we were waiting till allLAUNCH
bytecodes that corresponds to all task indices (all tasks within aTaskGraph
) are issued, in order to trigger theforceCompilation()
method from theTornadoVM
class. See here.executor.withDefaultScheduler()
configuration in theExecutionPlan
seems to break the execution and results in OpenCL error (CL_INVALID_WORK_GROUP_SIZE
) when theclEnqueueNDRangeKernel
function is invoked.To fix the first problem, I removed the
shouldCompile
check that existed inOCLCodeCache
. To my understanding this is an old check, and it is not required since we deprecated thelookupbuffer
kernel.To fix the second problem, I performed a short refactoring in the
OCLKernelScheduler
(i.e., an abstract class) and theOCLFPGAScheduler
which extends the abstract class, to assess the default scheduling local work group for FPGAs when theexecutor.withDefaultScheduler()
is enabled in aTornadoExecutionPlan
.This change made me think of testing also to run the BlurFilter example with a WorkerGrid, and applied a small update in the
OCLGridInfo
to check the default FPGA local work group.Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?
Then, you can run, as described also in the issue #401:
Output:
You can download, apply the patch and build TornadoVM:
and then run the same example:
Output:
MultipleTasks
example that runs two kernels on the FPGA.You can download, apply the patch and build TornadoVM:
and then run the same example: