[OCL] CPU Block Scheduler disabled by default and option to switch between thread schedulers

Description

This PR enhances the thread-block scheduler used when running on CPUs. The current default scheduler (the CPU block scheduler) assigns one thread per CPU core. This is not always the best strategy, since it depends on the OpenCL runtime/driver implementation. This patch adds a switch between the CPU block scheduler and the fine-grained scheduler (-Dtornado.scheduler.block=True or False), and makes the fine-grained scheduler the default on CPUs.
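As a rough illustration of the difference between the two schedulers, here is a simplified model in Python. It is not TornadoVM's actual implementation: the function names and the chunking formula are assumptions made for this sketch. The block scheduler is modelled as one work-item per CPU thread, each looping over a contiguous chunk of the iteration space, while the fine-grained scheduler is modelled as one work-item per loop iteration.

```python
def block_schedule(num_iterations, num_threads):
    """Model of a block scheduler: one work-item per CPU thread.

    Returns one (start, end) half-open range per thread.
    The ceil-division chunking is an assumption of this sketch.
    """
    chunk = (num_iterations + num_threads - 1) // num_threads  # ceil division
    ranges = []
    for tid in range(num_threads):
        start = min(tid * chunk, num_iterations)
        end = min(start + chunk, num_iterations)
        ranges.append((start, end))
    return ranges


def fine_grained_global_size(loop_bounds):
    """Model of a fine-grained scheduler: one work-item per iteration,
    so the global work size simply mirrors the loop bounds."""
    return list(loop_bounds)


if __name__ == "__main__":
    # 10 iterations split across 4 threads
    print(block_schedule(10, 4))
    # A 2D loop nest maps directly to a 2D global work size
    print(fine_grained_global_size((3888, 5184)))
```

In this model the block scheduler hands the OpenCL runtime only as many work-items as there are threads, so how well the work is distributed depends entirely on how the runtime maps those few work-items to cores.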

Problem description

The main problem is performance. For example, when running on the CPU with the PoCL OpenCL implementation, a run takes ~46 seconds on average to complete, while with the Intel oneAPI runtime it takes ~5 seconds.

If we use the fine-grained ("iteration") scheduler instead of the CPU block scheduler, TornadoVM with PoCL runs in ~2.9-3.2 seconds per iteration.

Here are traces with PoCL, first with the block scheduler and then without it. The application is taken from the TornadoVM-Examples repository: https://github.com/jjfumero/tornadovm-examples

$ tornado --jvm="-Dtornado.scheduler.block=True" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:3 

WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
    Backend           : OPENCL
    Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [20, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.green
    Backend           : OPENCL
    Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [20, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.blue
    Backend           : OPENCL
    Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [20, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 47729662509 -- seconds = 47.729662509

Using the fine-grained scheduler:

tornado --jvm="-Dtornado.scheduler.block=False" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:3 
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
    Backend           : OPENCL
    Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [3888, 5184]
    Local  work size  : [54, 64, 1]
    Number of workgroups  : [72, 81]

Task info: blur.green
    Backend           : OPENCL
    Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [3888, 5184]
    Local  work size  : [54, 64, 1]
    Number of workgroups  : [72, 81]

Task info: blur.blue
    Backend           : OPENCL
    Device            : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [3888, 5184]
    Local  work size  : [54, 64, 1]
    Number of workgroups  : [72, 81]

TornadoVM Total Time (ns) = 3513510495 -- seconds = 3.5135104950000002

For this run on the CPU with PoCL, the fine-grained scheduler is about 13.6x faster (47.7 s vs 3.5 s) than the previous default (block) scheduler in TornadoVM.
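The "Number of workgroups" values in the fine-grained trace above follow from dividing the global work size by the local work size in each dimension. A small sketch to check this; the helper name is hypothetical and not part of TornadoVM. When the local work size is reported as null, the sketch returns None rather than guessing, on the assumption that the choice is left to the OpenCL driver in that case.

```python
def num_workgroups(global_size, local_size):
    """Workgroups per dimension = global work size / local work size.

    Returns None when no local work size was set (reported as null).
    """
    if local_size is None:
        return None
    counts = []
    # zip stops at the shorter list, so a trailing local dimension of 1 is ignored
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError(f"global size {g} is not a multiple of local size {l}")
        counts.append(g // l)
    return counts


if __name__ == "__main__":
    # Values from the PoCL fine-grained trace
    print(num_workgroups([3888, 5184], [54, 64, 1]))  # [72, 81]
    # Block-scheduler trace: local work size is null
    print(num_workgroups([20, 1], None))  # None
```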

We also see speedups when using the Intel oneAPI OpenCL runtime instead of PoCL:

With Block Scheduler:

tornado --jvm="-Dtornado.scheduler.block=True" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2 
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
    Backend           : OPENCL
    Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [20, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.green
    Backend           : OPENCL
    Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [20, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.blue
    Backend           : OPENCL
    Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [20, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 5649483400 -- seconds = 5.6494834

With fine-grained scheduler:

tornado --jvm="-Dtornado.scheduler.block=False" --threadInfo  -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2 
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
    Backend           : OPENCL
    Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [3888, 5184]
    Local  work size  : [81, 81, 1]
    Number of workgroups  : [48, 64]

Task info: blur.green
    Backend           : OPENCL
    Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [3888, 5184]
    Local  work size  : [81, 81, 1]
    Number of workgroups  : [48, 64]

Task info: blur.blue
    Backend           : OPENCL
    Device            : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [3888, 5184]
    Local  work size  : [81, 81, 1]
    Number of workgroups  : [48, 64]

TornadoVM Total Time (ns) = 2347847562 -- seconds = 2.347847562

Speedup for the first iteration is ~2.4x (5.65 s vs 2.35 s).
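The speedups quoted above can be recomputed from the "TornadoVM Total Time (ns)" lines of the four traces. A minimal sketch follows; the dictionary layout and function name are just for this example. Each figure comes from a single run, so the results are indicative rather than a rigorous benchmark.

```python
# Total times in nanoseconds, copied from the four traces above
TIMES_NS = {
    "PoCL": {"block": 47_729_662_509, "fine": 3_513_510_495},
    "oneAPI": {"block": 5_649_483_400, "fine": 2_347_847_562},
}


def speedup(block_ns, fine_ns):
    """Speedup of the fine-grained scheduler over the block scheduler."""
    return block_ns / fine_ns


if __name__ == "__main__":
    for runtime, t in TIMES_NS.items():
        print(f"{runtime}: {speedup(t['block'], t['fine']):.2f}x")
```

This gives roughly 13.6x for PoCL and 2.4x for Intel oneAPI.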

Performance Gains

The graph below compares, using PoCL, the version in develop (which uses the block scheduler by default) against this branch. The baseline is the block scheduler: a value higher than 1 means the iteration (fine-grained) scheduler is faster.

Fine-Grained Iteration vs Block Iteration (Speedups)

Interactive graph:

https://docs.google.com/spreadsheets/d/e/2PACX-1vRJiSmP8Hewlbkcm6jyfagSd0u7_X06NF8eiWNCmjpLfLd6np6uA0qO3QIhlIopg8CZ0u1bVdm__XSG/pubchart?oid=1645903604&format=interactive

Speedups range from 5% to 13x on average.

Backend/s tested

Mark the backends affected by this PR.

[X] OpenCL
[ ] PTX
[ ] SPIRV

OS tested

Mark the OS where this PR is tested.

[X] Linux
[ ] OSx
[ ] Windows

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

[ ] Yes
[X] No

How to test the new patch?

make
