This PR improves the thread-block scheduler used when running on CPUs. The default thread scheduler assigns one CPU thread per CPU core. This might not be the best strategy, and it really depends on the OpenCL runtime/driver implementation. This patch provides a switch between the CPU block scheduler and the fine-grained scheduler, and it makes the fine-grained scheduler the default for CPUs.
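To make the difference between the two strategies concrete, here is a minimal, plain-Java sketch of how iterations could be mapped to workers under each scheduler. It is an illustration only, not TornadoVM's actual implementation: the class `SchedulerSketch`, its methods, and the chunking formula are hypothetical. The only name taken from this PR is the `tornado.scheduler.block` property used in the commands in this description.

```java
import java.util.Arrays;

// Hypothetical illustration; not TornadoVM code.
public class SchedulerSketch {

    // Block scheduling: one worker per hardware thread, and each worker
    // processes a contiguous chunk of the iteration space.
    static int[] blockSchedule(int iterations, int threads) {
        int[] owner = new int[iterations];
        int chunk = (iterations + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            int start = t * chunk;
            int end = Math.min(start + chunk, iterations);
            for (int i = start; i < end; i++) {
                owner[i] = t; // iteration i is executed by worker t
            }
        }
        return owner;
    }

    // Fine-grained scheduling: one work-item per iteration; the OpenCL
    // runtime decides how work-items are grouped and placed on cores.
    static int[] fineGrainedSchedule(int iterations) {
        int[] owner = new int[iterations];
        for (int i = 0; i < iterations; i++) {
            owner[i] = i; // iteration i is its own work-item
        }
        return owner;
    }

    // Reads the switch as a JVM system property. Boolean.parseBoolean is
    // case-insensitive, so "True" and "true" are both accepted.
    static boolean useBlockScheduler() {
        return Boolean.parseBoolean(System.getProperty("tornado.scheduler.block", "False"));
    }

    public static void main(String[] args) {
        int iterations = 10;
        int threads = 4;
        int[] owner = useBlockScheduler()
                ? blockSchedule(iterations, threads)
                : fineGrainedSchedule(iterations);
        System.out.println(Arrays.toString(owner));
    }
}
```

Running it with `-Dtornado.scheduler.block=True` prints the worker that owns each iteration; without the property it prints one work-item index per iteration.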
Problem description
The main problem is performance. For example, when running on the CPU using the PoCL OpenCL implementation, the application takes ~46 seconds on average to complete, while the Intel oneAPI runtime takes ~5 seconds.
If we use the fine-grained ("iteration") scheduler instead of the block scheduler for the CPU, TornadoVM with PoCL runs in ~2.9-3.2 s per iteration.
With the block scheduler:
$ tornado --jvm="-Dtornado.scheduler.block=True" --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:3
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
Backend : OPENCL
Device : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
Task info: blur.green
Backend : OPENCL
Device : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
Task info: blur.blue
Backend : OPENCL
Device : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
TornadoVM Total Time (ns) = 47729662509 -- seconds = 47.729662509
Using the fine-grained scheduler:
$ tornado --jvm="-Dtornado.scheduler.block=False" --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:3
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
Backend : OPENCL
Device : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [3888, 5184]
Local work size : [54, 64, 1]
Number of workgroups : [72, 81]
Task info: blur.green
Backend : OPENCL
Device : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [3888, 5184]
Local work size : [54, 64, 1]
Number of workgroups : [72, 81]
Task info: blur.blue
Backend : OPENCL
Device : cpu-haswell-12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [3888, 5184]
Local work size : [54, 64, 1]
Number of workgroups : [72, 81]
TornadoVM Total Time (ns) = 3513510495 -- seconds = 3.5135104950000002
The speedup on the CPU with PoCL is about 13x (47.7 s vs. 3.5 s) compared to the previous default (block) scheduler in TornadoVM.
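The work-group counts reported in the fine-grained trace above follow from dividing the global work size by the local work size in each dimension. The sketch below reproduces that arithmetic for the PoCL run; `WorkGroupCount` and `workGroups` are hypothetical names for illustration, and the sketch assumes the global size is an exact multiple of the local size, as it is in these traces.

```java
import java.util.Arrays;

// Hypothetical helper; it only reproduces the arithmetic behind the trace.
public class WorkGroupCount {

    // Number of work-groups per dimension = global size / local size.
    // Only the dimensions present in the global size are considered.
    static long[] workGroups(long[] global, long[] local) {
        long[] groups = new long[global.length];
        for (int d = 0; d < global.length; d++) {
            if (global[d] % local[d] != 0) {
                throw new IllegalArgumentException("global size must be a multiple of local size");
            }
            groups[d] = global[d] / local[d];
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] global = {3888, 5184};  // Global work size from the trace
        long[] local = {54, 64, 1};    // Local work size chosen for the PoCL run
        System.out.println(Arrays.toString(workGroups(global, local))); // prints [72, 81]
    }
}
```

By contrast, the block-scheduler trace launches a global work size of only [20, 1], so each of those 20 work-items has to cover a large block of the 3888 x 5184 iteration space on its own.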
We also see speedups with the Intel oneAPI OpenCL runtime instead of PoCL:
With the block scheduler:
$ tornado --jvm="-Dtornado.scheduler.block=True" --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
Task info: blur.green
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
Task info: blur.blue
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [20, 1]
Local work size : null
Number of workgroups : [0, 0]
TornadoVM Total Time (ns) = 5649483400 -- seconds = 5.6494834
With the fine-grained scheduler:
$ tornado --jvm="-Dtornado.scheduler.block=False" --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [3888, 5184]
Local work size : [81, 81, 1]
Number of workgroups : [48, 64]
Task info: blur.green
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [3888, 5184]
Local work size : [81, 81, 1]
Number of workgroups : [48, 64]
Task info: blur.blue
Backend : OPENCL
Device : 12th Gen Intel(R) Core(TM) i7-12700K CL_DEVICE_TYPE_CPU (available)
Dims : 2
Global work offset: [0, 0]
Global work size : [3888, 5184]
Local work size : [81, 81, 1]
Number of workgroups : [48, 64]
TornadoVM Total Time (ns) = 2347847562 -- seconds = 2.347847562
The speedup over the block scheduler for this run is 2.4x (5.65 s vs. 2.35 s).
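The speedup figures quoted in this description can be recomputed directly from the `TornadoVM Total Time (ns)` values in the traces. The following sketch does that; `SpeedupCalc` and `speedup` are hypothetical names, and the four constants are copied from the traces above.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the speedups from the trace timings.
public class SpeedupCalc {

    // Speedup = baseline time / improved time.
    static double speedup(long baselineNs, long improvedNs) {
        return (double) baselineNs / improvedNs;
    }

    public static void main(String[] args) {
        // Total times in nanoseconds, taken from the traces.
        long poclBlock = 47729662509L;
        long poclFine = 3513510495L;
        long oneApiBlock = 5649483400L;
        long oneApiFine = 2347847562L;

        System.out.println(String.format(Locale.ROOT, "PoCL speedup:   %.2fx", speedup(poclBlock, poclFine)));
        System.out.println(String.format(Locale.ROOT, "oneAPI speedup: %.2fx", speedup(oneApiBlock, oneApiFine)));
    }
}
```

These are single-run measurements, so the ratios should be read as indicative rather than as averaged benchmark results.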
Performance Gains
The performance graph compares, using PoCL, the version in develop (block scheduler as default) against this branch. The baseline is the block scheduler; a value higher than 1 means the fine-grained (iteration) scheduler is faster.
The traces above, with and without the block scheduler, were collected with the BlurFilter application from the TornadoVM-Examples repository: https://github.com/jjfumero/tornadovm-examples
Interactive graph:
https://docs.google.com/spreadsheets/d/e/2PACX-1vRJiSmP8Hewlbkcm6jyfagSd0u7_X06NF8eiWNCmjpLfLd6np6uA0qO3QIhlIopg8CZ0u1bVdm__XSG/pubchart?oid=1645903604&format=interactive
Speedups range from 5% to 13x on average.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?