beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0

[jit][cpu] Disable loop-interchange for CPU offloading #405

Closed jjfumero closed 2 months ago

jjfumero commented 2 months ago

Description

This PR disables the loop interchange when the code is specialised for the CPU multi-core.

When running on an Intel CPU i9-10885H with the OpenCL CPU Runtime 2024.17.3.0.08_160000 from oneAPI, the time taken to compute the Blur Filter drops from ~57 seconds in the first iteration down to ~7 seconds (a speedup of 7.9x).

Execution trace in develop: https://github.com/beehive-lab/TornadoVM/commit/fcdebe513dac9a06b569669f8322e72ed1725a55

tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
    Backend           : OPENCL
    Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [16, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.green
    Backend           : OPENCL
    Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [16, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.blue
    Backend           : OPENCL
    Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [16, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 57798473152 -- seconds = 57.79847315200001

Execution trace with this feature:

tornado --threadInfo -cp target/tornadovm-examples-1.0-SNAPSHOT.jar io.github.jjfumero.BlurFilter tornado device=0:2
WARNING: Using incubator modules: jdk.incubator.vector
Task info: blur.red
    Backend           : OPENCL
    Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [16, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.green
    Backend           : OPENCL
    Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [16, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

Task info: blur.blue
    Backend           : OPENCL
    Device            : Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz CL_DEVICE_TYPE_CPU (available)
    Dims              : 2
    Global work offset: [0, 0]
    Global work size  : [16, 1]
    Local  work size  : null
    Number of workgroups  : [0, 0]

TornadoVM Total Time (ns) = 7243037295 -- seconds = 7.243037295000001

Problem description

The TornadoVM JIT compiler specialises the thread-block when compiling and deploying the application for multi-core CPUs. By default, the TornadoVM JIT compiler transforms 2D kernels into 1D kernels. The problem is that, for 2D kernels, applying loop interchange can end up running slower than expected because the inner loop has more work to do per thread. This PR disables loop interchange for this case.
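To illustrate the trade-off described above, here is a plain-Java sketch (not TornadoVM code; the class and loop bounds are made up for illustration) of loop interchange on a 2D kernel. Both loop orders compute the same result; what changes is which dimension would be parallelised, and therefore how much serial work each thread does in the inner loop.

```java
// Hypothetical sketch of loop interchange on a 2D kernel.
// If the outer loop is the one parallelised across a small number of CPU
// threads, interchanging the loops can move the large dimension into the
// serial inner loop, giving each thread more sequential work per iteration.
public class LoopInterchangeSketch {
    static final int ROWS = 4;      // small dimension (e.g. one per CPU core)
    static final int COLS = 1024;   // large dimension

    // Original order: rows outside (candidate for parallelisation),
    // columns inside (serial per thread).
    static long original(int[][] a) {
        long sum = 0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];
        return sum;
    }

    // Interchanged order: the result is identical, but the loop that would
    // be parallelised now iterates over the other dimension.
    static long interchanged(int[][] a) {
        long sum = 0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] a = new int[ROWS][COLS];
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = i + j;
        // Both orders compute the same reduction; only the schedule differs.
        System.out.println(original(a) == interchanged(a));
    }
}
```

The interchange itself is semantics-preserving here; the performance difference comes entirely from which loop ends up parallelised on the CPU.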

Backend/s tested

Mark the backends affected by this PR.

OS tested

Mark the OS where this PR is tested.

Did you check on FPGAs?

If it is applicable, check your changes on FPGAs.

How to test the new patch?

make
make tests
mairooni commented 2 months ago

As far as I understand, the loop interchange is always disabled when running on the CPU. But is this only beneficial when the inner loop has more work than the outer loop? If so, does the performance deteriorate when this is not the case?

jjfumero commented 2 months ago

The performance decreases, yes. I think we should only enable this optimization explicitly on demand. This is because, for CPUs, TornadoVM selects, by default, a 1D schedule with the number of threads equal to the number of visible CPU cores.
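The default CPU schedule described above (a 1D launch with one thread per visible core, each taking a contiguous chunk of the iteration space) can be sketched in plain Java; this is an illustrative sketch, not the TornadoVM runtime, and the chunking strategy shown is an assumption:

```java
// Plain-Java sketch of a 1D CPU schedule: one worker thread per visible
// core, each summing a contiguous chunk of the iteration space.
public class CpuScheduleSketch {
    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        int size = 1 << 20;
        long[] partial = new long[threads];
        Thread[] pool = new Thread[threads];
        int chunk = (size + threads - 1) / threads;  // ceil(size / threads)
        for (int t = 0; t < threads; t++) {
            final int id = t;
            final int lo = id * chunk;
            final int hi = Math.min(size, lo + chunk);
            pool[t] = new Thread(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += i;  // 1D "kernel" body
                partial[id] = s;
            });
            pool[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            pool[t].join();
            total += partial[t];
        }
        // Check against the closed form: sum of 0..size-1 = size*(size-1)/2.
        System.out.println(total == (long) size * (size - 1) / 2);
    }
}
```

With this schedule there is no 2D thread grid to begin with, which is why interchanging loops buys nothing by default and can hurt.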

mairooni commented 2 months ago

> The performance decreases, yes. I think we should only enable this optimization explicitly on demand. This is because, for CPUs, TornadoVM selects, by default, a 1D schedule with the number of threads equal to the number of visible CPU cores.

In this case, should we add a flag to allow developers to enable it if suitable?

jjfumero commented 2 months ago

The flag already exists. In this PR it is -Dtornado.loops.reverse=False, but it has been refactored in other PRs.