CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

myocyte benchmark from HeCBench significantly slower with chipStar than with SYCL [LevelZero backend] #599

Open franz opened 1 year ago

franz commented 1 year ago

Without immediate queues, chipStar is ~100x slower, with immediate queues it is ~10x slower. My initial examination seems to point to many (possibly unnecessary) barrier commands, but anyway this needs to be investigated.

pvelesko commented 1 year ago

You should run it through iprof to see if the kernels themselves aren't extra slow. Are atomics used?

pjaaskel commented 1 year ago

There's something fishy with the immediate command lists (ICL) and the barriers. When I enable ICL, GAMESS fails and floods the log with errors like these:

CHIP error [TID 32150] [1692974685.965693710] : hipErrorTbd (ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:1351:enqueueBarrierImpl

CHIP error [TID 32150] [1692974685.975379722] : Caught Error: hipErrorTbd
CHIP error [TID 32150] [1692974685.977357058] : hipErrorTbd (ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:1351:enqueueBarrierImpl

This could be related to the reported issue here (suspected excessive barrier usage).

pvelesko commented 1 year ago

What GPU is being used?

pjaaskel commented 1 year ago

In my case, the iGPU; in Michal's, a PVC.

pjaaskel commented 1 year ago

I opened a separate issue (#612) for my still-occurring problem described above.

franz commented 11 months ago

Removing the barriers (and using event dependencies instead) significantly reduced the difference, to ~4x slower. There was also a kernel-side problem: SYCL was using fast-math by default, and the kernels call pow/exp a lot, so SYCL was using native_pow / native_exp. Recompiling the SYCL version without fast-math brought the difference down to 1.3x-1.4x.
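
To illustrate why replacing barriers with event dependencies can shrink the gap, here is a minimal toy model in C++. It is only a sketch under stated assumptions: the `Command` struct, the function names, and the example workload are hypothetical, execution resources are assumed unlimited, and nothing here reflects chipStar's actual Level Zero backend code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy command: a duration plus the indices of the earlier
// commands it truly depends on.
struct Command {
  double duration;
  std::vector<std::size_t> deps;
};

// A barrier after every command makes each command wait for all earlier
// ones, so the whole sequence is serialized: the makespan is the sum.
double makespanWithBarriers(const std::vector<Command>& cmds) {
  double finish = 0.0;
  for (const Command& c : cmds) finish += c.duration;
  return finish;
}

// With event dependencies a command starts as soon as the commands it
// names have finished. Assumes unlimited parallel execution resources.
double makespanWithEvents(const std::vector<Command>& cmds) {
  std::vector<double> finish(cmds.size(), 0.0);
  double makespan = 0.0;
  for (std::size_t i = 0; i < cmds.size(); ++i) {
    double start = 0.0;
    for (std::size_t d : cmds[i].deps) {
      assert(d < i && "dependencies must point at earlier commands");
      start = std::max(start, finish[d]);
    }
    finish[i] = start + cmds[i].duration;
    makespan = std::max(makespan, finish[i]);
  }
  return makespan;
}

// Hypothetical workload: two independent chains of two commands each.
std::vector<Command> exampleWorkload() {
  return {
      {1.0, {}},   // 0: chain A, step 1
      {1.0, {}},   // 1: chain B, step 1
      {1.0, {0}},  // 2: chain A, step 2
      {1.0, {1}},  // 3: chain B, step 2
  };
}
```

In this model the barrier version takes the sum of all durations, while the event version takes only the length of the longest dependency chain. Real gains depend on how much independent work the application actually submits and on per-command overhead, which the model ignores.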

pjaaskel commented 11 months ago

30-40% is still significant. Any clue what drags chipStar down still?

franz commented 11 months ago

@pjaaskel no, not yet.

zjin-lcf commented 11 months ago

@franz

Do you mean the SYCL compiler enables fast math by default? I checked the Makefile and it does not have the fast math flag.

linehill commented 11 months ago

Do you mean the SYCL compiler enables fast math by default? I checked the Makefile and it does not have the fast math flag.

This depends on the compiler. Intel's icpx compiler turns the fast math flag on (and also sets the optimization level to -O2) by default, while GCC and Clang do not.
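
One way to check what a given build is doing, rather than relying on the Makefile, is to test the `__FAST_MATH__` macro, which GCC and Clang predefine when `-ffast-math` is in effect. The sketch below uses hypothetical helper names. Whether a particular icpx release defines the macro under its default settings is an assumption worth verifying on your own toolchain.

```cpp
#include <string>

// Reports at compile time whether the compiler predefined __FAST_MATH__.
// GCC and Clang define it under -ffast-math; behaviour of other
// compilers (including icpx defaults) is an assumption to verify.
constexpr bool fastMathEnabled() {
#ifdef __FAST_MATH__
  return true;
#else
  return false;
#endif
}

// Human-readable label, e.g. for printing once at benchmark start-up.
std::string fastMathLabel() {
  return fastMathEnabled() ? "fast-math on" : "fast-math off";
}
```

Printing `fastMathLabel()` from both the SYCL and the HIP build of a benchmark gives a quick sanity check that the two are compared under the same floating-point settings. If they differ, an explicit flag such as `-fno-fast-math` (or, for icpx, `-fp-model=precise`) on both sides should level the comparison.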