CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.
Other
166 stars 27 forks source link

hip_interop samples crashing at runtime and how to determine if chipStar is using a LZ command queue or LZ command list #780

Closed colleeneb closed 4 months ago

colleeneb commented 4 months ago

The background of this is that the hip_interop samples were crashing at runtime. Brice and I looked into this yesterday.

What we noticed is that in the hip_interop sample, the SYCL queue is created by getting LZ handlers from a call to hipGetBackendNativeHandles and then passing it to make_queue.

We note that in the nativeHandlers array in the hipinterop example we are always getting a LZ command queue. As you can see here getBackendHandles only stores and returns `ZeCmdQbut notZeCmdListImm_` which it looks like is in chipstar::Queue.

This causes an issue since the hip_interop code uses the environment variable CHIP_L0_IMM_CMD_LIST to switch between expecting a LZ immediate command list and LZ command queue (https://github.com/CHIP-SPV/chipStar/blob/3158b680715c6f4f445d71589d4d979d3ed6f731/samples/hip_sycl_interop/onemkl_gemm_wrapper/onemkl_gemm_wrapper.cpp#L117, and defaults to the LZ immediate command list), and so by default it will make a SYCL queue expecting that a LZ immediate command list was passed to it, when in fact a LZ command queue was passed to it (since the nativeHandlers array is only ever storing a LZ command queue).

We could get the example to work if we use CHIP_L0_IMM_CMD_LIST=0 so that the SYCL queue was made from the LZ command queue which the nativeHandlers array contained (and not expected an immediate command list).

But the root of the issue is that we need a good way to determine what chipStar is returning, a LZ command queue or a LZ command list. Two options Brice mentioned are: add a new query to see what LZ is returning, or add a new boolean argument to hipGetBackendNativeHandles to indicate which type is returned.

pvelesko commented 4 months ago

@colleeneb @Kerilk this is known https://github.com/CHIP-SPV/chipStar/issues/694

And I've only been testing RCL and it still didn't work for me so there must a bigger issue at hand. Did you test with 2024? Because 2023 works interop works it's the 2024 that doesn't.

I've removed the code that checks the environment variable and just always set isImmCmdList = false since we never return a command list even when ICL is on.

icpx 2024 + MKL 2023

Currently Loaded Modulefiles:
 1) oneapi/tbb/latest   2) oneapi/compiler-rt/latest   3) oneapi/oclfpga/latest   4) oneapi/compiler/2024.0.2   5) llvm/18.0-unpatched-spirv   6) oneapi/mkl/2023.2.3  
[ 96%] Linking CXX executable cuda-sobolqrng
/usr/bin/ld: onemkl_gemm_wrapper_no_buffers/libonemkl_gemm_wrapper_no_buffers.so: undefined reference to `sycl::_V1::ext::oneapi::level_zero::make_queue(sycl::_V1::context const&, sycl::_V1::device const&, unsigned long, bool, bool, sycl::_V1::property_list const&)'
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [samples/hip_sycl_interop_no_buffers/CMakeFiles/hip_sycl_interop_no_buffers.dir/build.make:101: samples/hip_sycl_interop_no_buffers/hip_sycl_interop_no_buffers] Error 1
make[1]: *** [CMakeFiles/Makefile2:23012: samples/hip_sycl_interop_no_buffers/CMakeFiles/hip_sycl_interop_no_buffers.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 96%] Built target cuda-sobolqrng
[ 96%] Linking CXX executable shuffles
[ 96%] Built target shuffles
/usr/bin/ld: onemkl_gemm_wrapper/libonemkl_gemm_wrapper.so: undefined reference to `sycl::_V1::ext::oneapi::level_zero::make_queue(sycl::_V1::context const&, sycl::_V1::device const&, unsigned long, bool, bool, sycl::_V1::property_list const&)'
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [samples/hip_sycl_interop/CMakeFiles/hip_sycl_interop.dir/build.make:102: samples/hip_sycl_interop/hip_sycl_interop] Error 1
make[1]: *** [CMakeFiles/Makefile2:22959: samples/hip_sycl_interop/CMakeFiles/hip_sycl_interop.dir/all] Error 2

icpx 2024 + mkl 2024

Currently Loaded Modulefiles:
 1) oneapi/tbb/latest   2) oneapi/compiler-rt/latest   3) oneapi/oclfpga/latest   4) oneapi/compiler/2024.0.2   5) llvm/18.0-unpatched-spirv   6) oneapi/mkl/2024.0  
1370: Test command: /space/pvelesko/chipStar/main/build/samples/hip_sycl_interop/hip_sycl_interop
1370: Test timeout computed to be: 60
1370: Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.

icpx 2023 + mkl 2023 works as always

Currently Loaded Modulefiles:
 1) oneapi/mkl/2023.2.3   2) oneapi/compiler/2023.2.3   3) llvm/18.0-unpatched-spirv  
1370: Test command: /space/pvelesko/chipStar/main/build/samples/hip_sycl_interop/hip_sycl_interop
1370: Test timeout computed to be: 60

1370: Verify results between OneMKL & Serial: SUCCESS - The results are correct!

1370: CHIP warning [TID 2250176] [1708650088.606763971] : Not all user created streams have been destoyed... Queues remaining: 1
1370: CHIP warning [TID 2250176] [1708650088.606891736] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
1370: CHIP warning [TID 2250176] [1708650088.606895333] : Removing user-created streams without calling a destructor
1/2 Test #1370: hip_sycl_interop .................   Passed    3.90 sec
test 1371
    Start 1371: hip_sycl_interop_no_buffers

1371: Test command: /space/pvelesko/chipStar/main/build/samples/hip_sycl_interop_no_buffers/hip_sycl_interop_no_buffers
1371: Test timeout computed to be: 60
1371: Verify results between OneMKL & Serial: SUCCESS - The results are correct!
1371: CHIP warning [TID 2250227] [1708650089.468494830] : Not all user created streams have been destoyed... Queues remaining: 1
1371: CHIP warning [TID 2250227] [1708650089.468586863] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
1371: CHIP warning [TID 2250227] [1708650089.468590377] : Removing user-created streams without calling a destructor
2/2 Test #1371: hip_sycl_interop_no_buffers ......   Passed    0.80 sec

The following tests passed:
        hip_sycl_interop
        hip_sycl_interop_no_buffers
pvelesko commented 4 months ago

Maybe it's just my systems but either way I don't see the need for the ICL/RCL query at all since we never return a command list - always the command queue.

I opened #781 please let me know if that resolves your issues when using icpx 2024 + mkl 2024 @colleeneb @Kerilk

colleeneb commented 4 months ago

Thanks! I'll try out the patch in #781 tomorrow!

I was testing 2024.0 (both MKL and SDK) on Aurora (oneapi/release/2023.12.15.001) with LZ backend and IMM=off, and that worked for both the examples (i.e. the SUCCESS - The results are correct!). I do see Not all user created streams have been destoyed though.

> CHIP_L0_IMM_CMD_LIST=0 ./samples/hip_sycl_interop_no_buffers/hip_sycl_interop_no_buffers 
Verify results between OneMKL & Serial: SUCCESS - The results are correct!
CHIP warning [TID 89254] [1708651711.335928654] : Not all user created streams have been destoyed... Queues remaining: 1
CHIP warning [TID 89254] [1708651711.335998060] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
CHIP warning [TID 89254] [1708651711.336001833] : Removing user-created streams without calling a destructor

> CHIP_L0_IMM_CMD_LIST=0 ./samples/hip_sycl_interop/hip_sycl_interop
Verify results between OneMKL & Serial: SUCCESS - The results are correct!
CHIP warning [TID 89276] [1708651721.954478094] : Not all user created streams have been destoyed... Queues remaining: 1
CHIP warning [TID 89276] [1708651721.954544758] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
CHIP warning [TID 89276] [1708651721.954548619] : Removing user-created streams without calling a destructor

icpx 2024 + MKL 2023 --> this I guess doesn't work due to the make_queue API changes and for icpx 2024 + mkl 2024 are you using OpenCL? (I'm wondering since the error is about clGetPlatformIDs). I didn't test the OpenCL backend, I'll do it tomorrow too! Thanks a lot!

And thanks for the pointer to #694, I missed it before!

pvelesko commented 4 months ago

and for icpx 2024 + mkl 2024 are you using OpenCL?

no, I only tested Level Zero.

pvelesko commented 4 months ago

@colleeneb please close if resolved

colleeneb commented 4 months ago

781 did result in the two hip_sycl_interop test working, thanks!

I did notice that the two sycl_chip_interop are segfaulting on Aurora, and I'm wondering if it's something similar going on. If I have a chance I'll look at it and open another issue if needed!