Closed colleeneb closed 4 months ago
@colleeneb @Kerilk this is known https://github.com/CHIP-SPV/chipStar/issues/694
And I've only been testing RCL and it still didn't work for me so there must a bigger issue at hand. Did you test with 2024? Because 2023 works interop works it's the 2024 that doesn't.
I've removed the code that checks the environment variable and just always set isImmCmdList = false
since we never return a command list even when ICL is on.
icpx 2024 + MKL 2023
Currently Loaded Modulefiles:
1) oneapi/tbb/latest 2) oneapi/compiler-rt/latest 3) oneapi/oclfpga/latest 4) oneapi/compiler/2024.0.2 5) llvm/18.0-unpatched-spirv 6) oneapi/mkl/2023.2.3
[ 96%] Linking CXX executable cuda-sobolqrng
/usr/bin/ld: onemkl_gemm_wrapper_no_buffers/libonemkl_gemm_wrapper_no_buffers.so: undefined reference to `sycl::_V1::ext::oneapi::level_zero::make_queue(sycl::_V1::context const&, sycl::_V1::device const&, unsigned long, bool, bool, sycl::_V1::property_list const&)'
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [samples/hip_sycl_interop_no_buffers/CMakeFiles/hip_sycl_interop_no_buffers.dir/build.make:101: samples/hip_sycl_interop_no_buffers/hip_sycl_interop_no_buffers] Error 1
make[1]: *** [CMakeFiles/Makefile2:23012: samples/hip_sycl_interop_no_buffers/CMakeFiles/hip_sycl_interop_no_buffers.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 96%] Built target cuda-sobolqrng
[ 96%] Linking CXX executable shuffles
[ 96%] Built target shuffles
/usr/bin/ld: onemkl_gemm_wrapper/libonemkl_gemm_wrapper.so: undefined reference to `sycl::_V1::ext::oneapi::level_zero::make_queue(sycl::_V1::context const&, sycl::_V1::device const&, unsigned long, bool, bool, sycl::_V1::property_list const&)'
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [samples/hip_sycl_interop/CMakeFiles/hip_sycl_interop.dir/build.make:102: samples/hip_sycl_interop/hip_sycl_interop] Error 1
make[1]: *** [CMakeFiles/Makefile2:22959: samples/hip_sycl_interop/CMakeFiles/hip_sycl_interop.dir/all] Error 2
icpx 2024 + mkl 2024
Currently Loaded Modulefiles:
1) oneapi/tbb/latest 2) oneapi/compiler-rt/latest 3) oneapi/oclfpga/latest 4) oneapi/compiler/2024.0.2 5) llvm/18.0-unpatched-spirv 6) oneapi/mkl/2024.0
1370: Test command: /space/pvelesko/chipStar/main/build/samples/hip_sycl_interop/hip_sycl_interop
1370: Test timeout computed to be: 60
1370: Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.
icpx 2023 + mkl 2023 works as always
Currently Loaded Modulefiles:
1) oneapi/mkl/2023.2.3 2) oneapi/compiler/2023.2.3 3) llvm/18.0-unpatched-spirv
1370: Test command: /space/pvelesko/chipStar/main/build/samples/hip_sycl_interop/hip_sycl_interop
1370: Test timeout computed to be: 60
1370: Verify results between OneMKL & Serial: SUCCESS - The results are correct!
1370: CHIP warning [TID 2250176] [1708650088.606763971] : Not all user created streams have been destoyed... Queues remaining: 1
1370: CHIP warning [TID 2250176] [1708650088.606891736] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
1370: CHIP warning [TID 2250176] [1708650088.606895333] : Removing user-created streams without calling a destructor
1/2 Test #1370: hip_sycl_interop ................. Passed 3.90 sec
test 1371
Start 1371: hip_sycl_interop_no_buffers
1371: Test command: /space/pvelesko/chipStar/main/build/samples/hip_sycl_interop_no_buffers/hip_sycl_interop_no_buffers
1371: Test timeout computed to be: 60
1371: Verify results between OneMKL & Serial: SUCCESS - The results are correct!
1371: CHIP warning [TID 2250227] [1708650089.468494830] : Not all user created streams have been destoyed... Queues remaining: 1
1371: CHIP warning [TID 2250227] [1708650089.468586863] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
1371: CHIP warning [TID 2250227] [1708650089.468590377] : Removing user-created streams without calling a destructor
2/2 Test #1371: hip_sycl_interop_no_buffers ...... Passed 0.80 sec
The following tests passed:
hip_sycl_interop
hip_sycl_interop_no_buffers
Maybe it's just my systems but either way I don't see the need for the ICL/RCL query at all since we never return a command list - always the command queue.
I opened #781 please let me know if that resolves your issues when using icpx 2024 + mkl 2024 @colleeneb @Kerilk
Thanks! I'll try out the patch in #781 tomorrow!
I was testing 2024.0 (both MKL and SDK) on Aurora (oneapi/release/2023.12.15.001
) with LZ backend and IMM=off, and that worked for both the examples (i.e. the SUCCESS - The results are correct!
). I do see Not all user created streams have been destoyed
though.
> CHIP_L0_IMM_CMD_LIST=0 ./samples/hip_sycl_interop_no_buffers/hip_sycl_interop_no_buffers
Verify results between OneMKL & Serial: SUCCESS - The results are correct!
CHIP warning [TID 89254] [1708651711.335928654] : Not all user created streams have been destoyed... Queues remaining: 1
CHIP warning [TID 89254] [1708651711.335998060] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
CHIP warning [TID 89254] [1708651711.336001833] : Removing user-created streams without calling a destructor
> CHIP_L0_IMM_CMD_LIST=0 ./samples/hip_sycl_interop/hip_sycl_interop
Verify results between OneMKL & Serial: SUCCESS - The results are correct!
CHIP warning [TID 89276] [1708651721.954478094] : Not all user created streams have been destoyed... Queues remaining: 1
CHIP warning [TID 89276] [1708651721.954544758] : Make sure to call hipStreamDestroy() for all queues that have been created via hipStreamCreate()
CHIP warning [TID 89276] [1708651721.954548619] : Removing user-created streams without calling a destructor
icpx 2024 + MKL 2023
--> this I guess doesn't work due to the make_queue API changes and for icpx 2024 + mkl 2024
are you using OpenCL? (I'm wondering since the error is about clGetPlatformIDs
). I didn't test the OpenCL backend, I'll do it tomorrow too! Thanks a lot!
And thanks for the pointer to #694, I missed it before!
and for icpx 2024 + mkl 2024 are you using OpenCL?
no, I only tested Level Zero.
@colleeneb please close if resolved
I did notice that the two sycl_chip_interop
are segfaulting on Aurora, and I'm wondering if it's something similar going on. If I have a chance I'll look at it and open another issue if needed!
The background of this is that the hip_interop samples were crashing at runtime. Brice and I looked into this yesterday.
What we noticed is that in the hip_interop sample, the SYCL queue is created by getting LZ handlers from a call to hipGetBackendNativeHandles and then passing it to make_queue.
We note that in the nativeHandlers array in the hipinterop example we are always getting a LZ command queue. As you can see here getBackendHandles only stores and returns `ZeCmdQ
but not
ZeCmdListImm_` which it looks like is in chipstar::Queue.This causes an issue since the hip_interop code uses the environment variable
CHIP_L0_IMM_CMD_LIST
to switch between expecting a LZ immediate command list and LZ command queue (https://github.com/CHIP-SPV/chipStar/blob/3158b680715c6f4f445d71589d4d979d3ed6f731/samples/hip_sycl_interop/onemkl_gemm_wrapper/onemkl_gemm_wrapper.cpp#L117, and defaults to the LZ immediate command list), and so by default it will make a SYCL queue expecting that a LZ immediate command list was passed to it, when in fact a LZ command queue was passed to it (since the nativeHandlers array is only ever storing a LZ command queue).We could get the example to work if we use
CHIP_L0_IMM_CMD_LIST=0
so that the SYCL queue was made from the LZ command queue which the nativeHandlers array contained (and not expected an immediate command list).But the root of the issue is that we need a good way to determine what chipStar is returning, a LZ command queue or a LZ command list. Two options Brice mentioned are: add a new query to see what LZ is returning, or add a new boolean argument to hipGetBackendNativeHandles to indicate which type is returned.