CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.
Other
182 stars 29 forks source link

sycl_interop fails: "Can't compile module. Level zero does not support multi-input compilation." #735

Closed pjaaskel closed 8 months ago

pjaaskel commented 8 months ago

This is with the latest compute-runtime drivers and intel-oneapi-dpcpp-cpp-2023.2.1.

731 doesn't help. @linehill "multi-input" rings the bell related to the SPIR-V device-feature-based linking. Could that cause issues like this?

linehill commented 8 months ago

Maybe the level0 driver does not support ZE_experimental_module_program anymore. I hope the extension is just promoted non-experimental extension.

pjaaskel commented 8 months ago

But should it fail more tests?

My level zero config now (still fails):

dpkg -l|grep -i "level zero"
ii  intel-level-zero-gpu                             1.3.27191.9                             amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-level-zero-gpu-dev                         1.3.26241.33-647~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero
ii  level-zero                                       1.14.0                                  amd64        oneAPI Level Zero
ii  level-zero-devel                                 1.14.0                                  amd64        oneAPI Level Zero
pjaaskel commented 8 months ago

@Sarbojit2019 does the interop work for you? Which packages you have installed?

pjaaskel commented 8 months ago

Confirmed that it loads the latest LZ driver:

strace samples/sycl_hip_interop/sycl_hip_interop_driver/sycl_chip_interop 2>&1|egrep "libze.*so"|grep -v "ENOENT"
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libze_loader.so.1", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libze_intel_gpu.so.1", O_RDONLY|O_CLOEXEC) = 3
pjaaskel commented 8 months ago

Interestingly, my RC3 build started to fail with the usual LLVM cmdline conflict problem after I rebuilt it:

samples/sycl_hip_interop/sycl_hip_interop_driver/sycl_chip_interop
sycl_chip_interop: /home/pjaaskel/Downloads/chipStar-1.1/llvm-project/llvm/include/llvm/Support/CommandLine.h:864: void llvm::cl::parser<DataType>::addLiteralOption(llvm::StringRef, const DT&, llvm::StringRef) [with DT = llvm::FunctionPass* (*)(); DataType = llvm::FunctionPass* (*)()]: Assertion `findOption(Name) == Values.size() && "Option already exists!"' failed.
pvelesko commented 8 months ago

hip_sycl segfaults

Starting program: /home/pvelesko/chipStar/build/samples/hip_sycl_interop/hip_sycl_interop
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
    add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py
line to your configuration file "/home/pvelesko/.config/gdb/gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/pvelesko/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
CHIP debug [TID 111945] [1703150689.844275005] : CHIP_PLATFORM=0
CHIP debug [TID 111945] [1703150689.844338789] : CHIP_DEVICE_TYPE=default
CHIP debug [TID 111945] [1703150689.844341388] : CHIP_DEVICE=0
CHIP debug [TID 111945] [1703150689.844346830] : CHIP_BE=level0
CHIP debug [TID 111945] [1703150689.844348818] : CHIP_DUMP_SPIRV=off
CHIP debug [TID 111945] [1703150689.844351751] : CHIP_JIT_FLAGS_OVERRIDE=-cl-kernel-arg-info -cl-std=CL3.0
CHIP debug [TID 111945] [1703150689.844353863] : CHIP_L0_IMM_CMD_LISTS=on
CHIP debug [TID 111945] [1703150689.844355700] : CHIP_L0_COLLECT_EVENTS_TIMEOUT=0
CHIP debug [TID 111945] [1703150689.844383922] : __hipRegisterFatBinary
CHIP debug [TID 111945] [1703150689.844387041] : CHIPDriver Initialize
CHIP debug [TID 111945] [1703150689.844388841] : CHIPBE=LEVEL0... Initializing Level0 Backend
CHIP debug [TID 111945] [1703150689.844397758] : Backend Base Constructor
[New Thread 0x7fffc7dff640 (LWP 111948)]
[New Thread 0x7fffc75fe640 (LWP 111949)]
[New Thread 0x7fffc6dfd640 (LWP 111950)]
CHIP debug [TID 111945] [1703150689.945490404] : Driver version: 1.3.27191
[Thread 0x7fffc75fe640 (LWP 111949) exited]
CHIP debug [TID 111945] [1703150689.945539822] : Queue() 0x555557e99d30
CHIP debug [TID 111945] [1703150689.945643113] : Starting chipstar::Event Monitor Thread
[New Thread 0x7fffc65fc640 (LWP 111951)]
[Thread 0x7fffc6dfd640 (LWP 111950) exited]
CHIP debug [TID 111945] [1703150689.945781995] : Thread Created with ID : 140736521553472
CHIP debug [TID 111945] [1703150689.945789534] : Starting chipstar::Event Monitor Thread
[New Thread 0x7fffc5dfb640 (LWP 111952)]
CHIP debug [TID 111945] [1703150689.945959315] : Thread Created with ID : 140736513160768
CHIP debug [TID 111945] [1703150689.945977966] : Bundle entry ID 0 is: 'host-x86_64-unknown-linux--'

CHIP debug [TID 111945] [1703150689.945981228] : Not a SPIR-V triple, ignoring

CHIP debug [TID 111945] [1703150689.945985923] : Bundle entry ID 1 is: 'hip-spirv64----generic'

CHIP debug [TID 111945] [1703150689.945998369] : Registered SPIR-V module 0x555557facfe0, source-binary=0x555555559000
CHIP debug [TID 111945] [1703150689.946005502] : chipstar::Module 0x555557facfe0: Register variable (0x55555555b0f8) size=4, name=__chipspv_abort_called
CHIP debug [TID 111945] [1703150689.946017766] : hipGetCHIPBackend
CHIP debug [TID 111945] [1703150689.946031781] : Queue() 0x555557f9f4f0
CHIP debug [TID 111945] [1703150689.946120505] : 0x555557fa03a0 Device::addQueue(0x555557f9f4f0)
CHIP debug [TID 111945] [1703150689.946124050] : Queue 0x555557f9f4f0 added to the queue vector for device 0x555557fa03a0
CHIP debug [TID 111945] [1703150689.946128700] : hipGetBackendNativeHandles
[New Thread 0x7fffb95ff640 (LWP 111954)]

Thread 1 "hip_sycl_intero" received signal SIGSEGV, Segmentation fault.
0x00007fffc861c618 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
(gdb)
(gdb) bt
#0  0x00007fffc861c618 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#1  0x00007fffc7ef2ad9 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007fffca123bc4 in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) () from /opt/intel/oneapi/compiler/2024.0/lib/libpi_level_zero.so
#3  0x00007fffca124767 in urEnqueueMemBufferWrite () from /opt/intel/oneapi/compiler/2024.0/lib/libpi_level_zero.so
#4  0x00007fffca14c09c in piEnqueueMemBufferWrite () from /opt/intel/oneapi/compiler/2024.0/lib/libpi_level_zero.so
#5  0x00007ffff603c4e5 in sycl::_V1::detail::copyH2D(sycl::_V1::detail::SYCLMemObjI*, char*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, _pi_mem*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, std::vector<_pi_event*, std::allocator<_pi_event*> >, _pi_event*&, std::shared_ptr<sycl::_V1::detail::event_impl> const&) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#6  0x00007ffff603df82 in sycl::_V1::detail::MemoryManager::copy(sycl::_V1::detail::SYCLMemObjI*, void*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, void*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned int, sycl::_V1::range<3>, sycl::_V1::range<3>, sycl::_V1::id<3>, unsigned int, std::vector<_pi_event*, std::allocator<_pi_event*> >, _pi_event*&, std::shared_ptr<sycl::_V1::detail::event_impl> const&) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#7  0x00007ffff60bbb91 in sycl::_V1::detail::MemCpyCommand::enqueueImp() () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#8  0x00007ffff60b2f00 in sycl::_V1::detail::Command::enqueue(sycl::_V1::detail::EnqueueResultT&, sycl::_V1::detail::BlockingT, std::vector<sycl::_V1::detail::Command*, std::allocator<sycl::_V1::detail::Command*> >&) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#9  0x00007ffff60dca14 in sycl::_V1::detail::Scheduler::GraphProcessor::enqueueCommand(sycl::_V1::detail::Command*, std::shared_lock<std::shared_timed_mutex>&, sycl::_V1::detail::EnqueueResultT&, std::vector<sycl::_V1::detail::Command*, std::allocator<sycl::_V1::detail::Command*> >&, sycl::_V1::detail::Command*, sycl::_V1::detail::BlockingT) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#10 0x00007ffff60dc9f7 in sycl::_V1::detail::Scheduler::GraphProcessor::enqueueCommand(sycl::_V1::detail::Command*, std::shared_lock<std::shared_timed_mutex>&, sycl::_V1::detail::EnqueueResultT&, std::vector<sycl::_V1::detail::Command*, std::allocator<sycl::_V1::detail::Command*> >&, sycl::_V1::detail::Command*, sycl::_V1::detail::BlockingT) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#11 0x00007ffff60d7c7c in sycl::_V1::detail::Scheduler::enqueueCommandForCG(std::shared_ptr<sycl::_V1::detail::event_impl>, std::vector<sycl::_V1::detail::Command*, std::allocator<sycl::_V1::detail::Command*> >&, sycl::_V1::detail::BlockingT) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#12 0x00007ffff60d73f8 in sycl::_V1::detail::Scheduler::addCG(std::unique_ptr<sycl::_V1::detail::CG, std::default_delete<sycl::_V1::detail::CG> >, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, _pi_ext_command_buffer*, std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#13 0x00007ffff610fdb2 in sycl::_V1::handler::finalize() () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#14 0x00007ffff6099b51 in void sycl::_V1::detail::queue_impl::finalizeHandler<sycl::_V1::handler>(sycl::_V1::handler&, sycl::_V1::detail::CG::CGTYPE const&, sycl::_V1::event&) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#15 0x00007ffff6099551 in sycl::_V1::detail::queue_impl::submit_impl(std::function<void (sycl::_V1::handler&)> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::code_location const&, std::function<void (bool, bool, sycl::_V1::event&)> const*) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#16 0x00007ffff613d106 in sycl::_V1::detail::queue_impl::submit(std::function<void (sycl::_V1::handler&)> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::code_location const&, std::function<void (bool, bool, sycl::_V1::event&)> const*) () from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#17 0x00007ffff613d0c5 in sycl::_V1::queue::submit_impl(std::function<void (sycl::_V1::handler&)>, sycl::_V1::detail::code_location const&) ()
   from /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7
#18 0x00007ffff0c42ee0 in oneapi::mkl::gpu::launch_kernel_3D(int*, sycl::_V1::queue*, mkl_gpu_kernel_struct_t*, mkl_gpu_argument_t*, unsigned long*, unsigned long*, mkl_gpu_event_list_t*) () from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#19 0x00007ffff0c38a11 in oneapi::mkl::gpu::have_binary_kernels(int*, sycl::_V1::queue*) () from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#20 0x00007ffff15e82ec in oneapi::mkl::gpu::mkl_blas_gpu_sgemm_driver_sycl(int*, sycl::_V1::queue*, oneapi::mkl::gpu::blas_arg_buffer_t*, mkl_gpu_event_list_t*) ()
   from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#21 0x00007ffff15d27b7 in oneapi::mkl::gpu::sgemm_sycl_internal(sycl::_V1::queue*, MKL_LAYOUT, MKL_TRANSPOSE, MKL_TRANSPOSE, long, long, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>*, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>*, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>*, long, oneapi::mkl::blas::compute_mode, long, long, long) () from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#22 0x00007ffff15cfb86 in oneapi::mkl::gpu::sgemm_sycl(sycl::_V1::queue*, MKL_LAYOUT, MKL_TRANSPOSE, MKL_TRANSPOSE, long, long, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>*, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>*, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>*, long, oneapi::mkl::blas::compute_mode, long, long, long) () from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#23 0x00007ffff2ead4ce in oneapi::mkl::blas::sgemm(sycl::_V1::queue&, MKL_LAYOUT, oneapi::mkl::transpose, oneapi::mkl::transpose, long, long, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, oneapi::mkl::blas::compute_mode) () from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#24 0x00007ffff2eacfb0 in oneapi::mkl::blas::column_major::gemm(sycl::_V1::queue&, oneapi::mkl::transpose, oneapi::mkl::transpose, long, long, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, float, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, oneapi::mkl::blas::compute_mode) () from /opt/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so.4
#25 0x00007ffff7e588b3 in onemkl_gemm (my_queue=..., A=<optimized out>, B=<optimized out>, C=0x555557f9f1b0, m=10, n=10, k=10, ldA=10, ldB=10, ldC=10,
    alpha=<error reading variable: That operation is not available on integers of more than 8 bytes.>,
    beta=<error reading variable: That operation is not available on integers of more than 8 bytes.>)
    at /home/pvelesko/chipStar/samples/hip_sycl_interop/onemkl_gemm_wrapper/onemkl_gemm_wrapper.cpp:66
#26 0x00007ffff7e59633 in oneMKLGemmTest (nativeHandlers=<optimized out>, hip_backend=<optimized out>, A=<optimized out>, B=<optimized out>, C=<optimized out>, M=<optimized out>,
    N=<optimized out>, K=<optimized out>, ldA=<optimized out>, ldB=<optimized out>, ldC=<optimized out>, alpha=<optimized out>, beta=<optimized out>)
    at /home/pvelesko/chipStar/samples/hip_sycl_interop/onemkl_gemm_wrapper/onemkl_gemm_wrapper.cpp:124
#27 0x00005555555566d6 in main () at /home/pvelesko/chipStar/samples/hip_sycl_interop/hip_sycl_interop.cpp:120
Sarbojit2019 commented 8 months ago

@pekka I can successfully compile interop samples on my PVC machine.

 ~ $ dpkg -l|grep -i "level zero"
ii  intel-level-zero-gpu                       1.3.26690.29-704~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero                                 1.12.0-693~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                             1.12.0-693~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
pjaaskel commented 8 months ago

@Sarbojit2019 yes, compilation works fine, but what about running ctest sycl?

pvelesko commented 8 months ago

downgraded icpx and MKL to 2023.2 and Level Zero JIT still fails but hip-sycl passes now.

pjaaskel commented 8 months ago

This is a regression between RC1 and RC3 (tested with unpatched LLVM 17). Verified that it works in RC1 (and patched LLVM 16):

ctest -R sycl
Test project /home/pjaaskel/src/chipStar_1_1/build
    Start 1362: sycl_chip_interop
1/4 Test #1362: sycl_chip_interop ................   Passed    1.14 sec
    Start 1363: sycl_chip_interop_usm
2/4 Test #1363: sycl_chip_interop_usm ............   Passed    1.14 sec
    Start 1364: hip_sycl_interop
3/4 Test #1364: hip_sycl_interop .................   Passed    0.77 sec
    Start 1365: hip_sycl_interop_no_buffers
4/4 Test #1365: hip_sycl_interop_no_buffers ......   Passed    0.77 sec

100% tests passed, 0 tests failed out of 4
Sarbojit2019 commented 8 months ago

@pvelesko, For me if I pass immediate command list flag as True it crashes. If I change it to false test passes. I will check internally.

sycl_queue = sycl::ext::oneapi::level_zero::make_queue(....., true, 1,......);

to sycl_queue = sycl::ext::oneapi::level_zero::make_queue(..., false, 1, .......);

pjaaskel commented 8 months ago

@Sarbojit2019 can you send a proper patch and I'll try if it helps here. I doubt. My problem is likely related to SYCL Plugin Interface not proxying the required multi-input extension to L0 (which is required in RC3 due to merging in the SPIR-V feature-based linking).