CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

libCEED JIT Failures #562

Open pvelesko opened 11 months ago

pvelesko commented 11 months ago

Using /gpu/hip/gen

Test Summary Report
-------------------
t550-operator       (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
t552-operator       (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1
t554-operator       (Wstat: 0 Tests: 1 Failed: 1)
  Failed test:  1

Level Zero Failures:

CHIP error [TID 17835] [1690448929.470865149] : hipErrorLaunchFailure (Failed to find kernel via kernel name: CeedKernelHipGenOperator_Scale) in /home/pvelesko/chipStar/main/src/CHIPBackend.cc:265:getKernelByName

CHIP error [TID 17835] [1690448929.470951368] : Caught Error: hipErrorLaunchFailure
/home/pvelesko/libCEED/backends/hip/ceed-hip-compile.cpp:125 in CeedGetKernelHip(): hipErrorLaunchFailure
Aborted (core dumped)

OpenCL in these cases either hangs:

CHIP_BE=opencl ./build/t550-operator /gpu/hip/gen
timeout

Thread 1 "t550-operator" received signal SIGINT, Interrupt.
0x000015552a68890b in ?? () from /soft/libraries/intel-gpu-umd/20230622.1-agama-devel-647.9/driver/lib64/intel-opencl/libigdrcl.so
(gdb) btr
Undefined command: "btr".  Try "help".
(gdb) bt
#0  0x000015552a68890b in ?? () from /soft/libraries/intel-gpu-umd/20230622.1-agama-devel-647.9/driver/lib64/intel-opencl/libigdrcl.so
#1  0x000015552a824236 in ?? () from /soft/libraries/intel-gpu-umd/20230622.1-agama-devel-647.9/driver/lib64/intel-opencl/libigdrcl.so
#2  0x000015552a621300 in ?? () from /soft/libraries/intel-gpu-umd/20230622.1-agama-devel-647.9/driver/lib64/intel-opencl/libigdrcl.so
#3  0x000015552a822b7a in ?? () from /soft/libraries/intel-gpu-umd/20230622.1-agama-devel-647.9/driver/lib64/intel-opencl/libigdrcl.so
#4  0x000015552a5c66fa in ?? () from /soft/libraries/intel-gpu-umd/20230622.1-agama-devel-647.9/driver/lib64/intel-opencl/libigdrcl.so
#5  0x0000155554c54428 in cl::CommandQueue::finish (this=0x555557ee8610) at /home/pvelesko/chipStar/main/include/CL/opencl.hpp:8806
#6  0x0000155554c49994 in CHIPQueueOpenCL::finish (this=0x555557476ee0) at /home/pvelesko/chipStar/main/src/backend/OpenCL/CHIPBackendOpenCL.cc:1181
#7  0x0000155554b84a92 in chipstar::Queue::memCopy (this=0x555557476ee0, Dst=0x5555585397f0, Src=0xff9055552ed000, Size=496) at /home/pvelesko/chipStar/main/src/CHIPBackend.cc:1559
#8  0x0000155554bf58d0 in hipMemcpyInternal (Dst=0x5555585397f0, Src=0xff9055552ed000, SizeBytes=496, Kind=hipMemcpyDeviceToHost) at /home/pvelesko/chipStar/main/src/CHIPBindings.cc:2981
#9  0x0000155554bf592d in hipMemcpy (Dst=0x5555585397f0, Src=0xff9055552ed000, SizeBytes=496, Kind=hipMemcpyDeviceToHost) at /home/pvelesko/chipStar/main/src/CHIPBindings.cc:2989
#10 0x0000155555165a03 in CeedVectorSyncD2H_Hip (vec=0x555557f218a0) at /home/pvelesko/libCEED/backends/hip-ref/ceed-hip-ref-vector.c:93
#11 0x0000155555163604 in CeedVectorSyncArray_Hip (vec=0x555557f218a0, mem_type=CEED_MEM_HOST) at /home/pvelesko/libCEED/backends/hip-ref/ceed-hip-ref-vector.c:109
#12 0x000015555510011c in CeedVectorSyncArray (vec=0x555557f218a0, mem_type=CEED_MEM_HOST) at /home/pvelesko/libCEED/interface/ceed-vector.c:311
#13 0x0000155555165ddf in CeedVectorGetArrayCore_Hip (vec=0x555557f218a0, mem_type=CEED_MEM_HOST, array=0x7fffffff70b0) at /home/pvelesko/libCEED/backends/hip-ref/ceed-hip-ref-vector.c:367
#14 0x00001555551637b3 in CeedVectorGetArrayRead_Hip (vec=0x555557f218a0, mem_type=CEED_MEM_HOST, array=0x7fffffff70b0) at /home/pvelesko/libCEED/backends/hip-ref/ceed-hip-ref-vector.c:386
#15 0x0000155555100351 in CeedVectorGetArrayRead (vec=0x555557f218a0, mem_type=CEED_MEM_HOST, array=0x7fffffff70b0) at /home/pvelesko/libCEED/interface/ceed-vector.c:410
#16 0x00005555555558f7 in main (argc=2, argv=0x7fffffff72e8) at /home/pvelesko/libCEED/tests/t550-operator.c:106
(gdb)

or fails with the following error:

/home/pvelesko/libCEED/interface/ceed-jit-tools.c:101 in CeedLoadSourceToInitializedBuffer(): Couldn't read source file: /home/pvelesko/libCEED/include/ceed/jit-source/gallery/ceed-scale.h
pvelesko commented 11 months ago

I was able to resolve the JIT failures by using an older runtime:

Currently Loaded Modules:
  1) mpich/51.2/icc-all-pmix-gpu   3) cray-pals/1.2.12      5) prepend-deps/default   7) cmake/3.26.4   9) gdb/13.1                          11) HIP/hipBLAS/chip-spv-latest  13) intel_compute_runtime/release/agama-devel-627
  2) libfabric/1.15.2.0            4) cray-libpals/1.2.12   6) append-deps/default    8) gcc/12.1.0    10) HIP/chipStar/llvm15/latest/debug  12) clang/clang15-spirv-omp      14) oneapi/eng-compiler/2023.05.15.003
pvelesko commented 10 months ago

@pengtu would providing the SPIR-V suffice for the reproducer?

pengtu commented 10 months ago

Yes

Peng



pvelesko commented 10 months ago
pvelesko@x1921c6s5b0n0:~/libCEED> ./build/t550-operator /gpu/hip/gen

Computed Area Coarse Grid: 0.000000 != True Area: 2.0
Computed Area Fine Grid: 0.000000 != True Area: 2.0
CHIP error [TID 5152] [1692797160.862680477] : hipErrorLaunchFailure (Failed to find kernel via kernel name: CeedKernelHipGenOperator_Scale) in /home/pvelesko/chipStar/main/src/CHIPBackend.cc:269:getKernelByName

CHIP error [TID 5152] [1692797160.866513927] : Caught Error: hipErrorLaunchFailure
/home/pvelesko/libCEED/backends/hip/ceed-hip-compile.cpp:125 in CeedGetKernelHip(): hipErrorLaunchFailure
Aborted (core dumped)

clinfo driver version: 23.17.26241.22

spirv.zip

Attached are the two SPIR-V files that have CeedKernelHipGenOperator_Scale in them.

This fails with runtime 647. It passes with runtime 627, but gives a correctness error that might be unrelated to the runtime.
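
For anyone who wants to check independently which kernel names a SPIR-V module exports, here is a minimal sketch. It is not part of chipStar or libCEED; the function name `entry_point_names` and the command-line usage are made up for illustration. It relies only on the SPIR-V binary layout (5-word header, then instructions whose first word packs the word count in the high 16 bits and the opcode in the low 16 bits, with `OpEntryPoint` being opcode 15).

```python
import struct
import sys

SPIRV_MAGIC = 0x07230203
OP_ENTRY_POINT = 15  # opcode of OpEntryPoint in the SPIR-V specification


def entry_point_names(data: bytes) -> list[str]:
    """Return the names of all OpEntryPoint instructions in a SPIR-V binary.

    Hypothetical helper for illustration; not a chipStar or libCEED API.
    """
    if len(data) < 20 or len(data) % 4 != 0:
        raise ValueError("not a SPIR-V module: bad length")

    # The magic number tells us the endianness of the word stream.
    if struct.unpack_from("<I", data)[0] == SPIRV_MAGIC:
        endian = "<"
    elif struct.unpack_from(">I", data)[0] == SPIRV_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a SPIR-V module: bad magic number")

    words = struct.unpack(f"{endian}{len(data) // 4}I", data)
    names = []
    i = 5  # skip the 5-word header
    while i < len(words):
        word_count = words[i] >> 16
        opcode = words[i] & 0xFFFF
        if word_count == 0 or i + word_count > len(words):
            raise ValueError("malformed instruction stream")
        if opcode == OP_ENTRY_POINT:
            # Operands: execution model, entry point id, name, interface ids.
            # The name is a NUL-terminated UTF-8 string packed into words,
            # first character in the lowest-order byte of each word.
            raw = b"".join(
                struct.pack("<I", w) for w in words[i + 3 : i + word_count]
            )
            names.append(raw.split(b"\0", 1)[0].decode("utf-8"))
        i += word_count
    return names


if __name__ == "__main__":
    # Usage (file names are whatever you extracted from the attachment):
    #   python list_entry_points.py module1.spv module2.spv
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, entry_point_names(f.read()))
```

If `CeedKernelHipGenOperator_Scale` shows up in the list for a module, then the name is present in the SPIR-V itself, which would suggest the lookup failure happens after the module is handed to the runtime.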

pvelesko commented 10 months ago

@pengtu

pvelesko commented 10 months ago

@pengtu Can you confirm that you received the SPIR-V and it's sufficient?

pvelesko commented 9 months ago

@pengtu

pengtu commented 9 months ago

Filed an issue with the Intel compute runtime:

https://github.com/intel/compute-runtime/issues/683

pengtu commented 8 months ago

@pvelesko: do you have the module binary that can be shared with the GPU driver team? Please check the request in the compute-runtime tracking issue linked above.