Open mehmetyusufoglu opened 1 month ago
I tried calling abort() from a SYCL kernel running on an Intel GPU, with oneAPI 2024.1, but it doesn't seem to work:
#include <cstdlib>
#include <exception>
#include <iostream>
#include <sycl.hpp>

// Exception handler for asynchronous SYCL exceptions
void exception_handler(sycl::exception_list exceptions) {
  for (const auto& e : exceptions) {
    try {
      std::rethrow_exception(e);
    } catch (const sycl::exception& e) {
      std::cout << "Caught asynchronous SYCL exception: " << e.what() << std::endl;
    }
  }
}

int main() {
  try {
    // Create a SYCL queue with an exception handler
    sycl::queue queue(exception_handler);
    // Submit a command group to the queue
    queue.submit([&](sycl::handler& cgh) {
      // Define the kernel
      cgh.single_task([=]() {
        sycl::ext::oneapi::experimental::printf("%d\n", 1);
        std::abort();
        sycl::ext::oneapi::experimental::printf("%d\n", 2);
      });
    });
    // Wait for the queue to finish executing
    queue.wait_and_throw();
  } catch (sycl::exception const& e) {
    std::cout << "Caught synchronous SYCL exception: " << e.what() << std::endl;
  } catch (std::exception const& e) {
    std::cout << "Caught std exception: " << e.what() << '\n';
  } catch (...) {
    std::cout << "Caught unexpected exception\n";
  }
  return 0;
}
fails at runtime (when the JIT compiler is called) with
Caught synchronous SYCL exception: The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Flex 170':
Module <0x3610e90>: Unresolved Symbol <abort>
Module <0x3610e90>: Unresolved Symbol <abort> -999 (Unknown PI error)
I suspect that abort() (or std::abort()) works on the OpenCL CPU backend because it finds the system-level function, but is not a portable feature of SYCL/oneAPI.
@fwyzard Thanks. I have started a CI run for the TagGpuSyclIntel case; then I will start one for TagCpuSycl (I mean the accelerators). We will see.
I have now tried it on the HAL computer. There are no compile errors, but I got the same result as you at run time.
spack load cuda@12.2
spack load intel-oneapi-compilers@2023.1.0
spack load intel-oneapi-tbb@2021.10.0
module load rocm-5.4.6
sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7452 32-Core Processor 3.0 [2023.15.3.0.20_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.52.28202.52]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28202]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A30 0.0 [CUDA 12.4]
[ext_oneapi_hip:gpu:0] AMD HIP BACKEND, AMD Radeon RX 7900 XTX 0.0 [HIP 50422.80]
COMPILE
icpx -fsycl test_abort.cpp
RUN
yusufo81@hal8999:~$ ./a.out
Caught synchronous SYCL exception: The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
Module <0x39adf80>: Unresolved Symbol <abort>
Module <0x39adf80>: Unresolved Symbol <abort> -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
I've tried looking into the SYCL specification and the oneAPI implementation, and I suspect the only portable solution may be assert(false). However:
* on my laptop's integrated GPU it does not actually do anything :-(
* on a Flex 170 datacenter GPU it causes the program to abort in a way that cannot be caught.
I'll bring it up with Intel at the next occasion - but for the time being, that seems better than nothing.
OK, I have added assert(false), thanks a lot. I cannot catch the exception on the HAL computer, so I did not test it in the test code.
1
2
AssertHandler::printMessage
test_abort.cpp:27: auto main()::(anonymous class)::operator()(sycl::handler &)::(anonymous class)::operator()() const: global id: [0,0,0], local id: [0,0,0] Assertion `false` failed.
can not catch the exception at HAL computer so i did not test it in the test code.
Unrelated to the test, but what is the HAL computer ?
An HPC system at HZDR :) It has different kinds of GPUs and is easily configurable.
@mehmetyusufoglu please rebase against develop branch to fix the CI issues
Macros for signalling an error from the device side to the host code.
For CUDA, HIP and SYCL, a user-defined message is added to the abort. For the other backends, a std::runtime_error exception carrying the user message is thrown.
Testing:
For the CpuThreads and CpuBlocks accelerators, tests can catch the std::runtime_error exceptions thrown during exec.
Aborts cannot be caught for CUDA, HIP or SYCL when we call exec (this was only checked by running a temporary failing test in CI). For CUDA, however, __trap triggers a std::runtime_error that can be caught during wait(queue).
For HIP, abort() is used, which triggers __builtin_trap().
For SYCL, assert(false) is used. std::abort produced a runtime error on our local development system, with no compile error: https://github.com/alpaka-group/alpaka/pull/2283#issuecomment-2169733670
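The per-backend behaviour described above can be sketched as one macro that picks its implementation from the compiler's device-side predefined macros. This is a minimal illustration, not the code in this PR: `SKETCH_DEVICE_ERROR` and `checked_index` are hypothetical names, and only the host fallback branch has been exercised; the CUDA, HIP and SYCL branches are untested assumptions about how the description maps to code.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical macro, for illustration only (not alpaka's actual API).
#if defined(__CUDA_ARCH__)
// CUDA device code: print the user message, then trap.
#    define SKETCH_DEVICE_ERROR(msg) do { printf("%s\n", (msg)); __trap(); } while(false)
#elif defined(__HIP_DEVICE_COMPILE__)
// HIP device code: print the user message, then abort.
#    define SKETCH_DEVICE_ERROR(msg) do { printf("%s\n", (msg)); abort(); } while(false)
#elif defined(__SYCL_DEVICE_ONLY__)
// SYCL device code: abort() is an unresolved symbol on Intel GPUs,
// so fall back to assert (a no-op when NDEBUG is defined).
#    define SKETCH_DEVICE_ERROR(msg) assert(false && (msg))
#else
// Host backends: throw an exception carrying the user message.
#    define SKETCH_DEVICE_ERROR(msg) throw std::runtime_error(msg)
#endif

// Hypothetical helper standing in for a kernel body.
inline int checked_index(int i, int size) {
    if (i < 0 || i >= size) {
        SKETCH_DEVICE_ERROR("index out of range");
    }
    return i;
}
```

On a host backend, a test can wrap the call in a try block and catch std::runtime_error; on the GPU backends there is nothing to catch in the same way, which matches the testing limits noted above.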
It turned out that the OpenMP specification mandates that a std::runtime_error be handled by the same thread that threw it; otherwise it is converted to an abort. I checked with a signal handler, and SIGABRT is fired. (Therefore the OpenMP cases could not be tested other than by a one-time failing test in CI.)
Issue
fix #2258
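For reference, the SIGABRT check mentioned in the testing notes could look roughly like the sketch below. The names `install_abort_probe`, `abort_was_seen` and `on_sigabrt` are hypothetical and this is not the code that was actually used; it only shows how a handler can record that the signal arrived.

```cpp
#include <cassert>
#include <csignal>

// Flag written from the signal handler; sig_atomic_t is safe to set there.
volatile std::sig_atomic_t abort_seen = 0;

// Handler that only records that SIGABRT was delivered.
void on_sigabrt(int) { abort_seen = 1; }

// Installs the handler; returns false if std::signal reports an error.
bool install_abort_probe() {
    return std::signal(SIGABRT, on_sigabrt) != SIG_ERR;
}

// Reports whether SIGABRT has been observed since start-up.
bool abort_was_seen() { return abort_seen != 0; }
```

Note that when the signal comes from a real std::abort() call, the process still terminates after the handler returns, so such a probe can confirm that SIGABRT was raised but cannot turn the abort into a recoverable test failure.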