alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Throw error from device to host with a msg #2258 #2283

Open mehmetyusufoglu opened 1 month ago

mehmetyusufoglu commented 1 month ago

This PR adds macros for signalling an error from the device side to the host code.

For CUDA, HIP, and SYCL, a user-defined message is added to the abort. For other backends, a std::runtime_error exception carrying the user message is thrown.
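
The split between device and host behaviour described above could look roughly like the following. This is only a sketch under stated assumptions: `DEMO_DEVICE_ERROR` and `checkedValue` are hypothetical names, not the macros added by this PR; the device branch uses `printf` plus `assert(false)` as a stand-in for whatever termination mechanism the PR actually uses; SYCL is left out; and a real kernel function would also need the usual host/device annotations.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// DEMO_DEVICE_ERROR is a hypothetical name, not the macro added by this PR.
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
// Device compilation pass of CUDA or HIP: exceptions are not available,
// so print the message and stop via a device-side assert.
// Note: assert compiles to nothing when NDEBUG is defined.
#    define DEMO_DEVICE_ERROR(msg)                                            \
        do                                                                    \
        {                                                                     \
            printf("%s\n", (msg));                                            \
            assert(false);                                                    \
        } while(false)
#else
// Host compilation (CPU backends): throw an exception carrying the message.
#    define DEMO_DEVICE_ERROR(msg) throw std::runtime_error(msg)
#endif

// Hypothetical "kernel body" that rejects negative input.
inline int checkedValue(int value)
{
    if(value < 0)
        DEMO_DEVICE_ERROR("negative input");
    return value;
}
```

On a CPU backend the caller can then wrap the call in a regular `try`/`catch` for `std::runtime_error` and read the message via `what()`.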

Testing:

Issue

fix #2258

fwyzard commented 2 weeks ago

I tried calling abort() from a SYCL kernel running on an Intel GPU, with oneAPI 2024.1, but it doesn't seem to work:

#include <cstdlib>
#include <exception>
#include <iostream>

#include <sycl.hpp>

// Exception handler for asynchronous SYCL exceptions
void exception_handler(sycl::exception_list exceptions) {
    for (const auto& e : exceptions) {
        try {
            std::rethrow_exception(e);
        } catch (const sycl::exception& e) {
            std::cout << "Caught asynchronous SYCL exception: " << e.what() << std::endl;
        }
    }
}

int main() {
    try {
        // Create a SYCL queue with an exception handler
        sycl::queue queue(exception_handler);

        // Submit a command group to the queue
        queue.submit([&](sycl::handler& cgh) {
            // Define the kernel
            cgh.single_task([=]() {
                sycl::ext::oneapi::experimental::printf("%d\n", 1);
                std::abort();
                sycl::ext::oneapi::experimental::printf("%d\n", 2);
            });
        });

        // Wait for the queue to finish executing
        queue.wait_and_throw();
    } catch (sycl::exception const& e) {
        std::cout << "Caught synchronous SYCL exception: " << e.what() << std::endl;
    } catch (std::exception const& e) {
        std::cout << "Caught std exception: " << e.what() << '\n';
    } catch (...) {
        std::cout << "Caught unexpected exception\n";
    }

    return 0;
}

fails at runtime (when the JIT compiler is called) with

Caught synchronous SYCL exception: The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Flex 170':
Module <0x3610e90>:  Unresolved Symbol <abort>
Module <0x3610e90>:  Unresolved Symbol <abort> -999 (Unknown PI error)

I suspect that abort() (or std::abort()) works on the OpenCL CPU backend because it finds the system-level function, but is not a portable feature of SYCL/oneAPI.

mehmetyusufoglu commented 2 weeks ago

@fwyzard Thanks. I have started a CI run for the TagGpuSyclIntel accelerator, and I will start one for TagCpuSycl afterwards. We will see.

I have now tried it on the HAL computer. There are no compile errors, but I got the same result as you at run time.

 spack load cuda@12.2
 spack load intel-oneapi-compilers@2023.1.0
 spack load intel-oneapi-tbb@2021.10.0
 module load rocm-5.4.6
 sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7452 32-Core Processor                 3.0 [2023.15.3.0.20_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.52.28202.52]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28202]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A30 0.0 [CUDA 12.4]
[ext_oneapi_hip:gpu:0] AMD HIP BACKEND, AMD Radeon RX 7900 XTX 0.0 [HIP 50422.80]

COMPILE
 icpx -fsycl test_abort.cpp
RUN
yusufo81@hal8999:~$ ./a.out
Caught synchronous SYCL exception: The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A770 Graphics':
Module <0x39adf80>:  Unresolved Symbol <abort>
Module <0x39adf80>:  Unresolved Symbol <abort> -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)

fwyzard commented 2 weeks ago

I've tried looking into the specification of SYCL and the implementation of oneAPI, and I suspect the only portable solution may be assert(false).

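However:

* on my laptop integrated GPU that actually does not do anything :-(

* on a Flex 170 datacenter GPU it causes the program to abort in a way that cannot be caught.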

I'll bring it up with Intel at the next occasion - but for the time being, that seems better than nothing.

mehmetyusufoglu commented 2 weeks ago

> I've tried looking into the specification of SYCL and the implementation of oneAPI, and I suspect the only portable solution may be assert(false).
>
> However:
>
> * on my laptop integrated GPU that actually does not do anything :-(
>
> * on a Flex 170 datacenter GPU it causes the program to abort in a way that cannot be caught.
>
> I'll bring it up with Intel at the next occasion - but for the time being, that seems better than nothing.

OK, I have added assert(false), thanks a lot. I cannot catch the exception on the HAL computer, so I did not test it in the test code.

1
2
AssertHandler::printMessage
test_abort.cpp:27: auto main()::(anonymous class)::operator()(sycl::handler &)::(anonymous class)::operator()() const: global id: [0,0,0], local id: [0,0,0] Assertion `false` failed.
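
Because a failed device-side assert ends the process instead of raising a catchable exception, one way such behaviour could still be checked is to run the failing code in a child process and inspect how it terminated. The following is a minimal POSIX sketch, assuming the failure ends the process with SIGABRT (not verified for the oneAPI runtime); `diesWithAbort` is a hypothetical helper, not part of alpaka or its test suite.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs `fn` in a forked child process and reports
// whether that child was terminated by SIGABRT.
template<typename Fn>
bool diesWithAbort(Fn&& fn)
{
    pid_t const pid = fork();
    if(pid < 0)
        return false; // fork failed, nothing was tested

    if(pid == 0)
    {
        fn();
        _exit(0); // reached only if fn() returned normally
    }

    // Parent: wait for the child and inspect its termination status.
    int status = 0;
    if(waitpid(pid, &status, 0) != pid)
        return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

For example, `diesWithAbort([] { std::abort(); })` returns true, while a callable that returns normally yields false. With a GPU runtime, the device would presumably have to be initialized inside the child, since many runtimes are not expected to survive a fork.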

fwyzard commented 2 weeks ago

> I cannot catch the exception on the HAL computer, so I did not test it in the test code.

Unrelated to the test, but what is the HAL computer?

mehmetyusufoglu commented 2 weeks ago

> > I cannot catch the exception on the HAL computer, so I did not test it in the test code.
>
> Unrelated to the test, but what is the HAL computer?

An HPC system at HZDR :) It has different kinds of GPUs and is easily configurable.

psychocoderHPC commented 1 week ago

@mehmetyusufoglu please rebase against the develop branch to fix the CI issues.