SYCL Device diagnosis missing

AByzhynar commented 1 year ago

I did not find any option in SYCL to get the status of a device at runtime (health, load, etc.) or to perform SYCL device diagnosis. Is it possible to request this feature, as it is in high demand? Or please provide a link if it already exists.

kevin-harms commented 1 year ago

I agree with this. I think SYCL should expand into some management APIs, since there are common management capabilities like this that users would like abstracted.

keryell commented 1 year ago

The only feedback we have today is that you will get an exception when trying to use the device.

AByzhynar commented 1 year ago

@keryell What kind of exception will we get in this case? I mean, what type of exception? Will it be possible to recognize the device failure from the exception?

keryell commented 1 year ago

Unfortunately, it is not really possible to recognize a device failure in a standard way today. I guess you will get a sycl::exception with a std::error_code of sycl::errc::runtime, sycl::errc::kernel, sycl::errc::memory_allocation or sycl::errc::platform, depending on when the device fails and on the implementation. The implementation probably puts an implementation-defined message in the what() of the exception too. You can also use the interoperability API of SYCL to get the native backend objects behind the scenes and use the backend API to do the device diagnosis.

keryell commented 1 year ago

This is somewhat related to the internal issue https://gitlab.khronos.org/sycl/Specification/-/issues/641

AByzhynar commented 1 year ago

I understand that such an API implementation will be hardware specific. E.g., NVIDIA has nvidia-smi (NVSMI), a cross-platform tool that supports all standard NVIDIA driver-supported Linux distros (see CUDA Best Practices Guide - NVIDIA-SMI). There is also an API to NVSMI with bindings for different languages: NVML. It is quite a good example of what would be nice to have in SYCL.

etomzak commented 1 year ago

I'm very curious about the intended use cases for device diagnostics. Is it intended to be a development tool to help developers understand and improve the behavior of their code on a device? Is it intended to be used in production to allow an application to dynamically respond to the state of the device? Both?

@AByzhynar, would you be able to provide a few specific user stories of how someone might use device diagnostics? I can imagine some, but what I have in mind might be completely different from what you're thinking of.

AByzhynar commented 1 year ago

@etomzak E.g., in safety-critical systems, when you are performing some calculations (kernel execution) and something goes wrong, you need to know how to react: whether to wait or to take corrective measures. Runtime device diagnosis (its health state, its availability, its load, etc.) will give you enough information to do that.

etomzak commented 1 year ago

The question I have is about the types of things that can go wrong, when is it even possible to recover from problems, what the appropriate responses should be, and most importantly, when should the response be within the SYCL runtime/application (as opposed to somewhere else in the system).

For example, if the fault is in the ...

Accelerator hardware -- Is the SYCL runtime/application the best place to recover from an accelerator hardware fault? If the application finds out that there has been an SEU that's been handled by ECC, is it the application's responsibility to respond somehow? If a functional block on the accelerator stops responding, is the SYCL runtime/application the best place to respond to that? Is it a good idea for the SYCL runtime/application to attempt to carry on if part of the hardware becomes unavailable?
SYCL kernel -- Currently, SYCL doesn't provide an error channel from kernel code to host code; i.e., there's no way to if (bad_condition) {return SYCL_ERR;} from a kernel. If a kernel is expected to check for precondition violations and to report errors, then the application developer needs to provide this (e.g., by having the kernel write a special error value in a buffer). Is this issue asking for a sanctioned way for SYCL kernels to return errors? This could be useful, but in general, accelerator kernels are written with a "narrow contract" (input is assumed to be valid), partly because validating input requires a lot of control flow, and accelerators generally aren't good at control flow.
Host CPU -- In this case, there's no hope for the SYCL runtime/application to recover, because the state of the software can't be guaranteed to be correct.
Device driver -- Again, the big question is, if the SYCL application finds out about a device driver fault, what can it do to reliably recover the situation?
SYCL runtime -- If the SYCL application finds out about a problem in the SYCL runtime, then perhaps it makes sense to try to recover from it. The problem here is that SYCL doesn't specify anything like the C++ exception safety guarantees. In general, if the SYCL runtime throws an exception, there are no guarantees about the state of the runtime afterwards. Has the state of the runtime been rolled back? Is the runtime in a different but valid state? Is the runtime in an invalid state? I imagine that it could be valuable for safety-critical systems if this were specified, but it's a lot of work. And there's still the question of, even if an application can recover from a fault, should it attempt to do so? In what use cases can the complexity of correctly diagnosing and recovering from a fault be justified?
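
As a rough illustration of the workaround mentioned under "SYCL kernel" (having the kernel write a special error value in a buffer), here is a minimal host-only C++ sketch. It deliberately uses no SYCL headers: gather_item stands in for a kernel body, run_gather stands in for the host code, and KernelStatus and all other names are hypothetical, invented for this example only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical status codes; SYCL itself defines no kernel-to-host error values.
enum class KernelStatus : std::int32_t { ok = 0, index_out_of_range = 1 };

// Stand-in for the kernel body, called once per work-item i.
// `status` plays the role of a one-element buffer visible to the host.
inline void gather_item(std::size_t i, const std::vector<int>& table,
                        const std::vector<std::size_t>& index,
                        std::vector<int>& out, KernelStatus& status) {
  if (index[i] >= table.size()) {
    // Precondition violated: record the error instead of "returning" it.
    status = KernelStatus::index_out_of_range;
    return;
  }
  out[i] = table[index[i]];
}

// Stand-in for the host side: run every work-item, then read the status back.
inline KernelStatus run_gather(const std::vector<int>& table,
                               const std::vector<std::size_t>& index,
                               std::vector<int>& out) {
  KernelStatus status = KernelStatus::ok;
  out.assign(index.size(), 0);
  for (std::size_t i = 0; i < index.size(); ++i)
    gather_item(i, table, index, out, status);
  return status;
}
```

In a real SYCL program the status word would presumably live in a buffer or USM allocation, and because several work-items may write it concurrently, an atomic update (for example through sycl::atomic_ref) would likely be needed. The host can only inspect the value after the kernel has finished, so this reports errors but does not stop a kernel early.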

TL;DR: I've spent a lot of time thinking about this problem, and so far I haven't been able to come up with a specific example of where better diagnostics would solve a concrete problem. That makes it hard to figure out what better diagnostics should even look like, never mind proposing a SYCL extension or a change to the core spec.

keryell commented 1 year ago

Interesting discussion. In the end, this has to be put into a bigger picture, with some system-level safety design beyond a SYCL-only focus, such as using triple redundancy...
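
To make the triple-redundancy idea concrete, here is a minimal sketch of the voting step only. It assumes the same computation has already been run three times (for example on three devices) and that the results can be compared for exact equality, which may not hold for floating-point output. The helper name vote is hypothetical and is not part of SYCL.

```cpp
#include <optional>

// Majority voter for three redundant results (hypothetical helper).
// Returns the value that at least two replicas agree on,
// or std::nullopt when all three disagree.
template <typename T>
std::optional<T> vote(const T& a, const T& b, const T& c) {
  if (a == b || a == c) return a;  // a is part of a majority
  if (b == c) return b;            // b and c outvote a
  return std::nullopt;             // no two replicas agree
}
```

An empty result means the fault cannot be masked by voting, so the decision on how to react moves up to the system level.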

KhronosGroup / SYCL-Docs

SYCL Device diagnosis missing #348