Open AByzhynar opened 1 year ago
I agree with this. I think SYCL should expand into some management APIs, since there are common management capabilities like this that users would like to have abstracted.
The only feedback we have today is that you will get an exception when trying to use the device.
@keryell What kind of exception will we get in this case? I mean, what type of exception? Will it be possible to recognize the device failure from the exception?
It is not really possible to recognize the device failure in a standard way today unfortunately.
I guess you will get a sycl::exception with a std::error_code of sycl::errc::runtime, sycl::errc::kernel, sycl::errc::memory_allocation or sycl::errc::platform, depending on when the device fails or on the implementation. The implementation probably puts an implementation-defined message in the what() of the exception too.
You can also use the interoperability API of SYCL to get the native backend objects behind the scenes and use the backend API to do the device diagnosis.
This is somewhat related to the internal issue https://gitlab.khronos.org/sycl/Specification/-/issues/641
I understand that such an API implementation will be hardware-specific. E.g. Nvidia has nvidia-smi: "NVSMI is a cross-platform tool that supports all standard NVIDIA driver-supported Linux distros" (CUDA Best Practices Guide - NVIDIA-SMI). There is also an API to NVSMI with different language bindings: NVML. But it is quite a good example of what would be nice to have in SYCL.
I'm very curious about the intended use cases for device diagnostics. Is it intended to be a development tool to help developers understand and improve the behavior of their code on a device? Is it intended to be used in production to allow an application to dynamically respond to the state of the device? Both?
@AByzhynar, would you be able to provide a few specific user stories of how someone might use device diagnostics? I can imagine some, but what I have in mind might be completely different from what you're thinking of.
@etomzak E.g. in safety-critical systems, when you are performing some calculations (kernel execution) and something goes wrong, you need to know how to react: whether to wait or to take corrective measures. Runtime device diagnosis (its health state, its availability, its load, etc.) will give you enough information to do that.
The question I have is about the types of things that can go wrong, when is it even possible to recover from problems, what the appropriate responses should be, and most importantly, when should the response be within the SYCL runtime/application (as opposed to somewhere else in the system).
For example, if the fault is in the ...
if (bad_condition) {return SYCL_ERR;}
from a kernel. If a kernel is expected to check for precondition violations and to report errors, then the application developer needs to provide this (e.g., by having the kernel write a special error value in a buffer). Is this issue asking for a sanctioned way for SYCL kernels to return errors? This could be useful, but in general, accelerator kernels are written with a "narrow contract" (input is assumed to be valid), partly because validating input requires a lot of control flow, and accelerators generally aren't good at control flow.

TL;DR: I've spent a lot of time thinking about this problem, and so far I haven't been able to come up with a specific example of where better diagnostics would solve a concrete problem. That makes it hard to figure out what better diagnostics should even look like, never mind proposing a SYCL extension or a change to the core spec.
Interesting discussion. In the end, this has to be put into a bigger picture, with some system-level safety design beyond the SYCL-only focus, such as using triple redundancy...
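To make the triple-redundancy idea concrete, here is a minimal sketch of a majority voter over three redundant results. The name `majority_vote` is hypothetical, and the sketch assumes the three results come from independent executions (for example on different devices) and can be compared exactly. Floating-point results would probably need a tolerance-based comparison instead.

```cpp
#include <optional>

// Hypothetical voter for triple redundancy: returns the value that at least
// two of the three redundant results agree on, or std::nullopt if all three
// disagree and no majority exists.
template <typename T>
std::optional<T> majority_vote(const T& a, const T& b, const T& c) {
  if (a == b || a == c) return a;
  if (b == c) return b;
  return std::nullopt;
}
```

With such a voter, a single faulty result is masked without any device diagnostics at all, which is why this kind of decision tends to live at the system level rather than inside the SYCL runtime.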
I did not find any option in SYCL to get the status of a device at runtime (health, load, etc.) or to perform SYCL device diagnosis. Is it possible to request this feature, as it is in high demand? Or please provide a link if it is already present.