KhronosGroup / SYCL-Docs

SYCL Open Source Specification

SYCL Device diagnosis missing #348

Open AByzhynar opened 1 year ago

AByzhynar commented 1 year ago

I could not find any way in SYCL to query the status of a device at runtime (health, load, etc.) or to perform SYCL device diagnosis. Would it be possible to request this feature? It is in high demand. If it is already present, please provide a link.

kevin-harms commented 1 year ago

I agree with this. I think SYCL should expand into some management APIs, since there are common management capabilities like this that users would like abstracted.

keryell commented 1 year ago

The only feedback we have today is that you will get an exception when trying to use the device.

AByzhynar commented 1 year ago

@keryell What type of exception would we get in this case? And would it be possible to recognize a device failure from the exception?

keryell commented 1 year ago

Unfortunately, it is not really possible to recognize a device failure in a standard way today. I guess you will get a sycl::exception with a std::error_code of sycl::errc::runtime, sycl::errc::kernel, sycl::errc::memory_allocation or sycl::errc::platform, depending on when the device fails and on the implementation. The implementation probably puts an implementation-defined message in the what() of the exception too. You can also use the SYCL interoperability API to get the native backend objects behind the scenes and use the backend API to do the device diagnosis.
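
As a rough illustration of the error-code approach described above, here is a minimal sketch of how an application might map those codes to a coarse reaction. To keep it buildable without a SYCL toolchain, it uses a stand-in enum and category (`demo_errc`, `demo_category`, both hypothetical) in place of `sycl::errc` and the SYCL error category; `reaction` and `classify` are also illustrative names, not part of SYCL. The codes cannot prove a device failure, so the sketch only labels them as a possible device fault.

```cpp
#include <initializer_list>
#include <string>
#include <system_error>

// Stand-in for sycl::errc so the sketch builds without a SYCL toolchain.
// Hypothetical: only the enumerators named in this thread plus "invalid";
// the numeric values are arbitrary.
enum class demo_errc { success = 0, runtime, kernel, memory_allocation, platform, invalid };

// Minimal error category, playing the role of the SYCL error category.
class demo_category_t final : public std::error_category {
public:
    const char* name() const noexcept override { return "demo_sycl"; }
    std::string message(int) const override { return "implementation-defined"; }
};

inline const std::error_category& demo_category() {
    static const demo_category_t instance;
    return instance;
}

inline std::error_code make_error_code(demo_errc e) {
    return std::error_code(static_cast<int>(e), demo_category());
}

// Coarse reaction an application could derive from an error code.
enum class reaction { none, maybe_device_fault, other_error };

// Treat the four codes mentioned in the thread as *possible* device faults.
// This is a heuristic: the same codes can be raised for other reasons.
inline reaction classify(const std::error_code& ec) {
    if (!ec) return reaction::none;  // value 0 means "no error"
    for (demo_errc e : {demo_errc::runtime, demo_errc::kernel,
                        demo_errc::memory_allocation, demo_errc::platform}) {
        if (ec == make_error_code(e)) return reaction::maybe_device_fault;
    }
    return reaction::other_error;
}
```

With a real SYCL implementation, the argument would be the `code()` of a caught `sycl::exception`, compared against `sycl::errc` values, and the `what()` message could be logged alongside the result.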

keryell commented 1 year ago

This is somewhat related to the internal issue https://gitlab.khronos.org/sycl/Specification/-/issues/641

AByzhynar commented 1 year ago

I understand that an implementation of such an API will be hardware specific. E.g. Nvidia has nvidia-smi: "NVSMI is a cross-platform tool that supports all standard NVIDIA driver-supported Linux distros" (CUDA Best Practices Guide, NVIDIA-SMI). There is also an API to NVSMI with bindings for different languages: NVML. It is quite a good example of what would be nice to have in SYCL.

etomzak commented 1 year ago

I'm very curious about the intended use cases for device diagnostics. Is it intended to be a development tool to help developers understand and improve the behavior of their code on a device? Is it intended to be used in production to allow an application to dynamically respond to the state of the device? Both?

@AByzhynar, would you be able to provide a few specific user stories of how someone might use device diagnostics? I can imagine some, but what I have in mind might be completely different from what you're thinking of.

AByzhynar commented 1 year ago

@etomzak E.g. in safety-critical systems, when you are performing some calculations (kernel execution) and something goes wrong, you need to know how to react: whether to wait or to take corrective measures. Runtime device diagnosis (health state, availability, load, etc.) would give you enough information to do that.

etomzak commented 1 year ago

The questions I have are about the types of things that can go wrong, when it is even possible to recover from problems, what the appropriate responses should be, and, most importantly, when the response should be within the SYCL runtime/application (as opposed to somewhere else in the system).

For example, if the fault is in the ...

TL;DR: I've spent a lot of time thinking about this problem, and so far I haven't been able to come up with a specific example of where better diagnostics would solve a concrete problem. That makes it hard to figure out what better diagnostics should even look like, never mind proposing a SYCL extension or a change to the core spec.

keryell commented 1 year ago

Interesting discussion. In the end, this has to be put in a bigger picture, with some system-level safety design beyond the SYCL-only focus, such as using triple redundancy...
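
To make the triple-redundancy idea concrete, here is a minimal sketch of a majority voter over three redundant results. The name `majority_vote` is hypothetical, and how the three results are produced (three devices, three runs, or a mix of device and host) is a system-level decision left out of the sketch. It assumes results can be compared exactly with `==`; floating-point results would likely need a tolerance instead.

```cpp
#include <optional>

// Majority vote over three redundant results.
// Returns the value at least two inputs agree on, or std::nullopt if all
// three differ, in which case the caller has to escalate (retry, fail over,
// or enter a safe state).
template <typename T>
std::optional<T> majority_vote(const T& a, const T& b, const T& c) {
    if (a == b || a == c) return a;
    if (b == c) return b;
    return std::nullopt;
}
```

A voter like this masks a single faulty result without needing to know why it was faulty, which is one reason the diagnosis question can end up outside the SYCL runtime.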