For a sub-device's kernel_bundle, should it contain the sub-device in the context or is root-device enough?

KhronosGroup / SYCL-Docs

SYCL Open Source Specification

Other

113 stars 68 forks source link

For a sub-device's kernel_bundle, should it contain the sub-device in the context or is root-device enough? #313

Open aelovikov-intel opened 2 years ago

aelovikov-intel commented 2 years ago

One of the queues constructors contains the following:

This device must either be contained by syclContext or it must be a descendent device of some device that is contained by that context

However, it's not clear from the spec if the same is applicable for kernel_bundles. The closest I found is

4.11.1. Overview A kernel bundle is a high-level abstraction which represents a set of kernels that are associated with a context and can be executed on a number of devices, where each device is associated with that same context

Does "associated with that same context" include sub-devices of the devices in the context? If so, where is that term formally defined?

TApplencourt commented 2 years ago

It makes sense to me that a kernel bundle for a root device, should be able to run on the subdevices (to be coherent with the queue constructor). If people agree, should we rephrase it to something like:

A kernel bundle is a high-level abstraction which represents a set of kernels that are associated with a context and can be executed in all device and descendent devices associated with this context.

steffenlarsen commented 2 years ago

It makes sense to me that a kernel bundle for a root device, should be able to run on the subdevices (to be coherent with the queue constructor). If people agree, should we rephrase it to something like:

A kernel bundle is a high-level abstraction which represents a set of kernels that are associated with a context and can be executed in all device and descendent devices associated with this context.

I worry that it may be read as the descendants being actually associated with the context, rather than just descendants of devices associated with the context. Maybe just making the distinction clearer with something like;

... can be executed on all device associated with this context and on all descendant devices thereof.

gmlueck commented 2 years ago

It makes sense to me that a kernel bundle for a root device, should be able to run on the subdevices (to be coherent with the queue constructor).

This was not my intent when I wrote this part of the spec. I was thinking that the application needed to get a kernel bundle for the device on which they wanted to run the kernel, even if that device is a sub-device. My rationale was that an implementation might have a root device that is composed of two sub-devices that have different ISAs. This might be the case, for example, in an embedded system with a specialized accelerator.

I think @Pennycook has also made a case for when the compiler might generate different code for a root device vs. a sub-device. (I think this had something to do with the case where the root device and the sub-device have visibility to different hierarchies in the cache, but I'd need John to refresh my memory.)

Note, however, that it is still legal to get a kernel bundle for a sub-device without creating a new context. For example, see this description of get_kernel_bundle:

Throws:

An exception with the errc::invalid error code if any of the devices in devs is not one of devices contained by the context ctxt or is not a descendent device of some device in ctxt.

Despite all of this, I still think that the implementation might not need to recompile a kernel bundle when the application asks for a version of the kernel bundle for a sub-device. If the implementation knows that the same code can be used for both the root device and the sub-devices, it can simply cache the device image from the root device and then reuse it for any sub-devices.

I do admit, though, that this API might be inconvenient for an application that dynamically creates sub-devices and also uses the kernel bundle API. Such an application would need to get a new kernel bundle each time they created a new sub-device. If we feel this is too inconvenient, we could discuss changing the spec to provide a guarantee that a kernel bundle obtained for a root device is automatically usable on any sub-devices. I think my case of different ISAs is probably unusual. However, @Pennycook's rationale might be a good reason not to change the spec.

Pennycook commented 2 years ago

I think @Pennycook has also made a case for when the compiler might generate different code for a root device vs. a sub-device. (I think this had something to do with the case where the root device and the sub-device have visibility to different hierarchies in the cache, but I'd need John to refresh my memory.)

That's right. There are parts of the SYCL (and OpenCL) memory models that are defined in terms of "the device"; with an appropriate scope, atomics only have to protect against race conditions within "the device" and fences only have to guarantee visibility within "the device". A compiler targeting a sub-device may choose to alter/elide certain atomics or fences as an optimization, and the resulting binary wouldn't necessarily run correctly on the root device.

I do admit, though, that this API might be inconvenient for an application that dynamically creates sub-devices and also uses the kernel bundle API. Such an application would need to get a new kernel bundle each time they created a new sub-device. If we feel this is too inconvenient, we could discuss changing the spec to provide a guarantee that a kernel bundle obtained for a root device is automatically usable on any sub-devices. I think my case of different ISAs is probably unusual. However, @Pennycook's rationale might be a good reason not to change the spec.

Is there currently any restriction that a kernel bundle must contain exactly one "binary" per device? Is "binary" well-defined in that case?

If it were legal to have more than one binary (or a fat binary) per device, I think it would be possible to have the desired semantics without relinquishing the current optimization opportunities, by saying:

A kernel bundle compiled for the root device must be compatible with the sub-device
A kernel bundle compiled for the sub-device is not necessarily compatible with the root device
A kernel bundle compiled for the root device may contain multiple "binaries" for a single kernel (or may be a fat binary containing code for each sub-device).

In the case I was talking about, a compiler that didn't optimize for the sub-device case would have the choice to produce a single unoptimized binary that worked across the root and all sub-devices, or a fat binary of optimized and unoptimized kernels; in the mixed ISA case, the compiler could compile different binaries for each ISA and put them in the same bundle.

aelovikov-intel commented 2 years ago

@Pennycook , for your suggestion, can/should we restrict it such that compilation for sub-device and for the root device results in the same performance when running on that particular sub-device?

gmlueck commented 2 years ago

@Pennycook

Yes, it is definitely OK for a kernel bundle to contain multiple device images, where there is a different image for each device. In fact, this is quite normal. Your proposal is definitely workable, but it means that an implementation may need to do extra compilations that the application won't need.

Consider a case where the application compiled a kernel bundle for a root device. Assuming the implementation wants to optimize when possible, it would need to compile multiple versions of the bundle: one for the root device and another for the sub-device (and maybe more for sub-sub-devices, etc.) This effort will be wasted if the application doesn't ever try to run the bundle on a sub-device. Of course, the implementation can't look into the future, so it has to compile all versions that might be needed.

Talking through this scenario makes me think that we should not change the current intent of the spec.

Pennycook commented 2 years ago

@Pennycook , for your suggestion, can/should we restrict it such that compilation for sub-device and for the root device results in the same performance when running on that particular sub-device?

@aelovikov-intel: I wasn't imagining that we'd make any performance guarantees, no. I imagined the opposite: it would be valid for an implementation to use the device image from the root device to cover all sub-devices, and in practice developers would have to opt-in to compiling for a sub-device if they wanted to be sure the code was optimized for their use-case.

This effort will be wasted if the application doesn't ever try to run the bundle on a sub-device. Of course, the implementation can't look into the future, so it has to compile all versions that might be needed.

@gmlueck: I completely agree. The compilation cost would be prohibitively expensive if a device had to consider all possible configurations.

My assumption was that in the performance case, the user would be deliberately creating sub-devices as an optimization, and so would be open to creating kernel bundles specifically for those sub-devices too. If an implementation decided to optimize, it would probably need some heuristics to decide which kernels were worth optimizing and/or which sub-device configurations were popular. In practice, I wouldn't expect many implementations to optimize for sub-devices, just because of the complexity.

But in the mixed ISA case, the kernel bundle containing both device images seems necessary to me, because if a kernel was submitted to a mixed ISA root device then the user would have no way of knowing (or controlling) which ISA(s) were used.

To be clear: I'm happy to leave the specification as it is, but wanted to provide further details on the proposal I made!

gmlueck commented 2 years ago

I imagined the opposite: it would be valid for an implementation to use the device image from the root device to cover all sub-devices, and in practice developers would have to opt-in to compiling for a sub-device if they wanted to be sure the code was optimized for their use-case.

This seems to me like it would lead to confusion. In this model, it would be legal for an application to compile a bundle for the root device and then run on a sub-device. However, the expectation is that this would lead to reduced performance on many implementations because the version of the kernel that is compiled for the root device might not be optimal for the sub-device. My guess is that many people would end up running with reduced performance and not realizing it.

Since kernel bundles are an advanced feature, l think customers would prefer an API that guides them to the optimal solution. Wouldn't it be better if the implementation gave an error in the case that was potentially non-optimal (i.e. compiling for the root device and running with the sub-device)?

But in the mixed ISA case, the kernel bundle containing both device images seems necessary to me, because if a kernel was submitted to a mixed ISA root device then the user would have no way of knowing (or controlling) which ISA(s) were used.

I agree with this. If an implementation exposed a "mega-device" with two different ISA cores, then compiling a bundle for that mega-device might require compiling for both cores. I think it would depend on how the submission logic works. If the submission logic could run the kernel on either core, then the bundle would need to contain a device image for each core. If the submission logic could only run certain kernels on certain cores, then the bundle would only need to contain a device image for the cores on which the kernel could potentially run.

If vendor decides to expose a multi-ISA mega-device like this, it would need to think through these issues.

TApplencourt commented 2 years ago

Since kernel bundles are an advanced feature, l think customers would prefer an API that guides them to the optimal solution

Users will face this issue when using libraries that use kernel bundle (as the issue linked shows). I'm under the impression that we don't change the working in the spec; the libraries will need to generate one kernel bundle per device/subdevice/... just to be flexible.

tomdeakin commented 2 years ago

Can we allow both via fallback to root? Is this implementation detail vs guarantee? sub-device part of a device, not a new device. FPGA sub-device might not be translatable.

Most flexibility if say kernel bundle needs to be created with sub-device in the list. Implementations could just reuse if the could. Dynamically creating sub-devices with kernel bundle that was created in advance is challenging with that approach.

Need discussion/input from kernel bundle users.

gmlueck commented 2 years ago

Note to myself: I took the AR to follow up with the oneDPL team that reported the bug listed at the top of this issue.