@tqchen @leofang @rgommers Please provide feedback
Thanks, @oleksandr-pavlyk. One very high level (and perhaps naive, since I am not familiar with oneAPI) question: My understanding is that oneAPI can target non-Intel devices, such as NVIDIA or AMD GPUs. What happens, then, when a oneAPI-based library targeting CUDA/HIP/OpenCL exchanges data with another library implemented natively on one of these devices?
Say I have kDLONEAPI_GPU --> kDLCUDA. Does an importer handle this in their DLPack implementation? Or is there a runtime flag somewhere that we can check to see if two libraries are actually running on an NVIDIA GPU? Or do we only allow exchanges between two oneAPI libraries (maybe context sharing is not trivial in oneAPI)?
I do mean to try this scenario out, hopefully later this week, so for now I will be speaking hypothetically.
The implementation of the SYCL API is provided by a backend. The open-source LLVM SYCL compiler has such a CUDA backend.
In principle, SYCL entities (such as a SYCL device or a SYCL context) allow one to retrieve the corresponding native object stored by the backend, using backend interoperability. This means, however, that for an importer, say cupy, to be able to import CUDA memory exported with device_type=kDLONEAPI_GPU, one needs the oneAPI runtime to check whether the backend is sycl::backend::cuda and to get the appropriate CUDART objects.
One would also need the oneAPI runtime to check the type of the USM allocation, since DLPack uses different device codes for USM-device allocations (kDLGPU) vs. USM-shared allocations (I surmise this should correspond to kDLCUDAManaged).
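A minimal sketch of such importer-side checks (the helper name map_to_cuda_device_type is hypothetical and not part of this PR), assuming the importer has already reconstructed the exporter's sycl::device and sycl::context:

#include <CL/sycl.hpp>
#include <dlpack/dlpack.h>
#include <stdexcept>

// Hypothetical helper: decide which DLPack device code a CUDA-backed USM pointer maps to.
// `dev` and `ctx` are assumed to be the exporter's SYCL device and context.
inline DLDeviceType map_to_cuda_device_type(const void *usm_ptr,
                                            const sycl::device &dev,
                                            const sycl::context &ctx) {
    if (dev.get_backend() != sycl::backend::cuda)
        throw std::runtime_error("not a CUDA-backed SYCL device");
    switch (sycl::get_pointer_type(usm_ptr, ctx)) {
        case sycl::usm::alloc::device:
            return kDLCUDA;        // USM-device allocation
        case sycl::usm::alloc::shared:
            return kDLCUDAManaged; // USM-shared allocation
        default:
            throw std::runtime_error("host/unknown USM type has no direct CUDA analogue");
    }
}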
@tqchen The real purpose of this PR is to extend the DLDeviceType enum in dlpack/include/dlpack.h.
The app was only added to demonstrate feasibility. The POC code relies on the platform default context extension implementation, which is implemented in open-source intel/llvm, but is not part of oneAPI 2021.3 or oneAPI 2021.4 yet.
An example of building dpctl with the open-source LLVM SYCL bundle can be found here. dpctl has just enabled use of this default context in IntelPython/dpctl#627, so the POC should work now.
Please let me know if I should remove the code from apps/ for this PR to move ahead.
@oleksandr-pavlyk indeed it would be helpful to separate it out so folks can focus on the standard itself
I force-pushed, removing additions to the apps/ folder.
Note that SYCL host devices are not represented in the DLDeviceType enum, but kDLCPU can be safely used for host devices.
cc @leofang @csullivan please take another look
I see that the device, and therefore the device aspect (type: cpu/gpu/accelerator), can be queried from a USM allocation. @oleksandr-pavlyk, could it be sufficient to then only introduce kDLOneAPI?
I started wondering along this track based on your above comment about kDLCPU, as it does seem strange to have both it and kDLOneAPI_CPU. If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry in the DLDeviceType enum could suffice.
> I force-pushed, removing additions to the apps/ folder.

I pushed the removed changes to a branch in my fork (https://github.com/oleksandr-pavlyk/dlpack/tree/app-from-usm-ndarray).
> could it be sufficient to then only introduce kDLOneAPI?
Good point. Formulating the queries requires a sycl::context. The DLPack importer must reconstruct this context to get a copy of the same context that the exporter used. Queries made against a different context (even one addressing the same device) may not come back with the expected results. For example:
#include <CL/sycl.hpp>
#include <cassert>

int main(void) {
    sycl::device d( sycl::default_selector{} );   // create device
    sycl::context ctx1(d);                        // two distinct contexts for the same device
    sycl::context ctx2(d);
    double *p = sycl::malloc_device<double>(1024, d, ctx2); // USM allocation bound to ctx2
    // Querying the pointer against the "wrong" context (ctx1) need not report
    // the allocation type the exporter expects.
    sycl::usm::alloc allocation_type = sycl::get_pointer_type(p, ctx1);
    assert( allocation_type == sycl::usm::alloc::device );
    sycl::free(p, ctx2);
    return 0;
}
Now, compiling and running this:
$ dpcpp a.cpp -o a.out
$ SYCL_DEVICE_FILTER=opencl:gpu ./a.out
a.out: a.cpp:13: int main(): Assertion `allocation_type == sycl::usm::alloc::device' failed.
Aborted (core dumped)
Identifying the common context is possible with oneAPI's SYCL extensions. The first relevant extension is the filter selector, which maps a device_id to an actual root (unpartitioned) sycl::device. The root devices are created by the DPC++ runtime; all sycl::device instances are references to these singletons.
The next ingredient is the platform default context, which provides a canonical context to associate with any unpartitioned device. If both the DLPack exporter and the DLPack importer use this context, the USM allocation created by the exporter can be accessed by the importer.
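A minimal sketch of obtaining that canonical context, assuming the intel/llvm default-context extension that exposes sycl::platform::ext_oneapi_get_default_context() (not yet part of oneAPI 2021.3/2021.4, as noted above):

#include <CL/sycl.hpp>

// Return the platform default context for a root device, so that exporter and
// importer resolve to the same sycl::context for the same device.
inline sycl::context get_canonical_context(const sycl::device &root_dev) {
    return root_dev.get_platform().ext_oneapi_get_default_context();
}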
Now, answering your specific question, one can in fact use just one additional enum entry, kDLOneAPI.
// Create a root sycl::device from a DLPack device_id via the oneAPI filter-selector extension.
inline sycl::device get_sycl_device_by_device_id(unsigned int device_id) {
    return sycl::device( sycl::ext::oneapi::filter_selector{ std::to_string(device_id) } );
}
Mapping from a root device to a device_id is also well-defined, since sycl::get_devices() provides a stable ordering on the same platform and, as relevant to DLPack, in the same process.
In [1]: import dpctl
In [2]: dpctl.SyclDevice().get_filter_string()
Out[2]: 'level_zero:gpu:0'
In [3]: dpctl.SyclDevice().get_filter_string(include_backend=False, include_device_type=False)
Out[3]: '6'
In [4]: dpctl.get_devices()[6] == dpctl.SyclDevice()
Out[4]: True
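A C++ counterpart of the reverse mapping shown in the dpctl session above might look like the sketch below; it assumes the same stable ordering from sycl::device::get_devices() on both sides of the exchange:

#include <CL/sycl.hpp>
#include <cstddef>
#include <vector>

// Recover the DLPack device_id as the position of a root device in the
// runtime's device ordering; returns -1 if `dev` is not among the root devices.
inline int get_device_id(const sycl::device &dev) {
    const std::vector<sycl::device> devices = sycl::device::get_devices();
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i] == dev)
            return static_cast<int>(i);
    }
    return -1;
}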
DLPack cannot be used to hand off USM allocations created on sub-devices, or bound to non-canonical contexts.
> If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry in the DLDeviceType enum could suffice.
My suggestion to add 3 enum entries was aiming to keep the device type explicit, but since DLPack is not being directly used by users, I am fine with using only 1 enum.
Interesting, thanks for the detailed explanation @oleksandr-pavlyk. So, IIUC, if I can access a CPU, an Intel GPU, an NVIDIA GPU, and an FPGA in the same process, sycl::get_devices() can assign a unique device ID to each of these devices?
Another question with regard to how oneAPI is intended to use DLPack: Do you require manager_ctx to hold a pointer to a sycl::context, or is the look-up via the extension, as you suggested, sufficient?
@oleksandr-pavlyk Thank you for the detailed follow up on this. Given the flexibility to derive the sycl::device from the device id as you've shown, my preference would be to introduce the single DLDeviceType::kDLOneAPI as the initial OneAPI support in DLPack.
One more question for @oleksandr-pavlyk: Any chance you have verified the statements in https://github.com/dmlc/dlpack/pull/78#issuecomment-927953217?
I guess ultimately, to decide whether we wanna keep 1 enum vs 3 enums, we need to know:
- How would kDLONEAPI_CPU interact with kDLCPU?
- How would kDLONEAPI_GPU interact with kDLCUDA, kDLROCM and their host/managed counterparts, if any of them is used as the SYCL backend?
- How would kDLONEAPI_ACCELERATOR interact with kDLOpenCL?
The same question also applies to the unified kDLONEAPI.
Based on my reading, it seems to me it's easiest if we don't consider the interaction between oneAPI and other, non-oneAPI-based frameworks (because the oneAPI runtime is needed for the look-up). If that's the case, then indeed a single kDLONEAPI would be sufficient. But it's best to consider all the possibilities before moving forward.
The upside of keeping just DLDeviceType::kDLOneAPI is simplicity. The downside is that the oneAPI runtime is required to query the device type (is it a CPU device, or a GPU device?).
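A minimal sketch of that runtime query (assuming dev has been reconstructed from the DLPack device_id as above):

#include <CL/sycl.hpp>

// Ask the SYCL runtime whether the device behind a kDLOneAPI tensor is a GPU.
inline bool is_gpu_device(const sycl::device &dev) {
    return dev.get_info<sycl::info::device::device_type>() ==
           sycl::info::device_type::gpu;
}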
USM allocations made on a kDLONEAPI_CPU SYCL device are accessible by any host application, similarly to kDLCPU allocations, but they also allow for synchronizations if the exporter and the importer both use the oneAPI runtime.
Allocations made on DLDeviceType::kDLOneAPI devices are USM-based; hence they do not interoperate with kDLOpenCL allocations, which pass cl_mem objects (akin to sycl::buffer) rather than pointers.
If the oneAPI device has a CUDA backend (sycl::device::get_backend() == sycl::backend::cuda), one can use the USM pointer in CUDA RT functions, provided that the CUDA context stored in the sycl::context to which the USM allocation is bound, retrievable via sycl::get_native<sycl::backend::cuda>(ctx), is the CUDA device's primary context for this device (the one returned by cuCtxGetCurrent). If not, the importer may need to use cuCtxPushCurrent to make it current.
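A sketch of that importer-side step, assuming the experimental CUDA interop of the open-source DPC++ of that era, where sycl::get_native<sycl::backend::cuda>(ctx) yields a CUcontext (newer releases may expose this differently):

#include <CL/sycl.hpp>
#include <cuda.h>

// Make the CUcontext that owns the USM allocation current before calling CUDA APIs.
inline void make_usm_context_current(const sycl::context &ctx) {
    CUcontext native = sycl::get_native<sycl::backend::cuda>(ctx);
    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    if (current != native)
        cuCtxPushCurrent(native);
}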
@leofang I have expanded the comment in two ways: 1. to indicate that DLPack is sharing a USM allocation, 2. to note that a oneAPI runtime call is required to learn more about the device type as well as the USM allocation type.
Just for my own curiosity, if we store a sycl::context pointer in the DLManagedTensor.manager_ctx field, would it allow us to bypass oneAPI runtime calls when interfacing with CUDA/HIP?
I do not think so. The runtime is still needed to retrieve the native object, sycl::get_native<sycl::backend::cuda>(sycl_ctx).
@tqchen I believe we're good to go? 🙂
Thanks everyone, this is merged!
This PR proposes to extend the DLDeviceType enum with 3 new entries: kDLONEAPI_GPU, kDLONEAPI_CPU, and kDLONEAPI_ACCELERATOR.

This adds DLPack support for oneAPI SYCL root devices, addressable with a filter selector, e.g. sycl::ext::oneapi::filter_selector("gpu:device_id") for kDLONEAPI_GPU devices.

Two parties wishing to zero-copy exchange USM allocations using DLPack need to bind their allocations to the default platform context (implicitly used by the sycl::queue(dev) constructor), ensuring that both parties use the same sycl::context associated with the agreed-upon SYCL device, and thus that USM allocations made by one party are accessible to the other.

An application, apps/from_usm_ndarray, is included in this PR to demonstrate a working prototype, compiled with DPC++ 2021.3 or the open-source LLVM SYCL compiler release 2021-07.
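For illustration, the originally proposed DLDeviceType extension would have looked roughly like the sketch below (the discussion above converged on a single kDLOneAPI entry instead; the new numeric values shown here are placeholders, not the merged ones):

/* Sketch only; not the merged dlpack.h. */
typedef enum {
  /* ... existing entries such as kDLCPU = 1, kDLCUDA = 2, kDLOpenCL = 4, ... */
  kDLONEAPI_CPU = 14,
  kDLONEAPI_GPU = 15,
  kDLONEAPI_ACCELERATOR = 16
} DLDeviceType;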