@tqchen @leofang @rgommers Please provide feedback
Thanks, @oleksandr-pavlyk. One very high level (and perhaps naive, since I am not familiar with oneAPI) question: My understanding is that oneAPI can target non-Intel devices, such as NVIDIA or AMD GPUs. What happens, then, when a oneAPI-based library targeting CUDA/HIP/OpenCL exchanges data with another library implemented natively on one of these devices?
Say I have kDLONEAPI_GPU --> kDLCUDA. Does an importer handle this in their DLPack implementation? Or is there a runtime flag somewhere that we can check to see if two libraries are actually running on an NVIDIA GPU? Or do we only allow exchanges between two oneAPI libraries (maybe context sharing is not trivial in oneAPI)?
I do mean to try this scenario out, hopefully later this week, so for now I will be speaking hypothetically.
The implementation of the SYCL API is provided by a backend. The open-source LLVM SYCL compiler has such a CUDA backend.
In principle, SYCL entities (such as a SYCL device or a SYCL context) allow one to retrieve the corresponding native object stored by the backend, using backend interoperability. This means, however, that for an importer, say cupy, to be able to import CUDA memory exported with device_type=kDLONEAPI_GPU, one needs the oneAPI runtime to check whether the backend is sycl::backend::cuda and to get the appropriate CUDART objects.
One would also need the oneAPI runtime to check the type of the USM allocation, since DLPack uses different device codes for USM-device allocations (kDLGPU) vs. USM-shared allocations (I surmise this should correspond to kDLCUDAManaged).
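A minimal sketch of such importer-side checks (the helper name map_to_cuda_device_type is hypothetical and not part of this PR), assuming the importer has already reconstructed the exporter's sycl::device and sycl::context:

#include <CL/sycl.hpp>
#include <dlpack/dlpack.h>
#include <stdexcept>

// Hypothetical helper: decide which DLPack device code a CUDA-backed USM pointer maps to.
// `dev` and `ctx` are assumed to be the exporter's SYCL device and context.
inline DLDeviceType map_to_cuda_device_type(const void *usm_ptr,
                                            const sycl::device &dev,
                                            const sycl::context &ctx) {
    if (dev.get_backend() != sycl::backend::cuda)
        throw std::runtime_error("not a CUDA-backed SYCL device");
    switch (sycl::get_pointer_type(usm_ptr, ctx)) {
        case sycl::usm::alloc::device:
            return kDLCUDA;        // USM-device allocation
        case sycl::usm::alloc::shared:
            return kDLCUDAManaged; // USM-shared allocation
        default:
            throw std::runtime_error("host/unknown USM type has no direct CUDA analogue");
    }
}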
@tqchen The real purpose of this PR is to extend the DLDeviceType enum in dlpack/include/dlpack.h.
The app was only added to demonstrate feasibility. The POC code relies on the platform default context extension implementation, which is implemented in open-source intel/llvm, but is not part of oneAPI 2021.3 or oneAPI 2021.4 yet.
An example of building dpctl with the open-source LLVM SYCL bundle can be found here. dpctl has just enabled use of this default context in IntelPython/dpctl#627, so the POC should work now.
Please let me know if I should remove the code from apps/ for this PR to move ahead.
@oleksandr-pavlyk indeed it would be helpful to separate it out so folks can focus on the standard itself
I force-pushed, removing additions to the apps/ folder.
Note that SYCL host devices are not represented in the DLDeviceType enum, but kDLCPU can be safely used for host devices.
cc @leofang @csullivan please take another look
I see that the device, and therefore the device aspect (type: cpu/gpu/accelerator), can be queried from a USM allocation. @oleksandr-pavlyk, could it be sufficient to then only introduce kDLOneAPI?
I started wondering along this track based on your above comment about kDLCPU, as it does seem strange to have both it and kDLOneAPI_CPU. If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry in the DLDeviceType enum could suffice.
> I force-pushed, removing additions to the apps/ folder.

I pushed the removed changes to a branch in my fork (https://github.com/oleksandr-pavlyk/dlpack/tree/app-from-usm-ndarray).
> could it be sufficient to then only introduce kDLOneAPI?
Good point. Formulating the queries requires a sycl::context. The DLPack importer must reconstruct this context to get a copy of the same context that the exporter used. Queries made against a different context (even one addressing the same device) may not come back with the expected results. For example:
#include <CL/sycl.hpp>
#include <cassert>

int main(void) {
    sycl::device d( sycl::default_selector{} );   // create device
    sycl::context ctx1(d);                        // two distinct contexts for the same device
    sycl::context ctx2(d);
    double *p = sycl::malloc_device<double>(1024, d, ctx2); // USM allocation bound to ctx2
    // Querying the pointer against the "wrong" context (ctx1) need not report
    // the allocation type the exporter expects.
    sycl::usm::alloc allocation_type = sycl::get_pointer_type(p, ctx1);
    assert( allocation_type == sycl::usm::alloc::device );
    sycl::free(p, ctx2);
    return 0;
}
Now, compiling and running this:
$ dpcpp a.cpp -o a.out
$ SYCL_DEVICE_FILTER=opencl:gpu ./a.out
a.out: a.cpp:13: int main(): Assertion `allocation_type == sycl::usm::alloc::device' failed.
Aborted (core dumped)
Identifying the common context is possible with oneAPI's SYCL extensions. The first relevant extension is the filter selector, which maps a device_id to an actual root (unpartitioned) sycl::device. The root devices are created by the DPC++ runtime; all sycl::device instances are references to these singletons.
The next ingredient is the platform default context, which provides a canonical context to associate with any unpartitioned device. If both the DLPack exporter and the DLPack importer use this context, the USM allocation created by the exporter can be accessed by the importer.
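A minimal sketch of obtaining that canonical context, assuming the intel/llvm default-context extension that exposes sycl::platform::ext_oneapi_get_default_context() (not yet part of oneAPI 2021.3/2021.4, as noted above):

#include <CL/sycl.hpp>

// Return the platform default context for a root device, so that exporter and
// importer resolve to the same sycl::context for the same device.
inline sycl::context get_canonical_context(const sycl::device &root_dev) {
    return root_dev.get_platform().ext_oneapi_get_default_context();
}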
Now, answering your specific question, one can in fact use just one additional enum entry, kDLOneAPI.
// Create a root sycl::device from a DLPack device_id via the oneAPI filter-selector extension.
inline sycl::device get_sycl_device_by_device_id(unsigned int device_id) {
    return sycl::device( sycl::ext::oneapi::filter_selector{ std::to_string(device_id) } );
}
Mapping from a root device to a device_id is also well-defined, since sycl::get_devices() provides a stable ordering on the same platform and, as relevant to DLPack, in the same process.
In [1]: import dpctl
In [2]: dpctl.SyclDevice().get_filter_string()
Out[2]: 'level_zero:gpu:0'
In [3]: dpctl.SyclDevice().get_filter_string(include_backend=False, include_device_type=False)
Out[3]: '6'
In [4]: dpctl.get_devices()[6] == dpctl.SyclDevice()
Out[4]: True
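A C++ counterpart of the reverse mapping shown in the dpctl session above might look like the sketch below; it assumes the same stable ordering from sycl::device::get_devices() on both sides of the exchange:

#include <CL/sycl.hpp>
#include <cstddef>
#include <vector>

// Recover the DLPack device_id as the position of a root device in the
// runtime's device ordering; returns -1 if `dev` is not among the root devices.
inline int get_device_id(const sycl::device &dev) {
    const std::vector<sycl::device> devices = sycl::device::get_devices();
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i] == dev)
            return static_cast<int>(i);
    }
    return -1;
}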
DLPack cannot be used to hand off USM allocations created on sub-devices, or bound to non-canonical contexts.
> If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry in the DLDeviceType enum could suffice.
My suggestion to add 3 enum entries was aiming to keep the device type explicit, but since DLPack is not being directly used by users, I am fine with using only 1 enum.
Interesting, thanks for the detailed explanation @oleksandr-pavlyk. So, IIUC, if I can access a CPU, an Intel GPU, an NVIDIA GPU, and an FPGA in the same process, sycl::get_devices() can assign a unique device ID to each of these devices?
Another question with regard to how oneAPI is intended to use DLPack: Do you require manager_ctx to hold a pointer to a sycl::context, or is the look-up via the extension, as you suggested, sufficient?
@oleksandr-pavlyk Thank you for the detailed follow up on this. Given the flexibility to derive the sycl::device from the device id as you've shown, my preference would be to introduce the single DLDeviceType::kDLOneAPI as the initial OneAPI support in DLPack.
One more question for @oleksandr-pavlyk: Any chance you have verified the statements in https://github.com/dmlc/dlpack/pull/78#issuecomment-927953217?
I guess ultimately, to decide whether we wanna keep 1 enum vs 3 enums, we need to know:
- How would kDLONEAPI_CPU interact with kDLCPU?
- How would kDLONEAPI_GPU interact with kDLCUDA, kDLROCM and their host/managed counterparts, if any of them is used as the SYCL backend?
- How would kDLONEAPI_ACCELERATOR interact with kDLOpenCL?
The same question also applies to the unified kDLONEAPI.
Based on my reading, it seems to me it's easiest if we don't consider the interaction between oneAPI and other, non-oneAPI-based frameworks (because the oneAPI runtime is needed for the look-up). If that's the case, then indeed a single kDLONEAPI would be sufficient. But it's best to consider all the possibilities before moving forward.
The upside of keeping just DLDeviceType::kDLOneAPI is simplicity. The downside is that the oneAPI runtime is required to query the device type (is it a CPU device, or a GPU device?).
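A minimal sketch of that runtime query (assuming dev has been reconstructed from the DLPack device_id as above):

#include <CL/sycl.hpp>

// Ask the SYCL runtime whether the device behind a kDLOneAPI tensor is a GPU.
inline bool is_gpu_device(const sycl::device &dev) {
    return dev.get_info<sycl::info::device::device_type>() ==
           sycl::info::device_type::gpu;
}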
USM allocations made on a kDLONEAPI_CPU SYCL device are accessible by any host application, similarly to kDLCPU allocations, but they also allow for synchronizations if the exporter and the importer both use the oneAPI runtime.
Allocations made on DLDeviceType::kDLOneAPI devices are USM-based; hence they do not interoperate with kDLOpenCL allocations, which pass cl_mem objects (akin to sycl::buffer) rather than pointers.
If the oneAPI device has a CUDA backend (sycl::device::get_backend() == sycl::backend::cuda), one can use the USM pointer in CUDA RT functions, provided that the CUDA context stored in the sycl::context to which the USM allocation is bound, retrievable via sycl::get_native<sycl::backend::cuda>(ctx), is the CUDA device's primary context for this device (the one returned by cuCtxGetCurrent). If not, the importer may need to use cuCtxPushCurrent to make it current.
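A sketch of that importer-side step, assuming the experimental CUDA interop of the open-source DPC++ of that era, where sycl::get_native<sycl::backend::cuda>(ctx) yields a CUcontext (newer releases may expose this differently):

#include <CL/sycl.hpp>
#include <cuda.h>

// Make the CUcontext that owns the USM allocation current before calling CUDA APIs.
inline void make_usm_context_current(const sycl::context &ctx) {
    CUcontext native = sycl::get_native<sycl::backend::cuda>(ctx);
    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    if (current != native)
        cuCtxPushCurrent(native);
}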
@leofang I have expanded the comment in two ways: 1. to indicate that DLPack is sharing a USM allocation, 2. to note that a oneAPI runtime call is required to learn more about the device type as well as the USM allocation type.
Just for my own curiosity, if we store a sycl::context pointer in the DLManagedTensor.manager_ctx field, would it allow us to bypass oneAPI runtime calls when interfacing with CUDA/HIP?
I do not think so. The runtime is still needed to retrieve the native object, sycl::get_native<sycl::backend::cuda>(sycl_ctx).
@tqchen I believe we're good to go? 🙂
Thanks everyone, this is merged!
This PR proposes to extend the DLDeviceType enum with 3 new entries: kDLONEAPI_GPU, kDLONEAPI_CPU, and kDLONEAPI_ACCELERATOR.

This adds DLPack support for oneAPI SYCL root devices, addressable with a filter selector, e.g. sycl::ext::oneapi::filter_selector("gpu:device_id") for kDLONEAPI_GPU devices.

Two parties wishing to zero-copy exchange USM allocations using DLPack need to bind their allocations to the default platform context (implicitly used by the sycl::queue(dev) constructor), ensuring that both parties use the same sycl::context associated with the agreed-upon SYCL device, and thus that USM allocations made by one party are accessible to the other.

An application, apps/from_usm_ndarray, is included in this PR to demonstrate a working prototype, compiled with DPC++ 2021.3 or the open-source LLVM SYCL compiler release 2021-07.
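For illustration, the originally proposed DLDeviceType extension would have looked roughly like the sketch below (the discussion above converged on a single kDLOneAPI entry instead; the new numeric values shown here are placeholders, not the merged ones):

/* Sketch only; not the merged dlpack.h. */
typedef enum {
  /* ... existing entries such as kDLCPU = 1, kDLCUDA = 2, kDLOpenCL = 4, ... */
  kDLONEAPI_CPU = 14,
  kDLONEAPI_GPU = 15,
  kDLONEAPI_ACCELERATOR = 16
} DLDeviceType;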