dmlc / dlpack

common in-memory tensor structure
https://dmlc.github.io/dlpack/latest
Apache License 2.0

Add support for ORT device #121

Closed. kyule7 closed this 1 year ago.

kyule7 commented 1 year ago

This PR extends DLDeviceType to support the onnxruntime (ORT) device. The ORT device is a device abstraction used to integrate ORT eager-mode execution into AI frameworks via dlpack (e.g. torch_ort).
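
For context, the change under discussion boils down to adding one entry to the DLDeviceType enum in dlpack.h. A minimal sketch of what such an extension could look like is below (abbreviated); the name kDLORT and the value 17 are illustrative assumptions, not necessarily what this PR proposed:

```c
/* dlpack.h (abbreviated): DLDeviceType with a hypothetical ORT entry.
 * kDLORT and its numeric value are assumptions for illustration only. */
typedef enum {
  kDLCPU = 1,       /* CPU device */
  kDLCUDA = 2,      /* CUDA GPU device */
  kDLCUDAHost = 3,  /* pinned CUDA host memory */
  /* ... other existing device types elided ... */
  kDLORT = 17,      /* hypothetical: onnxruntime abstract device */
} DLDeviceType;
```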

tqchen commented 1 year ago

Thanks @kyule7 .

One note is that ORT itself is a software indirection layer rather than a driver layer.

While it is understandable to have such a layer, the indirection also prevents sharing of the underlying device storage. For example, if ORT has a CUDA device under the hood, its storage cannot be shared with PyTorch's CUDA storage even though both leverage CUDA.

DLPack is an exchange format, so one possible criterion would be framework adoption status, plus a public set of driver APIs (to copy, read/write memory, and launch operators) on these devices. Additionally, one goal would be to have the device recognized by different frameworks so that they can exchange tensors with each other via the same device abstraction. For long-term stability, having multiple frameworks use the lowest-level device abstraction possible would help (e.g. everyone exports as CUDA, or as another NPU's memory type when it can also be used by another framework) when feasible.
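
To make the lowest-level-abstraction point concrete, here is a minimal sketch of a producer exporting a CUDA-backed buffer with device type kDLCUDA. The function name and the 1-D float32 layout are hypothetical, but because the DLDevice records the physical device, any CUDA-aware consumer (e.g. PyTorch) can take the memory zero-copy:

```c
#include <stdlib.h>
#include <dlpack/dlpack.h>

/* Called by the consumer when it is done with the tensor. */
static void deleter(DLManagedTensor *self) {
  free(self->dl_tensor.shape);
  free(self);
}

/* Sketch: wrap an existing CUDA allocation (`cuda_ptr`, `n` floats on
 * device `device_id`) as a DLManagedTensor at the lowest-level device
 * type, kDLCUDA, rather than behind an indirection layer. */
DLManagedTensor *export_cuda_buffer(void *cuda_ptr, int64_t n, int device_id) {
  DLManagedTensor *mt = malloc(sizeof(DLManagedTensor));
  mt->dl_tensor.data = cuda_ptr;
  mt->dl_tensor.device = (DLDevice){kDLCUDA, device_id};
  mt->dl_tensor.ndim = 1;
  mt->dl_tensor.dtype = (DLDataType){kDLFloat, 32, 1};  /* float32, 1 lane */
  mt->dl_tensor.shape = malloc(sizeof(int64_t));
  mt->dl_tensor.shape[0] = n;
  mt->dl_tensor.strides = NULL;  /* NULL means compact row-major */
  mt->dl_tensor.byte_offset = 0;
  mt->manager_ctx = NULL;
  mt->deleter = deleter;
  return mt;
}
```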

kyule7 commented 1 year ago

Hi @tqchen, I appreciate the comments!

I like the goal of a device abstraction that is recognized by different frameworks and allows exchange among them, but I also see challenges due to the different constraints/requirements of different devices. As you mentioned, ORT is an indirection layer that optimizes and executes an AI model with a backend (CPU, CUDA, etc.) as supported and configured for onnxruntime.

W.r.t. the CUDA example, tensors created by pytorch or onnxruntime can be exchanged with each other, converted through dlpack; this requires conversion support both for pytorch <-> dlpack and for dlpack <-> onnxruntime.

W.r.t. the ORT device mentioned in the PR description, it is the "ort" device supported by (and limited to) pytorch, which is used as the dispatch key for ort eager-mode execution. Operators on the "ort" device are mapped to a backend configured for ort. There is limited support for tensor exchange between pytorch <-> ort (e.g. CPU/CUDA), and allowing an ort device type in dlpack would let a custom device be mapped to the abstract device type "ort" for dlpack/pytorch.
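
A rough sketch of the consumer-side implication, assuming the hypothetical kDLORT entry from above: a physical device type can be wrapped directly, while an indirection device type would first need resolving to its backing device, which is the kind of mapping torch_ort performs for the "ort" dispatch key. The function name here is illustrative:

```c
#include <dlpack/dlpack.h>

/* Sketch: a DLPack consumer deciding whether it can wrap the memory
 * without an extra resolution step. */
int can_import_zero_copy(const DLManagedTensor *mt) {
  switch (mt->dl_tensor.device.device_type) {
    case kDLCPU:
    case kDLCUDA:
      return 1;  /* physical device: wrap the pointer directly */
    default:
      /* e.g. a hypothetical kDLORT: an indirection device would need an
       * extra step to discover whether the data actually lives on CPU,
       * CUDA, or some other backend before it can be shared. */
      return 0;
  }
}
```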

kyule7 commented 1 year ago

Closing the PR, since the changes cannot be merged in the near term.