dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[RFC] Intel GPU support #7195

Open · RafLit opened this issue 5 months ago

RafLit commented 5 months ago

🚀 Feature

Support for a new device in DGL: the Intel GPU. A proof of concept (POC) of Intel GPU support in DGL for the GraphSAGE model is available here: https://github.com/RafLit/dgl/tree/xpu_poc.

Motivation

DGL does not yet support Intel GPUs. Other graph deep learning frameworks already support Intel GPU devices, and adding support to DGL could benefit users.

Pitch

Using DGL on an Intel GPU could look like this:

```python
import torch as th
import intel_extension_for_pytorch  # registers the 'xpu' device with PyTorch
import dgl

device = 'xpu'
g = dgl.rand_graph(5, 8).to(device)             # random graph: 5 nodes, 8 edges
feat = th.randn((g.num_nodes(), 4)).to(device)  # node features on the XPU
conv = dgl.nn.SAGEConv(4, 1, 'gcn').to(device)  # GraphSAGE layer, 4 -> 1 features
res = conv(g, feat)
```

PyTorch tensors can be converted to DGL NDArrays through the TensorAdapter, extended with XPU support. The memory of a PyTorch XPU tensor can be transferred to DGL through the DLPack API, as sketched below.
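For illustration, the zero-copy hand-off could look like the following C++ sketch, assuming PyTorch's `at::toDLPack` (from `ATen/DLConvertor.h`) and DGL's internal `NDArray::FromDLPack`; the wrapper name `ToDGLNDArray` is hypothetical:

```cpp
#include <ATen/DLConvertor.h>     // at::toDLPack
#include <dgl/runtime/ndarray.h>  // dgl::runtime::NDArray

// Hand a PyTorch XPU tensor's memory to DGL without copying.
dgl::runtime::NDArray ToDGLNDArray(const at::Tensor& t) {
  // Export the tensor as a DLPack capsule; the capsule keeps the
  // underlying storage alive until the consumer releases it.
  DLManagedTensor* dlm = at::toDLPack(t);
  // Wrap the capsule: DGL and PyTorch now share the same XPU buffer.
  return dgl::runtime::NDArray::FromDLPack(dlm);
}
```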

DGL NDArrays and graphs can be moved to the XPU through the XPU_Device_API (part of the linked POC), which uses a SYCL queue fetched from IPEX to copy memory. The SYCL queue corresponding to the XPU device can be obtained through the IPEX C++ API with `sycl::queue& queue = xpu::get_queue_from_stream(c10_stream);`. New NDArrays can be allocated on the XPU through the same device API.
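As a rough illustration of what such a device API does internally, the allocate and copy paths reduce to SYCL unified-shared-memory (USM) calls on that queue; the helper names below are hypothetical, not the POC's actual interface:

```cpp
#include <sycl/sycl.hpp>

// Allocate device memory on the XPU via SYCL USM.
void* XpuAlloc(sycl::queue& q, size_t nbytes) {
  return sycl::malloc_device(nbytes, q);
}

// With USM pointers a single queue.memcpy covers H2D, D2H, and D2D;
// .wait() keeps the sketch simple and synchronous.
void XpuCopy(sycl::queue& q, void* dst, const void* src, size_t nbytes) {
  q.memcpy(dst, src, nbytes).wait();
}

// Release the allocation through the same queue's context.
void XpuFree(sycl::queue& q, void* ptr) {
  sycl::free(ptr, q);
}
```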

In the POC we implemented the operations that the GraphSAGE example dispatches to the GPU device. We are going to implement additional ops to enable standard GNN models such as GCN, GAT, R-GCN, and R-GAT. We also plan to implement accelerated GNN ops in an external linear algebra library, XeTLA, which can be added as a third-party submodule.

Additional context

The new device would need to be integrated into the existing device dispatch mechanism, which creates a challenge in supporting various device configurations. Currently DGL uses the ATEN_XPU_SWITCH and ATEN_XPU_SWITCH_CUDA macros to dispatch to a specific device, and the DGL_USE_CUDA flag indicates whether CUDA is to be used. Adding a new device to the current dispatch therefore means either adding multiple new macros, one for each possible configuration, or adding the new device to the existing CUDA switch and implementing dummy functions for each supported operator. It would be helpful to know which of these is the preferable solution, or whether a more extensible approach is feasible; one possible direction is sketched below.
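To make the trade-off concrete, a generalized switch could compile each backend branch in or out with per-device helper macros, so adding a device adds one branch rather than a new macro per configuration. A minimal sketch, where `ATEN_DEVICE_SWITCH`, `DGL_USE_XPU`, `DGL_IF_*`, and `kDGLXPU` are hypothetical names rather than existing DGL code:

```cpp
// Expand a branch only when the corresponding backend is built in.
#ifdef DGL_USE_CUDA
#define DGL_IF_CUDA(...) __VA_ARGS__
#else
#define DGL_IF_CUDA(...)
#endif
#ifdef DGL_USE_XPU  // hypothetical build flag for the Intel GPU backend
#define DGL_IF_XPU(...) __VA_ARGS__
#else
#define DGL_IF_XPU(...)
#endif

// One dispatch macro for every build configuration: unsupported devices
// fail with a runtime error, and no dummy per-operator stubs are needed.
#define ATEN_DEVICE_SWITCH(val, XPU, op, ...)                   \
  do {                                                          \
    if ((val) == kDGLCPU) {                                     \
      constexpr auto XPU = kDGLCPU;                             \
      { __VA_ARGS__ }                                           \
    }                                                           \
    DGL_IF_CUDA(else if ((val) == kDGLCUDA) {                   \
      constexpr auto XPU = kDGLCUDA;                            \
      { __VA_ARGS__ }                                           \
    })                                                          \
    DGL_IF_XPU(else if ((val) == kDGLXPU) {                     \
      constexpr auto XPU = kDGLXPU;                             \
      { __VA_ARGS__ }                                           \
    })                                                          \
    else {                                                      \
      LOG(FATAL) << "Operator " << (op) << " does not support " \
                 << (val) << " device.";                        \
    }                                                           \
  } while (0)
```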

greatzyq525 commented 5 months ago

@Linaom1214

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.