dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.58k stars 3.02k forks

[GraphBolt][CUDA] `cpu-cuda` optimization #7005

Open mfbalin opened 10 months ago

mfbalin commented 10 months ago

🔨Work Item

IMPORTANT:

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

We should consider moving the copy_to operation before the feature fetch stage so that the overlap optimization is enabled for the cpu-cuda mode as well. However, the features need to be pinned for this to work. On my machine, I got a 2.2x speedup by doing so.

```python
if args.storage_device == "cpu":
    datapipe = datapipe.copy_to(device=device, extra_attrs=["input_nodes"])
```
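The overlap idea can be sketched outside the GraphBolt pipeline. This is plain PyTorch, not the actual DGL code: the features are kept in pinned host memory so the host-to-device copy of the fetched rows can be issued with `non_blocking=True` and overlap with other GPU work; the function name `fetch_features` is hypothetical.

```python
import torch

def fetch_features(features: torch.Tensor, input_nodes: torch.Tensor,
                   device: torch.device) -> torch.Tensor:
    # Gather the rows needed for this minibatch on the host.
    rows = features.index_select(0, input_nodes)
    # non_blocking=True only overlaps the copy with GPU work when the
    # source tensor lives in pinned (page-locked) host memory.
    return rows.to(device, non_blocking=True)

features = torch.arange(12, dtype=torch.float32).reshape(4, 3)
if torch.cuda.is_available():
    features = features.pin_memory()  # required for async H2D copies
    device = torch.device("cuda")
else:
    device = torch.device("cpu")  # CPU fallback so the sketch always runs

out = fetch_features(features, torch.tensor([2, 0]), device)
```

Without the pinning step, `non_blocking=True` silently degrades to a synchronous copy, which is why the issue notes the features must be pinned for the cpu-cuda mode to benefit.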

@Rhett-Ying @frozenbugs

mfbalin commented 10 months ago

The main examples take two parameters: the storage device and the device. To take advantage of this optimization, we need to move the features to pinned memory even if the storage device is cpu.

Should we add one more component to args.mode that denotes the feature storage? Then we would have the following modes:

graph-features-device: cpu-cpu-cpu, cpu-cpu-cuda, cpu-pinned-cuda, pinned-pinned-cuda, cuda-pinned-cuda, cuda-cuda-cuda.
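A minimal sketch of how such a three-part mode string could be parsed; `parse_mode` and `VALID_MODES` are hypothetical names, not part of the DGL API:

```python
# Hypothetical helper: split a mode string like "cpu-pinned-cuda" into
# (graph_device, feature_device, device), rejecting unknown combinations.
VALID_MODES = {
    "cpu-cpu-cpu", "cpu-cpu-cuda", "cpu-pinned-cuda",
    "pinned-pinned-cuda", "cuda-pinned-cuda", "cuda-cuda-cuda",
}

def parse_mode(mode: str) -> tuple[str, str, str]:
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    graph_device, feature_device, device = mode.split("-")
    return graph_device, feature_device, device
```

For example, `parse_mode("cpu-pinned-cuda")` yields `("cpu", "pinned", "cuda")`, i.e. graph on CPU, features in pinned host memory, computation on GPU.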

@frozenbugs @Rhett-Ying

mfbalin commented 10 months ago

Also, for each combination of graph and feature storage, a copy_to has to be inserted at a different point, which makes the examples quite bulky. We could probably insert it only at the end and let the dataloader move it further up the pipeline. This would also fix #6981.

mfbalin commented 7 months ago

@frozenbugs should we make this a release blocker? The PyG advanced example already implements this, so we simply need to update the other examples in the same manner.

https://github.com/dmlc/dgl/blob/c997434cdb97d386ba2ac1dfa0226e98df20c92d/examples/sampling/graphbolt/pyg/node_classification_advanced.py#L339-L352