Open mfbalin opened 10 months ago
The main examples take two parameters: the storage device and the device. To take advantage of this optimization, we need to move the features to pinned memory even when the storage device is CPU.
Should we add one more parameter to `args.mode` that denotes the `feature_storage`? Then, we would have the following `graph-features-device` modes:

- cpu-cpu-cpu
- cpu-cpu-cuda
- cpu-pinned-cuda
- pinned-pinned-cuda
- cuda-pinned-cuda
- cuda-cuda-cuda
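A minimal sketch of what parsing such an extended mode string could look like (the `parse_mode` helper and the `VALID_MODES` set are hypothetical illustrations, not existing DGL code):

```python
# Hypothetical sketch: an extended args.mode that also encodes the feature
# storage, as a "<graph>-<features>-<device>" triple.
VALID_MODES = {
    "cpu-cpu-cpu", "cpu-cpu-cuda", "cpu-pinned-cuda",
    "pinned-pinned-cuda", "cuda-pinned-cuda", "cuda-cuda-cuda",
}

def parse_mode(mode: str):
    """Split a mode string into (graph_storage, feature_storage, device).

    "pinned" means the data stays on the host but in page-locked memory,
    which is what allows asynchronous host-to-device feature copies.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    graph_storage, feature_storage, device = mode.split("-")
    return graph_storage, feature_storage, device
```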
@frozenbugs @Rhett-Ying
Also, for each different combination of graph and feature storages, a `copy_to` needs to be inserted, which makes the examples quite bulky. We could probably insert it only at the end of the pipeline and let the dataloader move it further up. This would also fix #6981.
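The "insert it only at the end and let the dataloader move it further up" idea can be sketched roughly as follows (the stage names and the `hoist_copy_to` helper are made up for illustration; the real dataloader would rewrite a datapipe graph, not a list of strings):

```python
def hoist_copy_to(stages, features_pinned):
    """Move a trailing "copy_to" stage to just before feature fetching.

    When features are pinned, copying the minibatch to the GPU before the
    feature fetch lets the feature transfer overlap with GPU compute; when
    they are not pinned, the copy has to stay at the end of the pipeline.
    """
    stages = list(stages)
    if features_pinned and "copy_to" in stages and "fetch_features" in stages:
        stages.remove("copy_to")
        stages.insert(stages.index("fetch_features"), "copy_to")
    return stages
```

With this, an example can always append `copy_to` last, and the reordering decision stays in one place instead of being duplicated per mode.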
@frozenbugs should we make this a release blocker? The PyG advanced example already has this implemented, so we simply need to update the other examples in a similar manner.
🔨Work Item
IMPORTANT: Project tracker: https://github.com/orgs/dmlc/projects/2
Description
We should consider moving the `copy_to` operation before the feature fetch stage so that the overlap optimization is enabled for the cpu-cuda mode as well. However, the features need to be pinned for this to work. On my machine, I got a 2.2x speedup by doing so.

@Rhett-Ying @frozenbugs