gpzlx1 opened this issue 2 years ago
It seems that DGL spends too much time on checking. But sometimes developers can make sure everything is OK themselves. Theoretically, the DGL adaptor should be nearly zero-cost: just convert the tensors to `dgl.nd_array`.

Breaking it down in the DGL adaptor: [profiling breakdown plots]
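The conversion mentioned above is DGL's zero-copy DLPack path. A minimal sketch using the internal backend helper `zerocopy_to_dgl_ndarray` (a private API, shown here only to illustrate why the conversion itself is cheap):

```python
import torch
from dgl import backend as F

# A contiguous CUDA tensor produced by an external sampler.
row = torch.tensor([0, 1, 2], device="cuda")

# zerocopy_to_dgl_ndarray shares the underlying buffer via DLPack:
# for a contiguous tensor no data is copied, so the conversion
# is nearly free.
row_nd = F.zerocopy_to_dgl_ndarray(row)
```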
What is the tool you used to get those plots? Any reference link? Thanks
@decoherencer Just Nsight Systems and `nvtx`.

Nsight Systems: https://developer.nvidia.com/nsight-systems
`nvtx`: https://pytorch.org/docs/stable/cuda.html#nvidia-tools-extension-nvtx
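For example, NVTX ranges can be pushed around each stage so they show up as named spans on the Nsight Systems timeline (run under `nsys profile python train.py`). The function and its arguments below are illustrative, not code from this issue:

```python
import torch

def train_one_epoch(dataloader, model, optimizer, make_block):
    # Each pushed range appears as a named span in the Nsight
    # Systems timeline, making the per-stage cost easy to read off.
    for sampled in dataloader:
        torch.cuda.nvtx.range_push("adaptor/create_block")
        block = make_block(sampled)  # e.g. the DGL adaptor under test
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("forward+backward")
        loss = model(block)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.nvtx.range_pop()
```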
`my_create_block`. But yes, we don't have it in DGL. Either `create_block` or `to_block` is much heavier.
❓ Questions and Help
Now, there are many third-party graph sampling frameworks, like torch-quiver, which may be more flexible or deliver higher performance. DGL provides `create_block` to help developers build their adaptors. Unfortunately, compared with PyG, the DGL adaptor is rather heavy, due to its complex packaging of the graph.
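For concreteness, a minimal adaptor over a third-party sampler's COO output might look like the following; `build_block` and the tensors are illustrative:

```python
import dgl
import torch

def build_block(row, col, num_src, num_dst):
    # Package raw COO edges from an external GPU sampler
    # (e.g. torch-quiver) into a DGL message-flow block.
    # dgl.create_block validates and wraps the edges; that
    # packaging is the overhead discussed in this issue.
    return dgl.create_block(
        (row, col),
        num_src_nodes=num_src,
        num_dst_nodes=num_dst,
    )

# Hypothetical sampler output, already on the GPU:
row = torch.tensor([0, 1, 2], device="cuda")
col = torch.tensor([0, 0, 1], device="cuda")
block = build_block(row, col, num_src=3, num_dst=2)
```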
With `dataset = reddit`, `fan_out = [25, 10]`, `batch_size = 1024`, using torch-quiver to sample on the GPU and caching all data in GPU memory, the end-to-end training times for one epoch show that DGL (`create_block`) is 8.7% slower than PyG.
Breaking it down, we can find that in the DGL adaptor, `create_block` can account for 40%+ of the time spent in the sampling stage. It's too heavy. To reduce the overhead, I wrote a simple `create_block` called `my_create_block`. The code follows.
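A sketch of the idea: build the bipartite graph index directly from the sampled COO arrays and wrap it in a `DGLBlock`, skipping all validation. This assumes DGL's private `heterograph_index` module (circa DGL 0.8/0.9) and is not the verbatim original snippet:

```python
import torch
from dgl import heterograph_index
from dgl.heterograph import DGLBlock

def my_create_block(row, col, num_src, num_dst):
    # Build the bipartite (src -> dst) unit-graph index straight from
    # the sampler's COO arrays. Unlike dgl.create_block, this performs
    # no node-count or dtype checks.
    # NOTE: heterograph_index is a private DGL module, not a stable API.
    hgidx = heterograph_index.create_unitgraph_from_coo(
        2,                      # two node types: src and dst
        num_src, num_dst,
        row, col,
        ["coo", "csr", "csc"],  # formats the index may materialize
    )
    # Wrap the raw index in a block with default type names.
    return DGLBlock(hgidx, (["_N"], ["_N"]), ["_E"])
```

Since `create_unitgraph_from_coo` is a thin wrapper over the C API, the remaining Python-side cost is mostly the tensor-to-`dgl.nd_array` conversion mentioned earlier.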
Although DGL with `my_create_block` can reach 3.03 sec per epoch, `my_create_block` still takes 15.3% of the time in the sampling stage.

Is there any way to provide a more lightweight `create_block` API, or very low-level (C++ is OK) but high-performance APIs, so that developers can write an efficient adaptor?