dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

More lightweight create_block is needed. #3960

Open gpzlx1 opened 2 years ago

gpzlx1 commented 2 years ago

❓ Questions and Help

There are now many third-party graph sampling frameworks, such as torch-quiver, which can be more flexible or faster. DGL provides create_block to help developers build their adaptors. Unfortunately, compared with PyG, a DGL adaptor is rather heavyweight because of the complex packaging it has to do to construct a graph.
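
For reference, the baseline adaptor path measured below is roughly the following (a minimal sketch using the public API; the function and tensor names are placeholders, not part of DGL or torch-quiver):

```python
import torch
import dgl

def to_dgl_block(src, dst, num_src_nodes, num_dst_nodes):
    # Wrap the COO edges returned by an external sampler into a DGL
    # message-flow graph (block) via the public API.
    return dgl.create_block(
        (src, dst), num_src_nodes=num_src_nodes, num_dst_nodes=num_dst_nodes)

# Dummy edges standing in for a sampler's output.
src = torch.tensor([0, 1, 2, 2])
dst = torch.tensor([0, 0, 1, 1])
block = to_dgl_block(src, dst, num_src_nodes=3, num_dst_nodes=2)
```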

For GraphSAGE on the Reddit dataset with fan_out = [25, 10] and batch_size = 1024, using torch-quiver sampling on the GPU and caching all data in GPU memory, the end-to-end training time per epoch is:

| DGL (create_block) | PyG | DGL (my_create_block) |
| --- | --- | --- |
| 3.13 sec | 2.88 sec | 3.03 sec |

DGL(create_block) is 8.7% slower than PyG.

Breaking the time down, we find that in the DGL adaptor, create_block alone can account for more than 40% of the sampling-stage time. It is too heavy.

(profiler screenshot: breakdown of the sampling stage with dgl.create_block)

To reduce the overhead, I wrote a simple create_block called my_create_block. The code is as follows:

import torch
import dgl
from dgl.heterograph import DGLBlock

def my_create_block(arrays, num_src_nodes, num_dst_nodes):
    # Build the unit-graph index directly from the sampled COO arrays,
    # skipping the validation done by dgl.create_block.
    torch.cuda.nvtx.range_push('1')
    hgidx = dgl.heterograph_index.create_unitgraph_from_coo(
        2, num_src_nodes, num_dst_nodes, arrays[0], arrays[1],
        ['coo', 'csr', 'csc'], row_sorted=False, col_sorted=True)
    torch.cuda.nvtx.range_pop()

    # Wrap the index in a DGLBlock with the default node/edge type names.
    torch.cuda.nvtx.range_push('2')
    retg = DGLBlock(hgidx, (['_N'], ['_N']), ['_E'])
    torch.cuda.nvtx.range_pop()

    return retg
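
It simply replaces dgl.create_block on the sampler output, e.g. (the dummy tensors below stand in for whatever the third-party sampler returns):

```python
import torch

# Placeholder COO edges; a real adaptor would pass the GPU sampler's tensors.
# Note that my_create_block assumes the columns (dst) are already sorted.
src = torch.tensor([0, 1, 2, 2], device='cuda')
dst = torch.tensor([0, 0, 1, 1], device='cuda')

block = my_create_block((src, dst), num_src_nodes=3, num_dst_nodes=2)
```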

Although DGL (my_create_block) reaches 3.03 sec per epoch, my_create_block still accounts for 15.3% of the sampling-stage time.

(profiler screenshot: breakdown of the sampling stage with my_create_block)

Is there any way to provide a more lightweight create_block API, or very low-level (C++ is fine) but high-performance APIs, so that developers can write an efficient adaptor?

gpzlx1 commented 2 years ago

It seems that DGL spends too much time on checking. But sometimes developers can guarantee that everything is already valid. Theoretically, the DGL adaptor should be nearly zero-cost: just convert the tensors to DGL NDArrays.
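
Something like the following sketch is what I have in mind (it relies on an internal backend helper, so the exact name may differ across DGL versions):

```python
import torch
import dgl.backend as F  # dispatches to the active (PyTorch) backend

# Zero-copy view of a GPU tensor as a DGL NDArray (DLPack under the hood).
# zerocopy_to_dgl_ndarray is an internal API, so treat this as a sketch only.
src = torch.tensor([0, 1, 2, 2], device='cuda')
src_nd = F.zerocopy_to_dgl_ndarray(src)
```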

decoherencer commented 2 years ago

> Breaking the time down, we find that in the DGL adaptor

Which tool did you use to get those plots? Any reference link? Thanks.

gpzlx1 commented 2 years ago

@decoherencer Just Nsight Systems and NVTX.
Nsight Systems: https://developer.nvidia.com/nsight-systems
NVTX: https://pytorch.org/docs/stable/cuda.html#nvidia-tools-extension-nvtx
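
Roughly, you wrap the regions of interest with NVTX ranges and then run the script under Nsight Systems; for example (label, script, and report names are arbitrary):

```python
import torch

# Mark a region so it shows up as a named range in the Nsight Systems timeline.
torch.cuda.nvtx.range_push('sampling')
# ... sampler call + create_block / my_create_block ...
torch.cuda.nvtx.range_pop()

# Then profile the training script with, e.g.:
#   nsys profile -t cuda,nvtx -o report python train.py
# and open the generated report in the Nsight Systems GUI.
```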

yaox12 commented 1 year ago