dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

More lightweight create_block is needed. #3960

Open gpzlx1 opened 2 years ago

gpzlx1 commented 2 years ago

❓ Questions and Help

There are now many third-party graph sampling frameworks, such as torch-quiver, which can be more flexible or faster. DGL provides create_block to help developers build their adaptors. Unfortunately, compared with PyG, a DGL adaptor is rather heavyweight because of the complex packaging it has to do to construct a graph.
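
For reference, the baseline adaptor path measured below is roughly the following (a minimal sketch using the public API; the function and tensor names are placeholders, not part of DGL or torch-quiver):

```python
import torch
import dgl

def to_dgl_block(src, dst, num_src_nodes, num_dst_nodes):
    # Wrap the COO edges returned by an external sampler into a DGL
    # message-flow graph (block) via the public API.
    return dgl.create_block(
        (src, dst), num_src_nodes=num_src_nodes, num_dst_nodes=num_dst_nodes)

# Dummy edges standing in for a sampler's output.
src = torch.tensor([0, 1, 2, 2])
dst = torch.tensor([0, 0, 1, 1])
block = to_dgl_block(src, dst, num_src_nodes=3, num_dst_nodes=2)
```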

For GraphSAGE on the Reddit dataset with fan_out = [25, 10] and batch_size = 1024, using torch-quiver sampling on the GPU and caching all data in GPU memory, the end-to-end training time per epoch is:

| DGL (create_block) | PyG | DGL (my_create_block) |
| --- | --- | --- |
| 3.13 sec | 2.88 sec | 3.03 sec |

DGL(create_block) is 8.7% slower than PyG.

Breaking the time down, we find that in the DGL adaptor, create_block alone can account for more than 40% of the sampling-stage time. It is too heavy.

(profiler screenshot: breakdown of the sampling stage with dgl.create_block)

To reduce the overhead, I wrote a simple create_block called my_create_block. The code is as follows:

import torch
import dgl
from dgl.heterograph import DGLBlock

def my_create_block(arrays, num_src_nodes, num_dst_nodes):
    # Build the unit-graph index directly from the sampled COO arrays,
    # skipping the validation done by dgl.create_block.
    torch.cuda.nvtx.range_push('1')
    hgidx = dgl.heterograph_index.create_unitgraph_from_coo(
        2, num_src_nodes, num_dst_nodes, arrays[0], arrays[1],
        ['coo', 'csr', 'csc'], row_sorted=False, col_sorted=True)
    torch.cuda.nvtx.range_pop()

    # Wrap the index in a DGLBlock with the default node/edge type names.
    torch.cuda.nvtx.range_push('2')
    retg = DGLBlock(hgidx, (['_N'], ['_N']), ['_E'])
    torch.cuda.nvtx.range_pop()

    return retg
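
It simply replaces dgl.create_block on the sampler output, e.g. (the dummy tensors below stand in for whatever the third-party sampler returns):

```python
import torch

# Placeholder COO edges; a real adaptor would pass the GPU sampler's tensors.
# Note that my_create_block assumes the columns (dst) are already sorted.
src = torch.tensor([0, 1, 2, 2], device='cuda')
dst = torch.tensor([0, 0, 1, 1], device='cuda')

block = my_create_block((src, dst), num_src_nodes=3, num_dst_nodes=2)
```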

Although DGL (my_create_block) reaches 3.03 sec per epoch, my_create_block still accounts for 15.3% of the sampling-stage time.

(profiler screenshot: breakdown of the sampling stage with my_create_block)

Is there any way to provide a more lightweight create_block API, or very low-level (C++ is fine) but high-performance APIs, so that developers can write an efficient adaptor?

gpzlx1 commented 2 years ago

It seems that DGL spends too much time on checking. But sometimes developers can guarantee that everything is already valid. Theoretically, the DGL adaptor should be nearly zero-cost: just convert the tensors to DGL NDArrays.
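
Something like the following sketch is what I have in mind (it relies on an internal backend helper, so the exact name may differ across DGL versions):

```python
import torch
import dgl.backend as F  # dispatches to the active (PyTorch) backend

# Zero-copy view of a GPU tensor as a DGL NDArray (DLPack under the hood).
# zerocopy_to_dgl_ndarray is an internal API, so treat this as a sketch only.
src = torch.tensor([0, 1, 2, 2], device='cuda')
src_nd = F.zerocopy_to_dgl_ndarray(src)
```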

decoherencer commented 2 years ago

> Breaking the time down, we find that in the DGL adaptor

Which tool did you use to get those plots? Any reference link? Thanks.

gpzlx1 commented 2 years ago

@decoherencer Just Nsight Systems and NVTX.
Nsight Systems: https://developer.nvidia.com/nsight-systems
NVTX: https://pytorch.org/docs/stable/cuda.html#nvidia-tools-extension-nvtx
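
Roughly, you wrap the regions of interest with NVTX ranges and then run the script under Nsight Systems; for example (label, script, and report names are arbitrary):

```python
import torch

# Mark a region so it shows up as a named range in the Nsight Systems timeline.
torch.cuda.nvtx.range_push('sampling')
# ... sampler call + create_block / my_create_block ...
torch.cuda.nvtx.range_pop()

# Then profile the training script with, e.g.:
#   nsys profile -t cuda,nvtx -o report python train.py
# and open the generated report in the Nsight Systems GUI.
```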

yaox12 commented 1 year ago