K-Wu / pytorch-direct_dgl

PyTorch-Direct code on top of PyTorch 1.8.0 nightly (e152ca5) for "Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture" (accepted by PVLDB)
https://arxiv.org/abs/2103.03330

multi gpu #1

Open yufengwhy opened 2 years ago

yufengwhy commented 2 years ago

Can we use this code with multiple GPUs? If so, could you give some examples in the README? Thanks~

Say there are 1 billion nodes and 60 billion edges; the feature matrix would then be around 500 GB, while an A100 only has 80 GB of memory.
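
As a rough sanity check on that number (assuming, say, 128-dimensional float32 features, which is only an illustrative assumption and not stated above):

num_nodes = 1_000_000_000       # 1 billion nodes
feat_dim = 128                  # assumed feature dimension (illustrative only)
bytes_per_value = 4             # float32
print(num_nodes * feat_dim * bytes_per_value / 1e9)  # ~512 GB, in line with the ~500 GB figure above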

davidmin7 commented 2 years ago

Hi, thank you for your interest in our work. Yes, you can use multiple GPUs with this implementation by allocating a shared-memory space and pinning it with the unified tensor. However, we are pushing our idea into the DGL repository and some upgrades are coming soon (https://github.com/dmlc/dgl/pull/3616), so you can take a look there as well!

yufengwhy commented 2 years ago

I think huge matrix multiplication is a very basic operation and not specific to graphs, so couldn't it be implemented as a standalone building block rather than coupled with the graph?

davidmin7 commented 2 years ago

Hi, yes, the link above is about the graph side (if you need it), but DGL supports the unified tensor on its own. Please see the documentation here: https://docs.dgl.ai/en/latest/api/python/dgl.contrib.UnifiedTensor.html. You simply need to declare the unified tensor on a shared-memory space.

davidmin7 commented 2 years ago

Some quick examples (may need some syntax corrections):

For the single GPU case:

import dgl
import torch

def train(feat_matrix, ...):
    # Pin the host tensor and expose it to the GPU as a unified (zero-copy) tensor.
    feat_matrix_unified = dgl.contrib.UnifiedTensor(feat_matrix, device=torch.device('cuda'))
    # Access feat_matrix_unified from here,
    # e.g., a = feat_matrix_unified[indices]
    # !!! "indices" must be a CUDA tensor !!!

if __name__ == '__main__':
    ...
    feat_matrix = dataload()  # some user-defined function that loads the data into a torch tensor
    train(feat_matrix, ...)
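
To make the "indices must be a CUDA tensor" note concrete, a minimal usage sketch (the batch size and random indices below are just illustrative):

# Inside train(), after feat_matrix_unified has been created:
indices = torch.randint(0, feat_matrix.shape[0], (1024,), device='cuda')  # the index tensor lives on the GPU
batch_feats = feat_matrix_unified[indices]  # gathers the selected rows from pinned host memory into a GPU tensor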

For the multi GPU case:

import dgl
import torch
import torch.multiprocessing as mp

def train(feat_matrix, device, ...):
    torch.cuda.set_device(device)
    # Pin the shared host tensor and expose it to this process's GPU as a unified tensor.
    feat_matrix_unified = dgl.contrib.UnifiedTensor(feat_matrix, device=torch.device(device))
    # Access feat_matrix_unified from here,
    # e.g., a = feat_matrix_unified[indices]
    # !!! "indices" must be a CUDA tensor !!!

if __name__ == '__main__':
    ...
    feat_matrix = dataload()  # some user-defined function that loads the data into a torch tensor
    feat_matrix = feat_matrix.share_memory_()  # put the tensor in shared memory so all workers see one copy
    ...
    procs = []
    for proc_id in range(n_gpus):
        p = mp.Process(target=train, args=(feat_matrix, proc_id, ...))  # each worker gets its own device id
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
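
If it's more convenient, torch.multiprocessing.spawn should also work as the launcher (a rough sketch; train_worker and dataload here are placeholder names). spawn() passes the process index as the first argument to the target, so the worker takes the device id first:

import dgl
import torch
import torch.multiprocessing as mp

def train_worker(proc_id, feat_matrix):
    torch.cuda.set_device(proc_id)
    feat_matrix_unified = dgl.contrib.UnifiedTensor(feat_matrix, device=torch.device('cuda', proc_id))
    # ... training loop using feat_matrix_unified ...

if __name__ == '__main__':
    n_gpus = torch.cuda.device_count()
    feat_matrix = dataload().share_memory_()  # dataload() as in the examples above
    mp.spawn(train_worker, args=(feat_matrix,), nprocs=n_gpus, join=True)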

Hope this helps!