dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Dataloaders making use of DDP and _ScalarDataBatcher are not memory scalable, currently O(N*P). #3357

Closed nv-dlasalle closed 2 years ago

nv-dlasalle commented 3 years ago

🐛 Bug

Currently, when use_ddp=True, the dataloaders use O(N*P) memory in total and spend O(N) time shuffling the dataset when using P processes.

This problem is most prominent when using an EdgeDataLoader with multiple GPUs, where the number of input edges is on the same order as the number of edges in the graph, due to the way the permutation is generated (from https://github.com/dmlc/dgl/blob/master/python/dgl/dataloading/pytorch/dataloader.py#L134):

        if self.shuffle:
            # deterministically shuffle based on epoch and seed
            g = th.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = th.randperm(len(self.dataset), generator=g)
        else:
            indices = th.arange(len(self.dataset))

        if not self.drop_last:
            # add extra samples to make it evenly divisible
            indices = th.cat([indices, indices[:(self.total_size - indices.shape[0])]])
        else:
            # remove tail of data to make it evenly divisible.
            indices = indices[:self.total_size]
        assert indices.shape[0] == self.total_size

This allocates an array as long as the entire training dataset in each process. When using 8 training processes on a dense graph, the dataloader alone can end up using 8 times more memory than storing the graph itself.
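
As a rough illustration with made-up numbers: for a graph with 1 billion training edges, the int64 index tensor is about 8 GB per process, so 8 DDP workers hold roughly 64 GB of indices before any sampling even starts.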

This problem also appears to be present when _ScalarDataBatcher is not used, but I believe that is a problem with how PyTorch creates a list of batch items, and is outside the scope of DGL.

Expected behavior

From a user perspective, it would be expected that using more GPUs would not significantly increase the amount of CPU memory required.

Environment

Additional context

A memory-scalable way to generate the indices is needed: either split the target edges/nodes statically among processes and only shuffle within each process, or have the processes cooperatively generate a split permutation vector, such that each process only holds N/P indices.

For graphs with fewer mini-batches, this O(N*P) shuffling also takes a significant amount of time per epoch in the multi-GPU setting, since each process always pays O(N) regardless of how many processes are used.
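
As a point of reference, here is a minimal sketch of the first option (the StaticSplitSampler name and its interface are hypothetical, not an existing DGL API): each process keeps only its own stripe of the training IDs and reshuffles just that stripe every epoch, so per-process memory and shuffle time are both O(N/P).

    import torch as th

    class StaticSplitSampler:
        """Hypothetical sketch: statically stripe the training IDs across
        DDP ranks once, then shuffle only the local O(N/P) slice each epoch."""
        def __init__(self, train_ids, rank, world_size, seed=0):
            # Striping gives every rank only an O(N/P) slice; clone() so the
            # slice owns its storage instead of keeping the full tensor alive.
            self.local_ids = train_ids[rank::world_size].clone()
            self.seed = seed
            self.epoch = 0

        def set_epoch(self, epoch):
            # Mirrors DistributedSampler.set_epoch for deterministic shuffling.
            self.epoch = epoch

        def __iter__(self):
            g = th.Generator()
            g.manual_seed(self.seed + self.epoch)
            # Permutation over the local slice only: O(N/P) time and memory.
            perm = th.randperm(len(self.local_ids), generator=g)
            return iter(self.local_ids[perm].tolist())

        def __len__(self):
            return len(self.local_ids)

The trade-off is that samples never migrate between ranks across epochs, which is exactly the convergence question discussed below.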

BarclayII commented 3 years ago

Seems that PyTorch's DistributedSampler has the same problem (the code above is borrowed from PyTorch). How do they handle multi-GPU shuffling with a large dataset?

nv-dlasalle commented 3 years ago

I suspect it targets datasets made up of images or text/audio, where the size of the index is orders of magnitude smaller than the dataset itself. I think this is a problem specific to GNNs.

nv-dlasalle commented 3 years ago

If we statically split the train_nid (stripe it across the training processes), and then only permute it within each process at each epoch, is there a concern this could interfere with convergence?

yaox12 commented 2 years ago

Is there a concern this could interfere with convergence?

@nv-dlasalle This may not harm performance in practice. For example, this codebase uses a static split of node IDs, and I think the graph partitioning in DistDGL is similar. But one thing I'm concerned about is whether we, as a DL framework provider, should make such a compromise (to some extent).

To fix this, I also have some questions:

  1. Should we take care of the case when _ScalarDataBatcher is not used? It's outside the scope of DGL, and trying to solve it may be a bit complicated.
  2. To have processes cooperatively generate a split permutation vector, I think we need a shared vector of all train_ids, which costs O(n) extra space (constant with respect to the number of processes). Do you consider this acceptable? (See the sketch below.)
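
A rough sketch of what option 2 could look like (purely illustrative, calling torch.distributed directly rather than any existing DGL API): one source rank draws the full permutation each epoch and scatters equally sized chunks, so every other rank only ever holds its own O(n/p) slice.

    import torch as th
    import torch.distributed as dist

    def scatter_epoch_indices(num_items, epoch, seed=0, src=0):
        # Assumes dist.init_process_group() has already been called
        # (e.g. the gloo backend for CPU index tensors).
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        per_rank = num_items // world_size  # drop the tail for simplicity

        local = th.empty(per_rank, dtype=th.int64)
        if rank == src:
            g = th.Generator()
            g.manual_seed(seed + epoch)  # same epoch-based seeding as the current loader
            perm = th.randperm(num_items, generator=g)[:per_rank * world_size]
            chunks = list(perm.chunk(world_size))  # equal chunks of size n/p
        else:
            chunks = None
        # After the scatter, each rank holds only its own slice of the permutation.
        dist.scatter(local, scatter_list=chunks, src=src)
        return local

This keeps steady-state memory at O(n/p) per rank, although the source rank still peaks at O(n) while drawing the permutation.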

BarclayII commented 2 years ago

I'm supportive of using static splits. One thing I'm wondering though: when use_ddp=True, do we still have the problem even if we are not using _ScalarDataBatcher (i.e. heterogeneous graphs)?

Another concern of _ScalarDataBatcher is that it may interfere with PyTorch's minibatch sampler (e.g. #3431 ).

Costing O(n) seems acceptable versus O(n*p).

nv-dlasalle commented 2 years ago

@BarclayII It looks to me like PyTorch's DistributedSampler suffers from the same problem (https://github.com/pytorch/pytorch/blob/v1.8.2/torch/utils/data/distributed.py#L96), but I think it assumes that the size of the indices is orders of magnitude smaller than the size of the dataset. For single-layer unsupervised GNNs with fanout, however, this isn't the case.

I agree, I think costing O(n) space isn't a problem. It would be nice if the computation to generate the permutation were O(n/p), so as to not negatively impact scalability as we go from one GPU to many.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] commented 2 years ago

This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.