Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Prefetch in LightningDataModule PR #4803

Closed kartik4949 closed 3 years ago

kartik4949 commented 3 years ago

🚀 Feature

Thinking of filing a PR for a prefetching feature in the dataloader module, which is a very useful tool for optimizing the data pipeline.

Motivation

As a TensorFlow user, I find the tf.data.Dataset prefetch feature a useful and easy optimization.

Pitch

Alternatives

Additional context

Option 1

```python
from pytorch_lightning import LightningDataModule

dm = MNISTDataModule(..., prefetch=True)
```

Option 2

```python
class MNISTDataModule(LightningDataModule):
    ...

    @property
    def default_transforms(self):
        if not TORCHVISION_AVAILABLE:
            return None
        if self.normalize:
            mnist_transforms = transform_lib.Compose(
                [transform_lib.ToTensor(), transform_lib.Normalize(mean=(0.5,), std=(0.5,))]
            )
        else:
            mnist_transforms = transform_lib.ToTensor()

        return mnist_transforms

    def optimize(self):
        optimizations = [self.prefetch('AUTO'), self.cache(10)]
        return optimizations
```
awaelchli commented 3 years ago

Hi, could you add more information on how this differs from the way PyTorch DataLoaders work, and more motivation for why such a feature should live in Lightning rather than in PyTorch or a separate library? Also, if you can, please add a prior-work section; I am sure there are libraries out there that do what you need. I propose to first make sure that existing methods work well with Lightning, and if they don't, we can see how to integrate better :)

kartik4949 commented 3 years ago

@awaelchli Hi,

  1. You can check tf.data and look at prefetch there; this feature reduces input-pipeline and graph bottlenecks.
  2. To my knowledge, the PyTorch DataLoader doesn't have prefetch support. In a PyTorch discussion thread titled "prefetch in pytorch", one of the Facebook AI Research developers answered: "there isn't a prefetch option, but you can write a custom Dataset that just loads the entire data on GPU and returns samples from in-memory. In that case you can just use 0 workers in your DataLoader" :)
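
For context, a minimal sketch of the workaround quoted above, assuming the dataset fits in GPU memory; the class name and shapes are illustrative only, not an existing API:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class InMemoryGPUDataset(Dataset):
    """Hypothetical dataset that keeps all samples on the GPU."""

    def __init__(self, data: torch.Tensor, targets: torch.Tensor, device: str = "cuda"):
        # One-time host-to-device transfer; afterwards __getitem__ is pure indexing.
        self.data = data.to(device)
        self.targets = targets.to(device)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]


# num_workers=0, since worker processes can't easily share CUDA tensors.
dataset = InMemoryGPUDataset(torch.randn(1000, 1, 28, 28), torch.zeros(1000, dtype=torch.long))
loader = DataLoader(dataset, batch_size=32, num_workers=0)
```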
rohitgr7 commented 3 years ago

Wouldn't it cause memory issues if the whole data is loaded into memory at once and the data is huge??

> which is a very useful tool for optimizing the data pipeline

What kind of optimization, specifically?

> one of the Facebook AI Research developers answered:

btw, that's a co-creator of PyTorch :smile:

kartik4949 commented 3 years ago

@rohitgr7 Co-creator, wow!!

  1. So, prefetching basically means prefetching 'n' samples from the data pipeline, where 'n' is either user defined or picked automatically based on compute/memory resources. Prefetching is done in two ways: `prefetch(auto)` -> the framework automatically picks n for us; `prefetch(1)` -> (recommended) prefetch one sample from the training/val pipeline while the graph computes.

  2. What kind of optimization? tensorflow.org says: "Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data."

[figures: timing diagrams comparing the prefetched pipeline vs. the naive pipeline]
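
For reference, here is what that pattern looks like in tf.data (assuming TensorFlow is installed; the map function is a stand-in for real preprocessing):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1000).map(lambda x: x * 2)  # stand-in preprocessing

# prefetch('AUTO'): let the runtime pick the buffer size
auto_ds = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# prefetch(1): keep exactly one element ready ahead of the consumer
one_ds = dataset.prefetch(1)
```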

Thanks :)

kartik4949 commented 3 years ago

@rohitgr7 So in a nutshell, it doesn't load the entire data at once :)

awaelchli commented 3 years ago

> Prefetching overlaps the preprocessing and model execution of a training step

This is already happening with PyTorch dataloaders. Setting `num_workers=x` will fork/spawn x processes that load data in parallel into a queue. See the section called "Single- and Multi-process Data Loading" in the data docs. I thought you were talking about device transfers?

~Btw, above you point to the wrong figures, even though the titles show which one is which.~

> `prefetch(1)` -> (recommended) prefetch one sample from the training/val pipeline while the graph computes.

The closest I could find is `DataLoader(num_workers=1, prefetch_factor=1)`; that's pretty much the same, right? src: https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader
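
A minimal example of that setup, with a dummy dataset standing in for real data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 10))
# One worker process, preloading one batch ahead of the training loop.
# prefetch_factor requires num_workers > 0 and is available in recent PyTorch versions.
loader = DataLoader(dataset, batch_size=32, num_workers=1, prefetch_factor=1)
```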

As you can see, I put links where I got my information from. Please do that as well, so we know where you got the information and figures from and can read up on it, thanks. I am not familiar with TF.

rohitgr7 commented 3 years ago

> `DataLoader(num_workers=1, prefetch_factor=1)`

TIL

kartik4949 commented 3 years ago

@awaelchli @rohitgr7 Hi, as @awaelchli pointed out, there can be two kinds of prefetch: one at the CPU level and another at the GPU level (a GPU-programming optimization).

  1. CPU prefetch: PyTorch already does this (I went back through the PyTorch discussions and codebase). src: https://discuss.pytorch.org/t/how-to-prefetch-data-when-processing-with-gpu/548
  2. GPU prefetch (device transfers, so we don't spend expensive transfer time from host to device/GPU): PyTorch is lacking this, and I don't know if such a feature should be added here. A rough sketch of the idea follows this list.
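
For illustration only, here is roughly what GPU-side prefetching could look like, modeled on the common CUDA-stream pattern (e.g. NVIDIA Apex's data prefetcher). It is not an existing Lightning or PyTorch API; it assumes each batch is a single tensor and that the wrapped loader uses `pin_memory=True` so the copies are truly asynchronous:

```python
import torch


class CUDAPrefetcher:
    """Copies the next batch to the GPU on a side stream while the
    current batch is still being consumed by the model."""

    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        # Issue the host-to-device copy asynchronously on the side stream.
        with torch.cuda.stream(self.stream):
            self.next_batch = batch.to(self.device, non_blocking=True)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # Make the default stream wait until the async copy has finished.
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()
        return batch
```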

Thanks, we can close this thread if we are not planning to add GPU prefetch support (which may not make sense here)! Nice discussion :)

awaelchli commented 3 years ago

It is possible to overlap data transfers and model compute with the non_blocking=True option (see https://pytorch.org/docs/stable/notes/cuda.html?highlight=non_blocking, section "Pinned Memory Buffers"). Lightning does this already, but it's not equivalent to a queue.

Since the bottleneck is often the CPU-side prefetching and processing of data, the transfers to the GPU can often be neglected. Memory pinning and the non_blocking option I just mentioned provide enough flexibility, at least in my experience. My guess is that this is why PyTorch doesn't have any special GPU-prefetching logic. A minimal example of the combination is below.
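
A small sketch of that combination (pinned memory plus a non-blocking transfer), with a dummy dataset standing in for real data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10))
# pin_memory=True puts batches in page-locked host memory, which is what
# makes the subsequent non_blocking copy actually asynchronous.
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for (batch,) in loader:
    batch = batch.to("cuda", non_blocking=True)  # overlaps with GPU compute
    # ... forward / backward ...
```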

That being said, we are of course always open to new features that remove bottlenecks and get the most out of the hardware :) If you (or someone else) can come up with a concrete idea and present a proof of concept with benchmarks, so that we can see the benefit of this GPU prefetching on a real example, then I would be more than happy to test it myself!

awaelchli commented 3 years ago

It might also be worth looking into DALI: https://developer.nvidia.com/DALI

kartik4949 commented 3 years ago

@awaelchli Will look into it and try to remove the bottlenecks. Thanks!

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!