IgorSusmelj / pytorch-styleguide

An unofficial styleguide and best practices summary for PyTorch
GNU General Public License v3.0

Should we use BackgroundGenerator when we already have DataLoader? #5

Open yzhang1918 opened 5 years ago

yzhang1918 commented 5 years ago

I really enjoy this guide! However, I am not sure what the advantage of prefetch_generator is. It seems that the DataLoader in PyTorch already supports prefetching.

Thank you!

IgorSusmelj commented 5 years ago

To the best of my knowledge, the DataLoader in PyTorch creates a set of worker threads which all prefetch new data at once when all workers are empty.

So if, for example, you create 8 worker threads:

  1. All 8 threads prefetch data.
  2. Until you empty all of them (by running, for example, 8 training iterations), none of the workers fetches new data.

Using the prefetch generator, we make sure that each of those workers always has at least one additional data item loaded.
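
As a rough sketch of the idea (this is not the exact code from the notebooks below; the random dataset is just a placeholder):

# Minimal sketch: wrap the DataLoader with BackgroundGenerator so the next
# batch is fetched in a background thread while the current batch is used.
# The random TensorDataset is only a placeholder.
import torch
from torch.utils.data import DataLoader, TensorDataset
from prefetch_generator import BackgroundGenerator

dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=8)

for images, labels in BackgroundGenerator(loader):
    pass  # forward/backward would run here while the next batch is prefetched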

You can see this behavior if you use a very shallow network, where compute is cheap and data loading dominates.

I have two Colab notebooks here (based on the CIFAR10 example from the official tutorial):

Here with the plain DataLoader and 2 workers: https://colab.research.google.com/drive/10wJIfCw5moPc-Yx9rSqWFEXkNceAOPpc

Here with the additional prefetch_generator: https://colab.research.google.com/drive/1WQ8c-RIZ7FMhfsm8dtRpsqiIR_KuZ49Z

Output without prefetch_generator     Output with prefetch_generator
Compute efficiency: 0.09, iter 1      Compute efficiency: 0.61, iter 1
Compute efficiency: 0.98, iter 2      Compute efficiency: 0.99, iter 2
Compute efficiency: 0.61, iter 3      Compute efficiency: 0.98, iter 3
Compute efficiency: 0.98, iter 4      Compute efficiency: 0.99, iter 4
Compute efficiency: 0.67, iter 5      Compute efficiency: 0.99, iter 5
Compute efficiency: 0.71, iter 6      Compute efficiency: 0.99, iter 6
Avg time per epoch: 328ms             Avg time per epoch: 214ms

This is why keeping track of compute time vs. data-loading time (aka compute efficiency) is important. In this simple example, we even save a lot of training time.
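
For reference, the compute efficiency number can be measured roughly like this (just a sketch; the notebooks may use a slightly different formula, and the loader and compute step below are placeholders):

# Sketch: per-iteration "compute efficiency" = compute_time / (data_time + compute_time).
# The notebooks may compute this slightly differently; the loader and the
# compute step below are placeholders.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(
    TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,))),
    batch_size=64,
    num_workers=2,
)

end = time.time()
for i, (images, labels) in enumerate(loader, 1):
    data_time = time.time() - end            # time spent waiting for the batch
    step_start = time.time()
    _ = images.pow(2).sum()                  # placeholder for forward/backward/step
    compute_time = time.time() - step_start
    print(f"Compute efficiency: {compute_time / (data_time + compute_time):.2f}, iter {i}")
    end = time.time()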

If anyone knows how to fix this behavior in the PyTorch data loader, let me know :)

yzhang1918 commented 5 years ago

Thank you for your wonderful example! Now I use the following class to replace the default DataLoader everywhere in my code. XD

from torch.utils.data import DataLoader
from prefetch_generator import BackgroundGenerator


class DataLoaderX(DataLoader):
    """DataLoader that prefetches batches in a background thread."""

    def __iter__(self):
        # Wrap the parent iterator so the next batch is already being loaded
        # while the current one is consumed.
        return BackgroundGenerator(super().__iter__())
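
Since DataLoaderX only overrides __iter__, it takes the same constructor arguments as DataLoader and can be used as a drop-in replacement (train_dataset below is just a placeholder):

# Drop-in usage; train_dataset is a placeholder for your own Dataset.
train_loader = DataLoaderX(train_dataset, batch_size=64, shuffle=True, num_workers=4)
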
ryul99 commented 4 years ago

I had a problem using BackgroundGenerator with PyTorch DistributedDataParallel (DDP). When I turned on both DDP and BackgroundGenerator and iterated over the dataloader, processes that are not rank 0 loaded something onto the rank-0 GPU. I solved this by turning off BackgroundGenerator when I use DDP.
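
Roughly, my workaround looks like this (the dataset and DataLoader arguments are just placeholders):

# Sketch of the workaround: only wrap with BackgroundGenerator when we are
# NOT running under torch.distributed / DDP. Dataset and arguments are placeholders.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from prefetch_generator import BackgroundGenerator

dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4)

if dist.is_available() and dist.is_initialized():
    batches = iter(loader)                        # plain DataLoader iterator under DDP
else:
    batches = BackgroundGenerator(iter(loader))   # background prefetching otherwise

for images, labels in batches:
    pass  # training step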

ppwwyyxx commented 3 years ago

the DataLoader in PyTorch creates a set of worker threads

Technically, no: it creates worker processes.

Until you empty all of them (by running, for example, 8 training iterations), none of the workers fetches new data

PyTorch does not do this.

I have two Colab notebooks here (based on the CIFAR10 example from the official tutorial): Here with the plain DataLoader and 2 workers: https://colab.research.google.com/drive/10wJIfCw5moPc-Yx9rSqWFEXkNceAOPpc Here with the additional prefetch_generator:

This is a flawed benchmark that doesn't actually show the importance of prefetching -- it runs fastest without any prefetching: with num_workers=0 and without BackgroundGenerator, it prints 150ms, faster than the results in either Colab notebook.

IgorSusmelj commented 3 years ago

A quick update on this one: PyTorch 1.7 introduced a configurable prefetching parameter (prefetch_factor) for the DataLoader: https://pytorch.org/docs/stable/data.html

I haven't done any benchmarking yet, but I can imagine that the integrated prefetching makes prefetch_generator obsolete for PyTorch.
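
For example (untested on my side; the dataset is just a placeholder), prefetching is now controlled via the prefetch_factor argument, which only applies when num_workers > 0:

# PyTorch >= 1.7: prefetch_factor batches are loaded in advance by each worker,
# i.e. num_workers * prefetch_factor batches are in flight. Requires num_workers > 0.
# The dataset below is just a placeholder.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,))))
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    prefetch_factor=4,  # default is 2
)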

DZ9 commented 3 years ago

I had a problem using BackgroundGenerator with PyTorch DistributedDataParallel (DDP). When I turned on both DDP and BackgroundGenerator and iterated over the dataloader, processes that are not rank 0 loaded something onto the rank-0 GPU. I solved this by turning off BackgroundGenerator when I use DDP.

I got exactly the same problem, but turning off BackgroundGenerator under DDP makes the data-loading phase much slower. Are there any better solutions for this?