Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Memory leaking when using large numpy array in Dataset #1761

Closed: mpaepper closed this issue 4 years ago

mpaepper commented 4 years ago

🐛 Bug

Thank you for the great library! While migrating a larger project, however, I am running into memory issues, so maybe someone can help me out.

I have a fairly complicated Dataset which loads a lot of data and buffers it in CPU RAM as a numpy array.

I train using ddp with num_workers = 6 in the DataLoader. Training crashes my machine because of a CPU memory overflow. It works with num_workers = 0, but the higher num_workers is, the higher the memory consumption.

I figured out that this is much worse when using a large numpy array in the Dataset rather than a PyTorch tensor. Unfortunately, I need numpy arrays, so is there anything I can do?
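
The reproduction repo linked below contains the full example; roughly, the two variants being compared look like this (a sketch with made-up toy sizes and class names, not the actual project code):

import numpy as np
import torch
from torch.utils.data import Dataset

class NumpyBackedDataset(Dataset):
    # Buffers everything in CPU RAM as one large numpy array.
    def __init__(self, n_samples=100000, n_features=512):
        self.data = np.random.rand(n_samples, n_features).astype(np.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

class TensorBackedDataset(Dataset):
    # Same data, but buffered as a PyTorch tensor.
    def __init__(self, n_samples=100000, n_features=512):
        self.data = torch.rand(n_samples, n_features)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

With num_workers > 0, the numpy-backed version is the one whose memory usage keeps growing.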

To Reproduce

I created a repository to reproduce this. It allows you to train a model on toy data using either a PyTorch tensor or a numpy array in the Dataset.

When running it with the PyTorch tensor, the same amount of data uses 5GB of RAM, while with numpy it uses more than 30GB. The higher num_workers is, the higher the RAM usage - it seems to leak when using numpy?

  1. Clone https://github.com/mpaepper/reproduce_pytorch_lightning_memory_issues
  2. Try the PyTorch tensor with: python minimal.py --num_workers 10
  3. Try the numpy array with: python minimal.py --numpy --num_workers 10
  4. Compare the huge difference in memory consumption

Code sample

https://github.com/mpaepper/reproduce_pytorch_lightning_memory_issues

Expected behavior

I would expect numpy arrays and PyTorch tensors to behave the same way when using num_workers > 0, i.e. memory consumption should be similar.

Environment

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution, great first issue!

bjmnbraun commented 4 years ago

I had a similar issue, and if I recall correctly, defining the environment variable COLAB_GPU forces PyTorch Lightning to use fork, which might prevent this Nx memory blowup.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L779
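
For anyone who wants to try this without editing Lightning: assuming the linked check only tests whether the variable is defined (I have not verified this on a GPU setup), setting it before the Trainer is created should be enough:

import os

# COLAB_GPU only needs to exist in the environment for the check linked above;
# equivalent to launching with: COLAB_GPU=1 python minimal.py --num_workers 10
os.environ.setdefault("COLAB_GPU", "1")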

mpaepper commented 4 years ago

Thank you for the answer, but it seems that option only works for TPU training? I am training on GPUs.

I tried it out anyway, but it didn't improve my situation. Any other pointers / ideas?

mpaepper commented 4 years ago

I tried to manually rewrite the PyTorch Lightning code to use fork instead of spawn, but then the error "Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" comes up:

Process Process-1:
Traceback (most recent call last):
  File "/home/xxx/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/xxx/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 345, in ddp_train
    torch.cuda.set_device(self.root_gpu)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 292, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/xxx/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Borda commented 4 years ago

might be similar to #1769

mpaepper commented 4 years ago

So for others running into this:

As a workaround, during __init__ I move everything from numpy to PyTorch tensors, so they are stored in RAM and shared memory works between the workers. When I use them, I convert them back from PyTorch to numpy (.detach().numpy()). However, this can fail when you store large amounts of data this way, because the operating system's open-file limit may not allow enough files to be open.
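
A minimal sketch of this workaround (class and argument names are just for illustration):

import numpy as np
import torch
from torch.utils.data import Dataset

class TensorBufferedDataset(Dataset):
    # Buffers the data as torch tensors so the DataLoader workers can share it,
    # but hands out numpy arrays as the rest of the pipeline expects.
    def __init__(self, data: np.ndarray):
        # Stored as a tensor, the buffer is shared with the worker processes via
        # torch's shared-memory mechanism instead of being copied per worker
        # (which is also why the file limit mentioned below comes into play).
        self.data = torch.from_numpy(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Convert back to numpy only at access time.
        return self.data[idx].detach().numpy()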

Check ulimit -n (it was 1024 for me).

Setting it to a higher limit with ulimit -n 9999 then fixed the error, and training works.

However, it still seems too slow. It's only half as fast as it was using Torchbearer before.

The more num_workers I use in the DataLoader, the slower the start of an epoch, similar to what is described in this issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/1884

williamFalcon commented 4 years ago

@mpaepper check again? This should be fixed on master now

mpaepper commented 4 years ago

Yes, thank you. It's resolved with the recent master additions :+1: