Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

TPU error: RAM full, page stopped responding and slower than GPU on google colab #1403

Closed OliverCWY closed 4 years ago

OliverCWY commented 4 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Open lightning_mnist_tpu.ipynb
  2. Run the code

Expected behavior

The code runs normally and faster than GPU.

Error

  1. The webpage stopped responding soon after running the trainer, on several devices such as a PC, a phone, and the Puffin browser, with RAM reaching 100% on the PC (on both GPU and TPU runtimes).
  2. Iteration speed for TPU calculations is ~30 it/s while iteration speed for GPU is >90 it/s.

Additional context

Running the demo notebook Lightning-demo.ipynb on TPU solved the first error, but the iteration speed is still slower on TPU, even with prepare_data added.
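
For reference, a minimal sketch of the kind of LightningModule the demo trains, with the dataset download moved into prepare_data as mentioned above; the class name and hyperparameters are illustrative, not the notebook's exact code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl


class MNISTClassifier(pl.LightningModule):
    """Illustrative single-layer MNIST model, not the notebook's exact code."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def prepare_data(self):
        # download once here so the TPU processes do not race to re-download MNIST
        datasets.MNIST("./data", train=True, download=True)

    def train_dataloader(self):
        ds = datasets.MNIST("./data", train=True, download=False,
                            transform=transforms.ToTensor())
        return DataLoader(ds, batch_size=64, num_workers=2)
```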

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution, great first issue!

Borda commented 4 years ago

@OliverCWY may you share a link to the notebook?

OliverCWY commented 4 years ago

> @OliverCWY may you share a link to the notebook?

Sorry, I did not make it clear that I was using the official TPU demo notebook: https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3

Borda commented 4 years ago

Well, for me it is failing with a TQDM error:

Exception in device=TPU:0: 'tqdm_notebook' object has no attribute 'leave'
  File "/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py", line 247, in close
    if self.leave:
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 505, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 850, in run_pretrain_routine
    self.val_progress_bar.close()
  File "/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py", line 247, in close
    if self.leave:
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
Exception in device=TPU:1: 'tqdm_notebook' object has no attribute 'leave'
Exception in device=TPU:7: 'tqdm_notebook' object has no attribute 'leave'

@OliverCWY may you share your error?

OliverCWY commented 4 years ago

@Borda There are no error messages for me, apart from a lot of warnings. The webpage stopped responding even when I set warnings.filterwarnings("ignore"). One possible reason is that the tqdm progress bar reloads on every update without freeing the memory, but the problem only exists in the TPU demo notebook. When I copy the code into the demo notebook (https://colab.research.google.com/drive/1IqfISTenqy50Fq8DafCmm8KfUf9JssJF), everything is fine except for the iteration speed.

utsavnandi commented 4 years ago

> Well, for me it is failing with a TQDM error:
>
> AttributeError: 'tqdm_notebook' object has no attribute 'leave'
>
> @OliverCWY may you share your error?

If you restart the runtime, the tqdm error should go away.

utsavnandi commented 4 years ago

I tried to profile an efficientnet_es model on CIFAR-10. It is taking ~500+ seconds for forward propagation but only ~47 seconds for backprop. Also, it is taking over 15 minutes to run 1 epoch, which doesn't seem right. This was measured over 6 epochs.
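
For what it's worth, here is a rough sketch of how such per-phase timing can be taken in plain PyTorch, with a stand-in model rather than the actual efficientnet_es; note that on TPU/XLA naive wall-clock timing like this mostly measures lazy graph tracing, since execution is deferred until the graph is materialized:

```python
import time

import torch
from torch import nn

# Stand-in model; the report above used an efficientnet_es, which is not defined here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()

x = torch.randn(128, 3, 32, 32)   # CIFAR-10-sized batch
y = torch.randint(0, 10, (128,))

start = time.time()
out = model(x)
loss = criterion(out, y)
fwd_s = time.time() - start       # "forward" time

start = time.time()
loss.backward()
bwd_s = time.time() - start       # "backward" time

# Caveat: on XLA/TPU these numbers can be misleading unless the graph is forced to
# execute (e.g. by reading loss.item()), because operations are traced lazily.
print(f"forward: {fwd_s:.3f}s, backward: {bwd_s:.3f}s")
```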

OliverCWY commented 4 years ago

I used lightning on colab for other models and they all had this problem.

OliverCWY commented 4 years ago

> I used lightning on colab for other models and they all had this problem.

For GPU as well.

williamFalcon commented 4 years ago

Colab notebooks are slow when the progress bar refresh rate is low (i.e. the bar redraws on every batch). Set the tqdm refresh rate to 10 or something greater than 1.
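
For example (a hedged sketch; the argument names follow the 0.7.x-era Trainer API and have since been renamed):

```python
import pytorch_lightning as pl

# Redraw the progress bar every 10 batches instead of every batch so the Colab
# frontend is not flooded with tqdm updates. num_tpu_cores was the 0.7.x-era flag;
# later releases renamed these arguments.
trainer = pl.Trainer(
    num_tpu_cores=8,
    progress_bar_refresh_rate=10,
    max_epochs=3,
)
```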

williamFalcon commented 4 years ago

Can you share the Colabs? We have speed benchmarks in CI, and Lightning is a few seconds slower than pure PyTorch because of the loggers and the tqdm bar, but not by much (i.e. if you added TensorBoard to your code it would be just as slow).

This is likely because you're not putting something on the GPU, or something like that.

williamFalcon commented 4 years ago

Just tested on colab... it works fine https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3#scrollTo=kr8cql-aaKnC

OliverCWY commented 4 years ago

> Just tested on colab... it works fine https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3#scrollTo=kr8cql-aaKnC

The memory used by the iframes in Google Colab reaches 600+ MB after the 29th epoch and continues to increase, so setting the refresh frequency probably does not actually address the problem.

And I am actually referring to the speed of different devices with pytorch-lightning: running on TPU is significantly slower than running on GPU or even CPU.

Using a single layer of nn.Linear: TPU: TPU_0, CPU: CPU_0, GPU (P100): P100_0
With more layers: TPU: TPU_1, CPU: CPU_1, GPU (P100): P100_1
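
For context, a sketch of how such a cross-device comparison is typically run; this reuses the illustrative MNISTClassifier sketched earlier in the thread, and the Trainer flags follow the 0.7.x-era API rather than the reporter's exact benchmark code:

```python
import pytorch_lightning as pl

# The same module is fitted on each backend and the it/s numbers are read off the
# progress bar. MNISTClassifier is the illustrative module sketched earlier.
model = MNISTClassifier()

trainer_cpu = pl.Trainer(max_epochs=3)                   # CPU baseline
trainer_gpu = pl.Trainer(gpus=1, max_epochs=3)           # single GPU (e.g. P100)
trainer_tpu = pl.Trainer(num_tpu_cores=8, max_epochs=3)  # 8 TPU cores (0.7.x flag)

trainer_tpu.fit(model)  # run one trainer per benchmark
```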

williamFalcon commented 4 years ago

Speed fixed in 0.7.3. The RAM issue is a Colab issue, not a PL issue. Crash the RAM using cell 1 or upgrade to Pro.

OliverCWY commented 4 years ago

> Speed fixed in 0.7.3. The RAM issue is a Colab issue, not a PL issue. Crash the RAM using cell 1 or upgrade to Pro.

@williamFalcon Sorry, but I don't think the problem is solved. Just tested on Colab:

version: CPU_0, TPU_0

By "RAM full" I am referring to the memory used by my browser running Colab, so it is not a problem with the backend.

Thank you for your patience.

hfwittmann commented 4 years ago

I am afraid this does not work for me either, hence I also don't think that the problem is solved.

I have tried all the versions given in the notebook.

Additionally, I have also tried it with version 20200516. That version is used in the official Colab TPU MNIST example notebook, which does not use pytorch-lightning; a reference is below in NB2.

The summary of the results is:

"1.5" : wont run at all "20200325" hangs in the final epoch (with 10 epochs in the 10th, with 3 epochs in the 3rd) "nightly" crashes with : Exception: process 0 terminated with signal SIGABRT

"20200516" hangs after one epoch

I have tried this several times over the last few days. With the exception of the nightly all these results have always been the same.

NB1: Locally I am on a Mac, not sure whether this makes a difference.

My terminal gives this:

$ uname -a
Darwin osx-lhind6957 18.7.0 Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64 x86_64

NB2: The links for that official Colab TPU MNIST example notebook, which does not use pytorch-lightning, are here: https://cloud.google.com/tpu/docs/colabs?hl=de

https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/mnist-training.ipynb?authuser=1#scrollTo=sPJVqAKyml5W

(The official notebook, which does not use pytorch-lightning, has no problem and runs through with 20200516.)