Closed OliverCWY closed 4 years ago
Hi! Thanks for your contribution, great first issue!
@OliverCWY may you share a link to the notebook?
Sorry, I did not make it clear that I was using the official TPU demo notebook: https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3
Well, for me it is failing with a tqdm error:
Exception in device=TPU:0: 'tqdm_notebook' object has no attribute 'leave'
File "/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py", line 247, in close
if self.leave:
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 505, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 850, in run_pretrain_routine
self.val_progress_bar.close()
File "/usr/local/lib/python3.6/dist-packages/tqdm/notebook.py", line 247, in close
if self.leave:
AttributeError: 'tqdm_notebook' object has no attribute 'leave'
Exception in device=TPU:1: 'tqdm_notebook' object has no attribute 'leave'
Exception in device=TPU:7: 'tqdm_notebook' object has no attribute 'leave'
@OliverCWY could you share your error?
@Borda There are no error messages for me, apart from a lot of warnings. The webpage stopped responding even when I set warnings.filterwarnings("ignore").
One possible reason is that the tqdm progress bar reloads on every update without freeing memory, but the problem only exists in the TPU demo notebook. When I copy the code into the demo notebook (https://colab.research.google.com/drive/1IqfISTenqy50Fq8DafCmm8KfUf9JssJF), everything is fine except for the iteration speed.
If you restart the runtime, tqdm error should go away.
I tried to profile an efficientnet_es model on CIFAR-10. It takes ~500+ seconds for forward propagation but only ~47 seconds for backprop. It also takes over 15 minutes to run one epoch, which doesn't seem right. This was over 6 epochs.
I used lightning on colab for other models and they all had this problem.
For GPU as well
Colab is slow when the progress bar refreshes too often. Set the tqdm refresh rate to 10 or something greater than 1.
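The underlying idea is to redraw the bar only every N steps instead of every step, so the notebook front end has far fewer DOM updates to process. A minimal stdlib sketch of that throttling (the refresh_rate name mirrors Lightning's progress-bar refresh setting, whose exact argument name varies by version):

```python
import sys

def run_steps(total, refresh_rate=10, out=sys.stdout):
    """Run `total` steps, redrawing a text progress bar only every
    `refresh_rate` steps (and once at the very last step)."""
    redraws = 0
    for step in range(1, total + 1):
        if step % refresh_rate == 0 or step == total:
            out.write(f"\rstep {step}/{total}")
            out.flush()
            redraws += 1
    out.write("\n")
    return redraws

# With refresh_rate=10 over 100 steps, the bar is redrawn only 10 times
# instead of 100 -- much less rendering work for Colab's front end.
n = run_steps(100, refresh_rate=10)
```

With refresh_rate=1 every step triggers a redraw, which is where the notebook slowdown comes from.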
can you share the colabs? we have speed benchmarks in CI, and lightning is a few seconds slower than pure pytorch because of the loggers and tqdm bar but not slower by much (ie: if you added tensorboard to your code it would be as slow).
this is likely because you’re not putting something on GPU or something like that
Just tested on colab... it works fine https://colab.research.google.com/drive/1-_LKx4HwAxl5M6xPJmqAAu444LTDQoa3#scrollTo=kr8cql-aaKnC
The memory used by the iframes in Google Colab reaches 600+ MB after the 29th epoch and continues to increase, so setting the refresh frequency probably does not actually address the problem.
And I am actually referring to the speed using different devices with pytorch-lightning. Running on TPU is significantly slower than running on GPU or even CPU.
Using a single layer of nn.Linear, and again with more layers, on TPU, CPU, and GPU (P100). (Timing screenshots did not survive extraction.)
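Per-device comparisons like the one above are easy to reproduce with a small wall-clock timing helper. A stdlib sketch (timed and train_one_epoch are illustrative names; the loop body is a stand-in for the real per-device training loop):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start

def train_one_epoch(n_steps=10_000):
    # Stand-in workload; replace with the actual per-device epoch.
    total = 0
    for i in range(n_steps):
        total += i * i
    return total

results = {}
for device in ("run-a", "run-b"):  # stand-ins for TPU / CPU / GPU runs
    with timed(device, results):
        train_one_epoch()
# `results` now maps each label to its epoch time in seconds.
```

Reporting numbers from a harness like this (rather than screenshots) makes the TPU-vs-GPU comparison easier for maintainers to verify.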
Speed fixed in 0.7.3. The RAM issue is a Colab issue, not a PL issue. Crash the RAM using cell 1 or upgrade to Pro.
@williamFalcon Sorry but I don't think the problem is solved. Just tested on colab:
When saying "RAM full" I am referring to the memory used by my browser running Colab, so it is not a problem with the backend.
Thank you for your patience.
I am afraid this does not work for me either, hence I also don't think that the problem is solved.
I have tried all the versions, given in the notebook.
Additionally, I have also tried version 20200516, the version used in the official Colab TPU MNIST example notebook, which does not use pytorch-lightning. A reference is below in NB2.
The summary of the results is:
"1.5": won't run at all
"20200325": hangs in the final epoch (with 10 epochs, in the 10th; with 3 epochs, in the 3rd)
"nightly": crashes with: Exception: process 0 terminated with signal SIGABRT
"20200516": hangs after one epoch
I have tried this several times over the last few days. With the exception of the nightly, all these results have always been the same.
NB1: Locally I am on a Mac, not sure whether this makes a difference.
My terminal gives this
uname -a Darwin osx-lhind6957 18.7.0 Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64 x86_64
NB2: The links for that official colab TPU MNIST example notebook which does not use pytorch lightning are here: https://cloud.google.com/tpu/docs/colabs?hl=de
(The official notebook which does not use pytorch lightning has no problem and runs through with 20200516)
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The code runs normally and faster than GPU.
Error
Additional context
Running the demo notebook Lightning-demo.ipynb on TPU solved the first error, but the iteration speed is still slower on TPU, with prepare_data added.