Logging issue on TPU VM Pod

tgisaturday commented 3 years ago

🐛 Bug

Please reproduce using the BoringModel

Modified BoringModel.ipynb to .py, add tpu_cores=8 to Trainer. While running code on Google Cloud TPU VM Pod v3-8 successfully runs, process crashes on Google Cloud TPU VM Pod v3-32 (not Pod Node).

To Reproduce

Modified BoringModel.ipynb to .py, add tpu_cores=8 to Trainer (for TPU support).

Expected behavior

Run without crash on v3-32.

Environment

Note: Bugs with code are solved faster ! Colab Notebook should be made public !

IDE: Please, use our python bug_report_model.py template.
Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

TPU VM Pod Software: v2-alpha

PyTorch Version (e.g., 1.0): 1.8.1
OS (e.g., Linux): Ubuntu
How you installed PyTorch (conda, pip, source): bulit-in image in v2-alpha

Additional context

I've been also testing simple MNIST GAN code and same problem appears. My custom code crashes when Trainer.fit() automatically tries to save checkpoints with trainer.save_checkpoint. Here are test codes that I've used. testcode.zip

kaushikb11 commented 3 years ago

Hey there! As you are trying to run on a TPU Pod, you would need to run

python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py

tgisaturday commented 3 years ago

Hey there! As you are trying to run on a TPU Pod, you would need to run
python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py

@kaushikb11 I've been running the code in distributed mode. This doesn't help.

kaushikb11 commented 3 years ago

@tgisaturday Could you provide more details? Lightning Version? Minimal example to reproduce the issue?

kaushikb11 commented 3 years ago

And also, where it seems to be failing.

tgisaturday commented 3 years ago

@kaushikb11 Here are test codes that I'm using: testcode.zip

I'm using pytorch-lightning 1.3.5.

python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name  -- python3 gan_test_pod.py 

python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 boring.py

I'm not sure where the boring.py fails, but my personal gan code seems to fail when the Trainer automatically tries to save checkpoints(trainer.save_checkpoint.py).

kaushikb11 commented 3 years ago

@tgisaturday

It shoulld be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py

Boring script should be working.

tgisaturday commented 3 years ago

@kaushikb11 Have you ever tried with TPU VM v3-32? Boring script keeps throwing the error.

2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument(try 48 which is the number of cpus on this machine) in theDataLoaderinit to improve performance. 2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs) 2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of thenum_workersargument (try 48 which is the number of cpus on this machine) in the DataLoader init to improve performance. 2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, kwargs) Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 15.80it/s, loss=1.79, v_num=0] 2021-06-11 00:34:32 10.164.0.13 [3] Exception in device=TPU:24: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt' 2021-06-11 00:34:32 10.164.0.22 [1] Exception in device=TPU:8: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt' 2021-06-11 00:34:32 10.164.0.8 [2] Exception in device=TPU:16: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt' 2021-06-11 00:34:32 10.164.0.13 [3] Traceback (most recent call last): 2021-06-11 00:34:32 10.164.0.22 [1] Traceback (most recent call last): 2021-06-11 00:34:32 10.164.0.8 [2] Traceback (most recent call last): 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn 2021-06-11 00:34:32 10.164.0.13 [3] _start_fn(index, pf_cfg, fn, args) 2021-06-11 00:34:32 10.164.0.22 [1] _start_fn(index, pf_cfg, fn, args) 2021-06-11 00:34:32 10.164.0.8 [2] _start_fn(index, pf_cfg, fn, args) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn 2021-06-11 00:34:32 10.164.0.13 [3] fn(gindex, args) 2021-06-11 00:34:32 10.164.0.22 [1] fn(gindex, args) 2021-06-11 00:34:32 10.164.0.8 [2] fn(gindex, args) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process 2021-06-11 00:34:32 10.164.0.13 [3] results = trainer.run_stage() 2021-06-11 00:34:32 10.164.0.22 [1] results = trainer.run_stage() 2021-06-11 00:34:32 10.164.0.8 [2] results = trainer.run_stage() 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage 2021-06-11 00:34:32 10.164.0.13 [3] return self.run_train() 2021-06-11 00:34:32 10.164.0.22 [1] return self.run_train() 2021-06-11 00:34:32 10.164.0.8 [2] return self.run_train() 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train 2021-06-11 00:34:32 10.164.0.13 [3] self.train_loop.run_training_epoch() 2021-06-11 00:34:32 10.164.0.22 [1] self.train_loop.run_training_epoch() 2021-06-11 00:34:32 10.164.0.8 [2] self.train_loop.run_training_epoch() 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch 2021-06-11 00:34:32 10.164.0.13 [3] self.trainer.run_evaluation(on_epoch=True) 2021-06-11 00:34:32 10.164.0.22 [1] self.trainer.run_evaluation(on_epoch=True) 2021-06-11 00:34:32 10.164.0.8 [2] self.trainer.run_evaluation(on_epoch=True) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation 2021-06-11 00:34:32 10.164.0.13 [3] self.evaluation_loop.on_evaluation_end() 2021-06-11 00:34:32 10.164.0.22 [1] self.evaluation_loop.on_evaluation_end() 2021-06-11 00:34:32 10.164.0.8 [2] self.evaluation_loop.on_evaluation_end() 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end 2021-06-11 00:34:32 10.164.0.13 [3] self.trainer.call_hook('on_validation_end', args, kwargs) 2021-06-11 00:34:32 10.164.0.22 [1] self.trainer.call_hook('on_validation_end', *args, kwargs) 2021-06-11 00:34:32 10.164.0.8 [2] self.trainer.call_hook('on_validation_end', *args, *kwargs) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook 2021-06-11 00:34:32 10.164.0.13 [3] trainer_hook(args, kwargs) 2021-06-11 00:34:32 10.164.0.22 [1] trainer_hook(*args, kwargs) 2021-06-11 00:34:32 10.164.0.8 [2] trainer_hook(*args, *kwargs) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end 2021-06-11 00:34:32 10.164.0.13 [3] callback.on_validation_end(self, self.lightning_module) 2021-06-11 00:34:32 10.164.0.22 [1] callback.on_validation_end(self, self.lightning_module) 2021-06-11 00:34:32 10.164.0.8 [2] callback.on_validation_end(self, self.lightning_module) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end 2021-06-11 00:34:32 10.164.0.13 [3] self.save_checkpoint(trainer) 2021-06-11 00:34:32 10.164.0.22 [1] self.save_checkpoint(trainer) 2021-06-11 00:34:32 10.164.0.8 [2] self.save_checkpoint(trainer) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint 2021-06-11 00:34:32 10.164.0.13 [3] self._save_none_monitor_checkpoint(trainer, monitor_candidates) 2021-06-11 00:34:32 10.164.0.22 [1] self._save_none_monitor_checkpoint(trainer, monitor_candidates) 2021-06-11 00:34:32 10.164.0.8 [2] self._save_none_monitor_checkpoint(trainer, monitor_candidates) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint 2021-06-11 00:34:32 10.164.0.13 [3] self._save_model(trainer, filepath) 2021-06-11 00:34:32 10.164.0.22 [1] self._save_model(trainer, filepath) 2021-06-11 00:34:32 10.164.0.8 [2] self._save_model(trainer, filepath) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model 2021-06-11 00:34:32 10.164.0.13 [3] self._do_save(trainer, filepath) 2021-06-11 00:34:32 10.164.0.22 [1] self._do_save(trainer, filepath) 2021-06-11 00:34:32 10.164.0.8 [2] self._do_save(trainer, filepath) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save 2021-06-11 00:34:32 10.164.0.13 [3] trainer.save_checkpoint(filepath, self.save_weights_only) 2021-06-11 00:34:32 10.164.0.22 [1] trainer.save_checkpoint(filepath, self.save_weights_only) 2021-06-11 00:34:32 10.164.0.8 [2] trainer.save_checkpoint(filepath, self.save_weights_only) 2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint 2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint 2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint 2021-06-11 00:34:32 10.164.0.13 [3] self.checkpoint_connector.save_checkpoint(filepath, weights_only) 2021-06-11 00:34:32 10.164.0.22 [1] self.checkpoint_connector.save_checkpoint(filepath, weights_only) 2021-06-11 00:34:32 10.164.0.8 [2] self.checkpoint_connector.save_checkpoint(filepath, weights_only) 2021-06-11 00:34:32 10.164.0.13 [3] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt' 2021-06-11 00:34:32 10.164.0.22 [1] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt' 2021-06-11 00:34:32 10.164.0.8 [2] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt' 2021-06-11 00:34:32 10.164.0.13 [3] https://symbolize.stripped_domain/r/?trace=7f358ea885ce,7f358e9ac20f,7f33a145ee81,7f3396fa9692,7f3396f984ea,7f3396f38b4b,7f35362c7e4a,515bd6f,2&map=c5ea6dcea9ec73900e238cf37efee14d75fd7749:7f339433b000-7f33a3ca9e28 2021-06-11 00:34:32 10.164.0.13 [3] SIGTERM received by PID 10397 (TID 10397) on cpu 15 from PID 9809; stack trace: ...

tgisaturday commented 3 years ago

@tgisaturday

It shoulld be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py

Boring script should be working.

@kaushikb11 There was some typo in running command for boring script.

kaushikb11 commented 3 years ago

@tgisaturday Could you try the Lightning master?

tgisaturday commented 3 years ago

@kaushikb11 I'll try right away. I found out that boring script successfully runs with 'checkpoint_callback=False' flag.

kaushikb11 commented 3 years ago

@tgisaturday Awesome! It should be resolved. Also, if you face any more issues, feel free to ping me on Lightning Slack!

tgisaturday commented 3 years ago

@kaushikb11 Using lightning master (1.4.0dev) and saving checkpoint keeps throwing errors...

I've first tried to save checkpoints in current working directory. It throws 'file exists error'
Next, I've tried to save checkpoints in gcs bucket. It also throws 'file exists error'
I've assigned different root dir for each TPU worker by using os.environ["CLOUD_TPU_TASK_ID"] and 'file exists error' was resolved but the process still crashes with 'socket closed(14)'

I guess there's some problems with DDP accelerator when combined with TPU VM. I'm not sure if this is internal TPU VM problem or pytorch-lightning problem.

kaushikb11 commented 3 years ago

@tgisaturday I had recently trained minGPT on TPU VM Pod, and it worked as expected.

It throws 'file exists error'

Could you try deleting the logs and train it again? I had recently fixed a logging issue for GCS bucket.

Also, I am taking a stab at this issue in some time. Will update you regarding this. We will resolve this!! :)

tgisaturday commented 3 years ago

@tgisaturday I had recently trained minGPT on TPU VM Pod, and it worked as expected.

It throws 'file exists error'

Could you try deleting the logs and train it again? I had recently fixed a logging issue for GCS bucket.

Also, I am taking a stab at this issue in some time. Will update you regarding this. We will resolve this!! :)

@kaushikb11 Thank you for spending your time regarding this issue. I'll try training minGPT also.

tgisaturday commented 3 years ago

@kaushikb11 I removed logging[self.log(..)] from boring script and save_checkpoint works!

It's seems that logging is causing the problem.

kaushikb11 commented 3 years ago

@tgisaturday What were you logging?

tgisaturday commented 3 years ago

@tgisaturday What were you logging?

commented out every self.log from boring script.

tgisaturday commented 3 years ago

@kaushikb11 I've been refactoring taming-transformers to run the code on TPU VM.

Here's my code. taming-transformers-tpu

For easier debugging, I've also added fake_data feature. To start training with fake data, run the code by:

pip install -r requirements.txt

python main.py --use_tpus --fake_data

The code is working properly on single TPU Node or GPUs but seems to fall into deadlock on the initial stage of training on TPU VM.

76.7 M Trainable params 14.7 M Non-trainable params 91.5 M Total params 182.917 Total estimated model params size (MB) Epoch 0: 0%| | 0/456 [00:00<?, ?it/s]

Nothing goes further from here. Got any comments or suggestions? I'm not sure if this is TPU VM's internal problem or lightning's.

kaushikb11 commented 3 years ago

@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?

tgisaturday commented 3 years ago

@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?

Regarding single TPU Node, I've meant older way of using 8 core TPU (assign CPU vm and pair with TPU), not using newly released TPU VM.

Is there any way I can debug my code deeper than Trainer.fit? When I press ctrl+c, my code get's interupted somewhere around tpu spawn.

kaushikb11 commented 3 years ago

@tgisaturday Got it! Let me give it a try today.

tgisaturday commented 3 years ago

@tgisaturday Got it! Let me give it a try today.

I'm also closely interacting with GCP- side engineers. Please let me know if it is out of lightning's scope.

kaushikb11 commented 3 years ago

Going through your script. Also, note the effective batch size is batch size * 8 for 8 cores.

Getting this error when I did the first script run

  File "/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 223, in log_metrics
    raise ValueError(m) from ex
ValueError:
 you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

tgisaturday commented 3 years ago

@kaushikb11 I've resolved single tpu vm issue with the walkaround suggested in #8183. While everything is okay with single TPU VM, I'm still trying to solve logging issue with TPU VM Pod. With my revised taming-transformers-tpu code, the progress bar doesn't appear at all. Since the trainer itself works this seems to be progress bar logging issue with distributed training on tpu vm pod. Any suggestions on where to start looking with pytorch-lightning repo? Ping me on lightning slack if you need to.

kaushikb11 commented 3 years ago

@tgisaturday Yup, I took a look into it. The issue is that the progress bar only appears after it's finished. It's not exactly Lightning but tqdm specific, but we definitely need to figure it out.

Here, you could see how pytorch_xla.distributed streams logs from different vms to the master worker. https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_dist.py#L140

Guessing, it doesn't play well with tqdm.

Sample script to reproduce the issue & to fix the issue https://github.com/kaushikb11/minGPT/blob/master/tqdm_test.py

Would appreciate it if you could give it a look as well on how we could fix it.

kaushikb11 commented 3 years ago

Closing this issue, as it has been resolved by #8258 :)

Lightning-AI / pytorch-lightning