Closed tgisaturday closed 3 years ago
Hey there! As you are trying to run on a TPU Pod, you would need to run
python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py
@kaushikb11 I've been running the code in distributed mode. This doesn't help.
@tgisaturday Could you provide more details? Lightning Version? Minimal example to reproduce the issue?
And also, where it seems to be failing.
@kaushikb11 Here are test codes that I'm using: testcode.zip
I'm using pytorch-lightning 1.3.5.
python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 gan_test_pod.py
python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 boring.py
I'm not sure where boring.py fails, but my personal GAN code seems to fail when the Trainer automatically tries to save checkpoints (trainer.save_checkpoint).
@tgisaturday
It should be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py
The boring script should work.
@kaushikb11 Have you ever tried with a TPU VM v3-32? The boring script keeps throwing this error:
2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs)
Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 15.80it/s, loss=1.79, v_num=0]
2021-06-11 00:34:32 10.164.0.13 [3] Exception in device=TPU:24: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.22 [1] Exception in device=TPU:8: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.8 [2] Exception in device=TPU:16: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.13 [3] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.13 [3]     _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.13 [3]     fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.13 [3]     results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.13 [3]     return self.run_train()
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.13 [3]     self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.13 [3]     self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.13 [3]     self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.13 [3]     self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.13 [3]     trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.13 [3]     callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.13 [3]     self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3]     self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3]     self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.13 [3]     self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.13 [3]     trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.13 [3]   File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3]     self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.13 [3] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
(workers [1] on 10.164.0.22 and [2] on 10.164.0.8 print identical tracebacks)
2021-06-11 00:34:32 10.164.0.13 [3] https://symbolize.stripped_domain/r/?trace=7f358ea885ce,7f358e9ac20f,7f33a145ee81,7f3396fa9692,7f3396f984ea,7f3396f38b4b,7f35362c7e4a,515bd6f,2&map=c5ea6dcea9ec73900e238cf37efee14d75fd7749:7f339433b000-7f33a3ca9e28
2021-06-11 00:34:32 10.164.0.13 [3] SIGTERM received by PID 10397 (TID 10397) on cpu 15 from PID 9809; stack trace: ...
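Worth noting: every failing rank in the log is a non-zero worker on a different VM, which is consistent with the `lightning_logs/.../checkpoints` directory existing only on worker 0's local disk. As a hedged illustration of that failure mode (a sketch, not Lightning's actual implementation), guarding filesystem writes by global rank avoids it:

```python
import os
import tempfile

# Hypothetical sketch, not Lightning's actual code: on a multi-VM TPU pod each
# worker has its own local filesystem, so a checkpoint directory created by
# worker 0 does not exist on the other VMs. Guarding disk writes by global
# rank avoids the FileNotFoundError seen in the traceback above.
def save_checkpoint(state: bytes, filepath: str, rank: int) -> bool:
    """Write `state` to `filepath`, but only on the global rank-0 worker."""
    if rank != 0:
        return False  # non-zero ranks skip disk I/O entirely
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, "wb") as f:
        f.write(state)
    return True

root = tempfile.mkdtemp()
ckpt = os.path.join(root, "checkpoints", "epoch=0-step=0.ckpt")
save_checkpoint(b"weights", ckpt, rank=0)  # rank 0 creates the dir and file
save_checkpoint(b"weights", ckpt, rank=3)  # rank 3 is a no-op, no crash
```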
@kaushikb11 There was a typo in my command for running the boring script.
@tgisaturday Could you try the Lightning master?
@kaushikb11 I'll try it right away. I also found that the boring script runs successfully with the checkpoint_callback=False flag.
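For context, checkpoint_callback=False disables Lightning's automatic ModelCheckpoint callback entirely, which is why the crash disappears. A minimal config fragment (assuming pytorch-lightning 1.3.x, where checkpoint_callback is still a Trainer argument; `model` stands in for any LightningModule defined elsewhere):

```python
import pytorch_lightning as pl

# Workaround sketch: skip automatic checkpoint saving on the TPU VM Pod.
trainer = pl.Trainer(
    tpu_cores=8,
    checkpoint_callback=False,  # no ModelCheckpoint, so no file I/O on workers
)
trainer.fit(model)  # `model` is a LightningModule defined elsewhere
```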
@tgisaturday Awesome! It should be resolved. Also, if you face any more issues, feel free to ping me on Lightning Slack!
@kaushikb11 Using Lightning master (1.4.0dev), saving checkpoints still throws errors...
I guess there's some problem with the DDP accelerator when combined with the TPU VM. I'm not sure if this is an internal TPU VM problem or a pytorch-lightning problem.
@tgisaturday I had recently trained minGPT on a TPU VM Pod, and it worked as expected.
It throws a 'file exists' error. Could you try deleting the logs and training again? I recently fixed a logging issue for GCS buckets.
Also, I'll take a stab at this issue shortly and will update you. We will resolve this!! :)
@kaushikb11 Thank you for spending your time on this issue. I'll also try training minGPT.
@kaushikb11 I removed the logging [self.log(...)] from the boring script and save_checkpoint works!
It seems that logging is causing the problem.
@tgisaturday What were you logging?
I commented out every self.log call from the boring script.
@kaushikb11 I've been refactoring taming-transformers to run the code on TPU VM.
Here's my code: taming-transformers-tpu
For easier debugging, I've also added a fake_data feature. To start training with fake data, run:
pip install -r requirements.txt
python main.py --use_tpus --fake_data
The code works properly on a single TPU Node or on GPUs, but it seems to deadlock at the initial stage of training on the TPU VM Pod.
76.7 M    Trainable params
14.7 M    Non-trainable params
91.5 M    Total params
182.917   Total estimated model params size (MB)
Epoch 0:   0%|          | 0/456 [00:00<?, ?it/s]
Nothing goes further from here. Any comments or suggestions? I'm not sure if this is an internal TPU VM problem or Lightning's.
@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?
Regarding the single TPU Node: I meant the older way of using an 8-core TPU (assigning a CPU VM and pairing it with a TPU), not the newly released TPU VM.
Is there any way I can debug my code deeper than Trainer.fit? When I press Ctrl+C, my code gets interrupted somewhere around the TPU spawn.
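Not something used in this thread, but for deadlocks like this the standard library's faulthandler module can dump every thread's stack from inside the training script (called directly, or scheduled with faulthandler.dump_traceback_later before Trainer.fit), which shows where each process is stuck. A minimal sketch:

```python
import faulthandler
import os
import tempfile

# Dump the current stack of every thread to a file. In a hung training run
# you would point this at a per-worker log file (scheduled before the hang)
# and inspect it after the process stalls.
path = os.path.join(tempfile.mkdtemp(), "stacks.txt")
with open(path, "w") as f:
    faulthandler.dump_traceback(file=f)

with open(path) as f:
    trace = f.read()

# The dump lists each thread with 'File "...", line N in <function>' frames.
print("line" in trace)
```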
@tgisaturday Got it! Let me give it a try today.
I'm also closely interacting with GCP-side engineers. Please let me know if this is out of Lightning's scope.
Going through your script. Also, note that the effective batch size is batch_size * 8 for 8 cores.
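A quick arithmetic check of that scaling (each core runs its own copy of the dataloader with the configured per-device batch size):

```python
# Effective (global) batch size on TPUs: each core consumes
# `per_device_batch_size` samples per step, so the global batch
# scales linearly with the number of cores.
def effective_batch_size(per_device_batch_size: int, num_cores: int) -> int:
    return per_device_batch_size * num_cores

print(effective_batch_size(32, 8))   # v3-8: prints 256
print(effective_batch_size(32, 32))  # v3-32 pod: prints 1024
```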
I'm getting this error on my first run of the script:
File "/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 223, in log_metrics
raise ValueError(m) from ex
ValueError:
you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.
@kaushikb11 I've resolved the single TPU VM issue with the workaround suggested in #8183. While everything is okay on a single TPU VM, I'm still trying to solve the logging issue on a TPU VM Pod. With my revised taming-transformers-tpu code, the progress bar doesn't appear at all. Since the trainer itself works, this seems to be a progress bar logging issue with distributed training on a TPU VM Pod. Any suggestions on where to start looking in the pytorch-lightning repo? Ping me on Lightning Slack if you need to.
@tgisaturday Yup, I took a look into it. The issue is that the progress bar only appears after it's finished. It's not exactly a Lightning issue but a tqdm-specific one, but we definitely need to figure it out.
Here you can see how pytorch_xla.distributed streams logs from the different VMs to the master worker:
https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_dist.py#L140
My guess is that it doesn't play well with tqdm.
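The interaction can be simulated without tqdm or a TPU: tqdm redraws its bar in place with carriage returns and only writes a newline when the bar closes, so a relay that forwards complete lines (splitting on '\n', as a line-based log forwarder would) sees nothing until the very end. A self-contained sketch of that buffering effect:

```python
import io

# Simulated progress-bar output: three in-place redraws via '\r',
# with a real newline only when the bar finishes.
bar_output = "\r 33%|###    | 1/3\r 67%|###### | 2/3\r100%|#######| 3/3\n"

# A line-based relay only yields chunks terminated by '\n', so the whole
# bar arrives as a single "line", all at once, at the very end.
stream = io.StringIO(bar_output, newline="\n")
lines = stream.readlines()
print(len(lines))  # 1

# Splitting on '\r' instead recovers the individual redraws.
redraws = [chunk for chunk in bar_output.split("\r") if chunk]
print(len(redraws))  # 3
```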
Here's a sample script to reproduce the issue & a fix: https://github.com/kaushikb11/minGPT/blob/master/tqdm_test.py
Would appreciate it if you could take a look as well at how we could fix it.
Closing this issue, as it has been resolved by #8258 :)
🐛 Bug
Please reproduce using the BoringModel
I modified BoringModel.ipynb to a .py file and added tpu_cores=8 to the Trainer. The code runs successfully on a Google Cloud TPU VM v3-8, but the process crashes on a Google Cloud TPU VM Pod v3-32 (not a Pod Node).
To Reproduce
Modified BoringModel.ipynb to .py, added tpu_cores=8 to the Trainer (for TPU support).
Expected behavior
Run without crash on v3-32.
Environment
Note:
- Bugs with code are solved faster!
- The Colab Notebook should be made public!
- IDE: Please use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually). You can get the script and run it with:
- TPU VM Pod Software: v2-alpha
- How you installed PyTorch (conda, pip, source): built-in image in v2-alpha

Additional context
I've also been testing a simple MNIST GAN code, and the same problem appears. My custom code crashes when Trainer.fit() automatically tries to save checkpoints with trainer.save_checkpoint. Here are the test codes that I've used: testcode.zip