Closed: ibeltagy closed this issue 4 years ago.
@lezwon mind having a look?
Will do :]
@ibeltagy mind checking it with the latest master?
@Borda, couldn't test it because it appears that master is broken. It runs for a few steps, then crashes with the following error. The only thing I changed was switching from release v0.8.5 to master; everything else is the same.
Epoch 1: 1%|█▏ | 10/1066 [02:07<3:45:12, 12.80s/it, loss=1.645, v_num=0]
Traceback (most recent call last):
  File "scripts/pretrain.py", line 488, in <module>
    main(args)
  File "scripts/pretrain.py", line 482, in main
    trainer.fit(pretrainer)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/trainer/states.py", line 34, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1059, in fit
    self.accelerator_backend.train(model)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/accelerators/tpu_backend.py", line 87, in train
    start_method=self.start_method
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 387, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/accelerators/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/trainer/distrib_data_parallel.py", line 417, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
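For what it's worth, the crash comes from calling .lower() on self.distributed_backend while it is None (which seems to be its value when only TPU options are passed). A minimal sketch of a None-safe version of that check; the helper name is made up and this is illustrative only, not the actual patch:

def backend_name(distributed_backend):
    # Hypothetical helper: fall back to an empty string so .lower() is never
    # called on None when no distributed backend was set explicitly.
    return (distributed_backend or '').lower()

# The check on the last traceback line would then read:
# if backend_name(self.distributed_backend) not in ['ddp_spawn', 'ddp_cpu', 'tpu']: ...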
Trying to reproduce with Colab tests, but I have a problem installing XLA. @zcain117, any ideas? https://colab.research.google.com/drive/15E3oo3vPsSvLVufU4I6AK2thcXcg2mdW#scrollTo=BHBz1_AnamN_
@lezwon mind taking a look?
I haven't been able to reproduce this issue. Maybe if @ibeltagy could share a notebook I could look into it. :)
Are you testing with num_tpu_cores=8? Do you get a single file under the tf directory?
The test is here, but it is not counting files: https://github.com/PyTorchLightning/pytorch-lightning/blob/c94c0a2b1ee6b444ab1ecf58059e922229d44436/tests/models/test_tpu.py#L74-L88
The test looks good, but it is not capturing the issue: multiple tensorboard log files are created, and these files confuse tensorboard.
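A hedged sketch of the file-counting assertion the linked test seems to be missing; the directory layout and helper name here are assumptions, not taken from the test suite:

import glob
import os

def assert_single_tensorboard_file(tf_dir):
    # After a num_tpu_cores=8 run, exactly one event file should exist under
    # the test_tube "tf" directory, and it should be non-empty.
    event_files = glob.glob(os.path.join(tf_dir, 'events.out.tfevents.*'))
    assert len(event_files) == 1, f'expected 1 event file, found {len(event_files)}'
    assert os.path.getsize(event_files[0]) > 0, 'the event file should contain real logs'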
Are you testing with num_tpu_cores=8? Do you get a single file under the tf directory?
Yea. I didn't see multiple files.
@ibeltagy mind sharing a notebook so we can try to reproduce? Or checking whether this issue is still happening on master?
Closing this issue for now. Feel free to reopen with an example so we can reproduce!
🐛 Bug
With TPUs, TestTubeLogger writes many empty tensorboard logs: one per TPU core except the main one. This confuses tensorboard and prevents it from updating. This happens because the logger is created before the processes are spawned and is then replicated in each process.

To Reproduce
Train any model with ptl.Trainer(logger=TestTubeLogger(), num_tpu_cores=8), then check the tf directory: you will find 1 file with the real logs and 7 empty files.
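For completeness, a minimal repro sketch under these assumptions: pytorch-lightning 0.8.5, an attached 8-core TPU, and a throwaway model. None of it is from the original scripts/pretrain.py, and TestTubeLogger is given an explicit save_dir here since it needs one:

# Minimal repro sketch (assumed setup, not the original training script).
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as ptl
from pytorch_lightning.loggers import TestTubeLogger


class TinyModel(ptl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        # The logged metric is what ends up in the tensorboard event file.
        return {'loss': loss, 'log': {'train_loss': loss}}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        x = torch.randn(512, 32)
        y = torch.randint(0, 2, (512,))
        return DataLoader(TensorDataset(x, y), batch_size=32)


if __name__ == '__main__':
    logger = TestTubeLogger(save_dir='tt_logs', name='tpu_repro')
    trainer = ptl.Trainer(logger=logger, num_tpu_cores=8, max_epochs=1)
    trainer.fit(TinyModel())
    # Then inspect tt_logs/tpu_repro/version_0/tf/ -- the bug shows up as
    # 7 empty event files next to the single real one.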
Expected behavior
Only the main process should create a tensorboard log.
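A minimal sketch of that expected behavior, assuming torch_xla is available. Only the master ordinal touches the tensorboard files, so the seven replica processes never create their empty event files; this is illustrative, not the library's actual fix:

import torch_xla.core.xla_model as xm

def log_metrics_master_only(logger, metrics, step):
    # xm.is_master_ordinal() is True only in the main TPU process; the other
    # replicas skip the call entirely instead of opening their own log files.
    if xm.is_master_ordinal():
        logger.log_metrics(metrics, step=step)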
Environment
pytorch-lightning==0.8.5