Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

All TPU cores create tensorboard logs #2698

Closed: ibeltagy closed this issue 4 years ago

ibeltagy commented 4 years ago

🐛 Bug

With TPUs, TestTubeLogger writes many empty tensorboard logs, one per TPU core except the main one. This confuses tensorboard and prevents it from updating. It happens because the logger is created before the processes are spawned and is then replicated in each process.
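
For context, a minimal sketch of the pattern involved (illustrative only, not Lightning's actual internals): every spawned TPU worker inherits a copy of whatever the parent built, so any worker that opens a writer leaves behind an event file; guarding creation with torch_xla's master-ordinal check keeps the directory down to a single file.

```python
# Illustrative sketch only, not Lightning's internals.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.utils.tensorboard import SummaryWriter

def _worker(index, log_dir):
    # Only the master ordinal opens a writer; the other seven processes
    # skip it, so exactly one event file appears under log_dir.
    if xm.is_master_ordinal():
        writer = SummaryWriter(log_dir)
        writer.add_scalar("demo/loss", 0.0, global_step=0)
        writer.close()

if __name__ == "__main__":
    xmp.spawn(_worker, args=("tf",), nprocs=8)
```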

To Reproduce

Train any model with ptl.Trainer(logger=TestTubeLogger(), num_tpu_cores=8), then check the tf directory: you will find one file with the real log and seven empty files.
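
A hedged reproduction sketch following the call above (`MyModel` is a placeholder for any LightningModule; argument names match the 0.8.5 API used in this report):

```python
import pytorch_lightning as ptl
from pytorch_lightning.loggers import TestTubeLogger

# MyModel: any LightningModule you already have (hypothetical placeholder).
logger = TestTubeLogger(save_dir="tf", name="repro")
trainer = ptl.Trainer(logger=logger, num_tpu_cores=8, max_epochs=1)
trainer.fit(MyModel())
# Listing the log directory afterwards shows eight *tfevents* files,
# only one of which contains data.
```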

Expected behavior

Only the main process creates a tensorboard log.

Environment

pytorch-lightning==0.8.5

Borda commented 4 years ago

@lezwon mind having a look?

lezwon commented 4 years ago

Will do :]

Borda commented 4 years ago

@ibeltagy mind checking it with the latest master?

ibeltagy commented 4 years ago

@Borda, I couldn't test it because master appears to be broken. It runs for a few steps and then crashes with the error below. The only thing I changed is switching from release v0.8.5 to master; everything else is the same.

Epoch 1:   1%|█▏                                                                                                                                   | 10/1066 [02:07<3:45:12, 12.80s/it, loss=1.645, v_num=0]
Traceback (most recent call last):
  File "scripts/pretrain.py", line 488, in <module>
    main(args)
  File "scripts/pretrain.py", line 482, in main
    trainer.fit(pretrainer)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/trainer/states.py", line 34, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1059, in fit
    self.accelerator_backend.train(model)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/accelerators/tpu_backend.py", line 87, in train
    start_method=self.start_method
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 387, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/accelerators/tpu_backend.py", line 118, in tpu_train_in_process
    trainer.transfer_distrib_spawn_state_on_fit_end(model, mp_queue, results)
  File "/home/beltagy/pytorch-lightning/pytorch_lightning/trainer/distrib_data_parallel.py", line 417, in transfer_distrib_spawn_state_on_fit_end
    if self.distributed_backend.lower() not in ['ddp_spawn', 'ddp_cpu', 'tpu']:
AttributeError: 'NoneType' object has no attribute 'lower'
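
For reference, the crash is a .lower() call on a distributed_backend that is still None when the TPU backend spawns workers. A defensive guard along these lines (a hedged sketch; the actual fix on master may differ) would avoid the AttributeError:

```python
def _is_non_spawn_backend(distributed_backend):
    """Hedged sketch of a guard for the line in the traceback above:
    `distributed_backend` can still be None on the TPU path, so
    normalise it before calling .lower(). Not the actual fix on master."""
    backend = (distributed_backend or "").lower()
    return backend not in ("ddp_spawn", "ddp_cpu", "tpu")
```

In transfer_distrib_spawn_state_on_fit_end the check would then read `if _is_non_spawn_backend(self.distributed_backend):`.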

Borda commented 4 years ago

Trying to reproduce with Colab tests, but I have a problem installing XLA. @zcain117? https://colab.research.google.com/drive/15E3oo3vPsSvLVufU4I6AK2thcXcg2mdW#scrollTo=BHBz1_AnamN_

edenlightning commented 4 years ago

@lezwon mind taking a look?

lezwon commented 4 years ago

I haven't been able to reproduce this issue. Maybe if @ibeltagy could share a notebook I could look into it. :)

ibeltagy commented 4 years ago

Are you testing with num_tpu_cores=8? Do you get a single file under the tf directory?

Borda commented 4 years ago

The test is here, but it does not count files: https://github.com/PyTorchLightning/pytorch-lightning/blob/c94c0a2b1ee6b444ab1ecf58059e922229d44436/tests/models/test_tpu.py#L74-L88

ibeltagy commented 4 years ago

The test looks good, but it does not capture the issue: multiple tensorboard log files are created, and those files confuse tensorboard.
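
A hedged sketch of the kind of assertion that test could add to catch the regression (the path layout is an assumption; adjust it to wherever the logger writes its event files):

```python
import glob
import os

def assert_single_event_file(log_dir):
    """After fitting on 8 TPU cores, only one non-empty tfevents file
    should exist; extra empty per-core files indicate this bug."""
    event_files = glob.glob(os.path.join(log_dir, "**", "*tfevents*"), recursive=True)
    non_empty = [f for f in event_files if os.path.getsize(f) > 0]
    assert len(event_files) == 1, f"found {len(event_files)} event files, expected 1"
    assert len(non_empty) == 1, "the single event file should contain data"
```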

lezwon commented 4 years ago

> Are you testing with num_tpu_cores=8? Do you get a single file under the tf directory?

Yeah, I didn't see multiple files.

edenlightning commented 4 years ago

@ibeltagy mind sharing a notebook so we can try to reproduce? Or checking whether this issue still happens on master?

edenlightning commented 4 years ago

Closing this issue for now. Feel free to reopen with an example so we can reproduce!