Lightning-AI / pytorch-lightning


DDP breaks LR finder #1831

Closed · s-rog closed this issue 4 years ago

s-rog commented 4 years ago

🐛 Bug

DDP breaks LR finder

To Reproduce

finder = trainer.lr_find(model)
print(finder.suggestion())
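
For context, a fuller sketch of the setup that reaches this path is below. The Trainer flags are assumptions, since the report only shows the two lines above; running something like this produces the traceback that follows.

# Assumed setup (not from the report): multi-GPU, spawn-based DDP in the 0.7.x API.
from pytorch_lightning import Trainer

trainer = Trainer(gpus=2, distributed_backend="ddp")
finder = trainer.lr_find(model)   # `model` is the user's LightningModule (defined elsewhere)
print(finder.suggestion())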
Traceback (most recent call last):
  File "./training.py", line 107, in <module>
    main(hparam_trial)
  File "./training.py", line 97, in main
    finder = trainer.lr_find(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/lr_finder.py", line 153, in lr_find
    self.fit(model, train_dataloader=train_dataloader)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_LRFinder._get_new_optimizer.<locals>.configure_optimizers'

At first I thought it was because configure_optimizers returns [opt], [sched], but returning just opt still causes the error. Training works correctly with the same code.
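
The traceback points at the actual cause: the LR finder patches the model by replacing configure_optimizers with a function defined inside _LRFinder._get_new_optimizer, and spawn-based DDP has to pickle the model to send it to the worker processes. Local functions can't be pickled. A minimal standalone sketch of the same failure (illustrative names only, not Lightning code):

# A function defined inside another function is a "local object" and cannot be
# pickled, so spawn-based multiprocessing fails when it is attached to an object
# that gets sent to the child process.
import multiprocessing as mp


class Model:
    pass


def get_new_optimizer(model):
    def configure_optimizers():      # local function, like _LRFinder._get_new_optimizer.<locals>
        return None
    model.configure_optimizers = configure_optimizers
    return model


def worker(model):
    print("worker got", model)


if __name__ == "__main__":
    model = get_new_optimizer(Model())
    ctx = mp.get_context("spawn")    # same pickling path as torch.multiprocessing.spawn
    p = ctx.Process(target=worker, args=(model,))
    p.start()                        # AttributeError: Can't pickle local object ...
    p.join()

That also explains why changing the return value doesn't help: whether configure_optimizers returns [opt], [sched] or just opt, the failure happens while pickling the patched model, before any optimizer is created.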

williamFalcon commented 4 years ago

@SkafteNicki

Alikerin commented 4 years ago

I also face a similar issue with the TensorBoard logger whenever the logger flag is left at its default, on both the GPU and TPU Colab runtimes. It throws the following exception on the TPU runtime:

Exception in device=TPU:0: dictionary update sequence element #0 has length 1; 2 is required
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 531, in tpu_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 465, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.save()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py", line 10, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/loggers/tensorboard.py", line 161, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/core/saving.py", line 151, in save_hparams_to_yaml
    yaml.dump(hparams, fp)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 200, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/yaml/__init__.py", line 188, in dump_all
    dumper.represent(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 26, in represent
    node = self.represent_data(data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 47, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 205, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 116, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/usr/local/lib/python3.6/dist-packages/yaml/representer.py", line 51, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
ValueError: dictionary update sequence element #0 has length 1; 2 is required

Similarly, on the GPU runtime it throws an exception saying it can't pickle _thread.lock objects. I resolved the issue by setting logger=False.
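
For reference, a hedged sketch of that workaround; the GPU/TPU accelerator flags are omitted here, since the exact Trainer configuration isn't shown in the comment.

from pytorch_lightning import Trainer

# Workaround described above: disable the default TensorBoard logger so the
# failing save_hparams_to_yaml call is never reached. Add back whatever
# accelerator settings originally triggered the error.
trainer = Trainer(logger=False)
trainer.fit(model)   # `model` is the user's LightningModule (defined elsewhere)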