Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Exception: The wandb backend process has shutdown #10688

Closed morestart closed 2 years ago

morestart commented 2 years ago

🐛 Bug

Exception: The wandb backend process has shutdown

full error info:

Traceback (most recent call last):
  File "/home/cat/PycharmProjects/torch-ocr/tools/train/det_train/train.py", line 59, in <module>
    trainer.fit(model, data)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in fit
    self._call_and_handle_interrupt(
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in _run
    self._dispatch()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1272, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1282, in run_stage
    return self._run_train()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1312, in _run_train
    self.fit_loop.run()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 232, in advance
    self.trainer.logger_connector.update_train_step_metrics()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 225, in update_train_step_metrics
    self.log_metrics(self.metrics["log"])
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 121, in log_metrics
    self.trainer.logger.save()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 427, in save
    logger.save()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 317, in save
    self._finalize_agg_metrics()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 152, in _finalize_agg_metrics
    self.log_metrics(metrics=metrics_to_log, step=agg_step)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 370, in log_metrics
    self.experiment.log({**metrics, "trainer/global_step": step})
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 43, in experiment
    return get_experiment() or DummyExperiment()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 49, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in get_experiment
    return fn(self)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 349, in experiment
    self._experiment.define_metric("trainer/global_step")
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 2195, in define_metric
    m._commit()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_metric.py", line 117, in _commit
    self._callback(m)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 933, in _metric_callback
    self._backend.interface._publish_metric(metric_record)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 309, in _publish_metric
    self._publish(rec)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 223, in _publish
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

wandb: Waiting for W&B process to finish, PID 7468... (failed 1). Press ctrl-c to abort syncing.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1671, in _atexit_cleanup
    self._on_finish()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1844, in _on_finish
    self._backend.interface._publish_telemetry(self._telemetry_obj)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 82, in _publish_telemetry
    self._publish(rec)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 223, in _publish
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1680, in _atexit_cleanup
    self._backend.cleanup()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 228, in cleanup
    self.interface.join()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 481, in join
    super(InterfaceQueue, self).join()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 591, in join
    self._communicate_shutdown()
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 478, in _communicate_shutdown
    _ = self._communicate(record)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 232, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/home/cat/miniconda3/envs/torch-ocr/lib/python3.8/site-packages/wandb/sdk/interface/interface_queue.py", line 237, in _communicate_async
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

Process finished with exit code 1

To Reproduce

Expected behavior

Environment

cc @awaelchli @morganmcg1 @AyushExel @borisdayma @scottire

morganmcg1 commented 2 years ago

@morestart can you please share some code to reproduce this? And can you describe what you were trying to log (e.g. task, data type, etc.)?

Also make sure you are on the latest version of wandb: pip install wandb --upgrade
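To double-check which versions are actually installed in the environment, a quick stdlib-only snippet (nothing assumed beyond the two package names) can be run:

```python
from importlib.metadata import PackageNotFoundError, version

# Report the installed versions of the two relevant packages;
# a missing package is reported instead of raising.
for pkg in ("wandb", "pytorch-lightning"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

`importlib.metadata` is available from Python 3.8 onward, which matches the environment in the traceback above.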

morestart commented 2 years ago

This is my code, @morganmcg1. My wandb version is 0.12.6. The error occurred at epoch 69. I am now running a new program and it works well... The error info is too sparse, so I don't know why.

import torch
import pytorch_lightning as pl


# encoder, neck, head, loss_func, postprocess, metric and get_optimizer
# are defined elsewhere in the project.
class DetModel(pl.LightningModule):

    def forward(self, x):
        features = self.encoder(x)
        features = self.neck(features)
        features = self.head(features)

        return features

    def training_step(self, batch, batch_idx):
        data = batch
        output = self.forward(data['img'])
        loss_dict = self.loss_func(output, batch)

        self.log(name=self.train_loss_name, value=loss_dict['loss'])
        self.log(name='shrink_maps', value=loss_dict['loss_shrink_maps'])
        self.log(name='threshold_maps', value=loss_dict['loss_threshold_maps'])
        self.log(name='binary_maps', value=loss_dict['loss_binary_maps'])

        return loss_dict['loss']

    def validation_step(self, batch, batch_idx):
        data = batch

        output = self.forward(data['img'])
        boxes, scores = self.postprocess(output.cpu().numpy(), batch['shape'])
        raw_metric = self.metric(batch, (boxes, scores))

        return raw_metric

    def validation_epoch_end(self, outputs):
        metric = self.metric.gather_measure(outputs)
        self.log('recall', value=metric['recall'].avg)
        self.log('precision', value=metric['precision'].avg)
        self.log('hmean', value=metric['fmeasure'].avg)
        return {'hmean': metric['fmeasure'].avg}

    def configure_optimizers(self):
        optimizer = get_optimizer(self.parameters(), self.optimizer_name, self.lr)
        lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max')
        return {'optimizer': optimizer, 'lr_scheduler': lr_scheduler, "monitor": 'hmean'}
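
As a side note unrelated to the crash itself: in more recent PyTorch Lightning releases, the scheduler and its monitored metric are usually nested under the "lr_scheduler" key rather than placed flat in the returned dict. A minimal sketch of that shape (stub objects stand in for the real torch optimizer and scheduler, so this is only illustrative):

```python
# Sketch of the nested return shape newer PyTorch Lightning versions expect
# from configure_optimizers. The optimizer and scheduler arguments are
# placeholders; in real code they would be torch.optim objects.
def configure_optimizers_shape(optimizer, scheduler):
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "monitor": "hmean",  # metric logged in validation_epoch_end
        },
    }
```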

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

borisdayma commented 2 years ago

This should now be fixed. Could you update PyTorch Lightning from the master branch and install the most recent version of wandb?

pip install --upgrade wandb
pip install --upgrade git+https://github.com/PytorchLightning/pytorch-lightning.git

morestart commented 2 years ago

OK, I will try it and close this issue. Thanks for your work!