comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Error running on ddp (can't pickle local object 'SummaryTopic') with comet logger (pytorch lightning) #352

Closed dvirginz closed 10 months ago

dvirginz commented 4 years ago

I have the following problem running in ddp mode with CometLogger. When I detach the logger from the trainer (i.e. deleting logger=comet_logger), the code runs.

Exception has occurred: AttributeError
Can't pickle local object 'SummaryTopic.__init__.<locals>.default'
  File "/path/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/path/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/path/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/path/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/path/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/path/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/path/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/path/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/repo_path/train.py", line 158, in main_train
    trainer.fit(model)
  File "/repo_path/train.py", line 72, in main
    main_train(model_class_pointer, hyperparams, logger)
  File "/repo_path/train.py", line 167, in <module>
    main()
  File "/path/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/path/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/path/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/path/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/path/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
dsblank commented 4 years ago

@dvirginz This is likely just the first of a few errors that you would get if you pickle an Experiment object. Now the question is: why is ddp attempting to pickle the Experiment when the logger is enabled?
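The specific message is the standard CPython limitation that locally defined functions can't be pickled, and anything reachable from the pickled object that holds one hits it. A minimal sketch, independent of Comet (SummaryLike and its contents are made up for illustration):

import pickle
from collections import defaultdict

class SummaryLike:
    def __init__(self):
        def default():
            return 0
        # The local function ends up stored on the instance via the
        # defaultdict, so pickling the instance must pickle it too.
        self.topics = defaultdict(default)

try:
    pickle.dumps(SummaryLike())
except AttributeError as exc:
    # e.g. "Can't pickle local object 'SummaryLike.__init__.<locals>.default'"
    print(exc)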

We're looking into this.

dvirginz commented 4 years ago

The way PL works in a multi-node environment is to pickle and send the whole model. I assume that when you run in ddp environments the Experiment object stays on the main process? If so, how can we log per-process events (for example, specific samples that happened to end up on GPU x that we would like to log)?

Anyhow, thanks for looking into that 🙂

dsblank commented 4 years ago

@dvirginz Experiments don't all have to be on the main process. For example, we have our own runner which can be used with or without our Optimizer, and you can coordinate which GPU a process runs on. For more details on that see: https://www.comet.ml/docs/command-line/#comet-optimize

In general, the Experiment object wasn't designed to be pickled, as it holds live connections to the server. You can work around that, though: for example, use an ExistingExperiment, or simply delay creating the Experiment until you are inside the process or thread.
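Roughly, the "delay creating the Experiment" option looks like this; a minimal sketch assuming a plain torch.multiprocessing spawn (run_worker and the project name are just placeholders):

import torch.multiprocessing as mp
from comet_ml import Experiment

def run_worker(rank, world_size):
    # Create the Experiment inside the spawned process, so it never has to
    # travel through pickle; each rank gets its own live connection.
    experiment = Experiment(project_name="proj")
    experiment.log_other("rank", rank)
    # ... per-process training and logging here ...
    experiment.end()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)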

Let me know if you would like more information on any of the above.

dvirginz commented 4 years ago

So if I understand correctly, you suggest not using the comet-pytorch_lightning logging infrastructure, but instead, for example, passing the existing experiment id as a hyperparameter, creating the logging object inside the thread, and taking care of the logging myself (instead of the automatic logging provided by the CometLogger object)?
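Something along these lines, I guess; just a sketch of what I have in mind, with my own naming (the lazy experiment property, reading COMET_API_KEY from the environment) and assuming ExistingExperiment can re-attach via the previous experiment's key:

import os

from comet_ml import ExistingExperiment
import pytorch_lightning as pl

class Model(pl.LightningModule):
    def __init__(self, experiment_key):
        super().__init__()
        self.experiment_key = experiment_key  # plain string, pickles fine
        self._experiment = None               # created lazily in each DDP process

    @property
    def experiment(self):
        if self._experiment is None:
            # Re-attach to the run that was created before spawning.
            self._experiment = ExistingExperiment(
                api_key=os.environ["COMET_API_KEY"],
                previous_experiment=self.experiment_key,
            )
        return self._experiment

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the real loss
        self.experiment.log_metric("train_loss", loss.item(), step=self.global_step)
        return loss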

I'll try that and update.

dsblank commented 4 years ago

@dvirginz Could you post your Trainer parameters?

dvirginz commented 4 years ago

Hi @dsblank, we talked about this on Slack too. It happens even with the simplest configuration, as each distributed process creates a new experiment.

trainer = pl.Trainer(
    logger=CometLogger(
        api_key="ID",
        workspace="User",
        project_name="proj",
        experiment_key=experiment_id,
    ),
    auto_select_gpus=True,
    gpus=3,
    distributed_backend="ddp",
)

(I'm still on 3.1.8)

dsblank commented 4 years ago

@dvirginz Update: we are preparing some documentation to help wrestle with this issue. Here is one example with some hints that may help: https://github.com/comet-ml/comet-examples/tree/master/pytorch#using-cometml-with-pytorch-parallel-data-training
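One common pattern, not necessarily exactly what that example does, is to create an Experiment only on rank 0 and skip Comet calls on the other ranks. A rough sketch, assuming torch.distributed is already initialized in each worker and using a placeholder project name:

import torch.distributed as dist
from comet_ml import Experiment

def maybe_create_experiment():
    # Only rank 0 talks to Comet; every other rank returns None and skips logging.
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        return Experiment(project_name="proj")
    return None

# Inside each worker's training loop:
# experiment = maybe_create_experiment()
# if experiment is not None:
#     experiment.log_metric("loss", loss_value, step=step)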

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 5 days with no activity.