dvirginz closed this issue 10 months ago
@dvirginz This is likely just the first of a few errors that you would get if you pickle an Experiment object. Now the question is: why is ddp attempting to pickle the Experiment when the logger is enabled?
We're looking into this.
The way PL works in a multi-node environment is to pickle the whole model and send it. I assume that when you run in DDP environments the Experiment object stays on the main process? If so, how can you log per-process events (for example, specific samples that happened to end up on GPU x that we would like to log)?
Anyhow, thanks for looking into that 🙂
@dvirginz Experiments don't all have to be on the main process. For example, we have our own runner which can be used with or without our Optimizer, and you can coordinate which GPU a process runs on. For more details on that see: https://www.comet.ml/docs/command-line/#comet-optimize
In general, the Experiment object wasn't designed to be pickled as it has live connections to the server. You can work around that though. For example, there is the ExistingExperiment, but also you can just delay creating the Experiment until you are in the process or thread.
Let me know if you would like more information on any of the above.
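The "delay creating the Experiment until you are in the process" suggestion can be sketched as follows. This is a minimal illustration, not Comet's actual API usage: `get_rank`, `create_experiment_if_main`, and the environment-variable names are assumptions, and the real `comet_ml` calls are left as comments.

```python
import os

def get_rank(default=0):
    # DDP launchers typically expose the process rank via environment
    # variables (an assumption; the exact variable names vary by setup).
    for var in ("RANK", "LOCAL_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    return default

def create_experiment_if_main():
    # Hypothetical helper: only the rank-0 process creates the live
    # Experiment, so it is never pickled and shipped to worker processes.
    if get_rank() != 0:
        return None
    # Real code would do something like (sketch):
    #   from comet_ml import Experiment
    #   return Experiment(api_key="...", project_name="...")
    return "experiment"  # placeholder standing in for the Experiment object
```

Because the Experiment is constructed lazily inside each process (and only on rank 0), nothing with live server connections ever crosses a pickle boundary.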
So if I understand correctly, you suggest not using the comet-pytorch_lightning logging infrastructure, but instead, for example, passing the ExistingExperiment id as a hyperparameter, creating the logging object inside the thread, and handling the logging myself (instead of the automatic logging via the CometLogger object)?
I'll try that and update.
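That plan can be sketched roughly like this: the main process passes only the experiment key (a plain string, which is safe to pickle) to each worker, and each worker reconnects on its own. This is a hedged illustration, not verified Comet usage: `worker`, the key value, and the queue plumbing are hypothetical, and the actual `comet_ml` calls are left as comments.

```python
import multiprocessing as mp

def worker(experiment_key, rank, queue):
    # Inside each DDP process: reconnect to the run instead of pickling
    # the live Experiment object. Real code would do something like:
    #   from comet_ml import ExistingExperiment
    #   exp = ExistingExperiment(previous_experiment=experiment_key)
    #   exp.log_metric("rank", rank)
    queue.put((experiment_key, rank))  # stand-in for the logging call

if __name__ == "__main__":
    key = "abc123"  # key taken from the Experiment created in the main process
    q = mp.Queue()
    procs = [mp.Process(target=worker, args=(key, r, q)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    results = sorted(q.get() for _ in procs)
    print(results)  # every process logged against the same experiment key
```

Since only a string crosses the process boundary, the "cannot pickle Experiment" error never arises, and per-process events can still be logged against the same run.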
@dvirginz Could you post your Trainer parameters?
Hi @dsblank, we talked about it on Slack too. It happens even with the simplest configuration, as each distributed process creates a new experiment.
```python
trainer = pl.Trainer(
    logger=CometLogger(
        api_key="ID",
        workspace="User",
        project_name="proj",
        experiment_key=experiment_id,
    ),
    auto_select_gpus=True,
    gpus=3,
    distributed_backend="ddp",
)
```
(I'm still on 3.1.8.)
@dvirginz Update: we are preparing some documentation to help wrestle with this issue. Here is one example with some hints that may help: https://github.com/comet-ml/comet-examples/tree/master/pytorch#using-cometml-with-pytorch-parallel-data-training
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
I have the following problem running in DDP mode with CometLogger. When I detach the logger from the trainer (i.e. delete `logger=comet_logger`), the code runs.