Closed · jaehyungjung closed this issue 4 months ago
Hi, when using distributed training, PyTorch internally creates a new module in which the original model is stored under a `module` attribute. Accessing `model.module` lets you interact with the original model directly. This is useful for operations that need to work on the unwrapped model, such as saving the model weights.
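For illustration, here is a minimal self-contained sketch of that pattern (the toy `nn.Linear` and the `distributed` flag are placeholders, not this repo's exact code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the real network
distributed = False        # stand-in for the --distributed flag

model_without_ddp = model  # plain assignment: both names refer to the same object
if distributed:
    # DDP wraps the model and keeps the original at model.module
    model = nn.parallel.DistributedDataParallel(model)
    model_without_ddp = model.module  # still the same underlying nn.Linear

# The parameters are shared, so an optimizer built from model_without_ddp
# updates exactly the tensors that training through `model` produces
# gradients for.
optimizer = torch.optim.SGD(model_without_ddp.parameters(), lr=0.1)
print(model_without_ddp is model)  # True when not distributed
```

With `--distributed False`, the two names are literally the same object, so there is nothing to sync: training through one is training the other.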
Thank you very much for your quick reply!
Hi, thank you very much for your great work!

I have a question about `model` and `model_without_ddp` in the `main_*.py` scripts. (I'm new to PyTorch and distributed data parallel.) I'm trying to train your network on a new dataset, and my setting has `--resume some_path` and `--distributed False`. When I follow your `main_depth.py` script, I find that `model` is initialized and `model_without_ddp` loads the checkpoint from `--resume`, so something like the snippet below.
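A plausible sketch of the omitted snippet, following the usual layout of these `main_*.py` scripts (the toy model and the `"model"` checkpoint key are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for however the real network is built
model_without_ddp = model  # with --distributed False, no DDP wrapping happens

optimizer = torch.optim.AdamW(model_without_ddp.parameters(), lr=1e-4)

resume = "some_path"  # corresponds to --resume some_path
checkpoint = torch.load(resume, map_location="cpu")
model_without_ddp.load_state_dict(checkpoint["model"])  # "model" key is an assumption
```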
Then, it seems that during the epochs `model_without_ddp` is actually trained, because `optimizer` is initialized with `model_without_ddp`.

My question is: why do we have two models (`model` and `model_without_ddp`) in this case? How do they stay in sync with each other? Is the assignment actually a pointer?

Thanks!