Closed · jaehyungjung closed this issue 4 months ago
Hi, when using distributed training, PyTorch internally creates a new module in which the original model is stored under a `module` attribute. Accessing `model.module` lets you interact with the original model directly. This is useful for operations that need to work on the unwrapped model, such as saving the model weights.
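For illustration, here is a minimal self-contained sketch of that pattern (the toy `nn.Linear` and the `distributed` flag are placeholders, not this repo's exact code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the real network
distributed = False        # stand-in for the --distributed flag

model_without_ddp = model  # plain assignment: both names refer to the same object
if distributed:
    # DDP wraps the model and keeps the original at model.module
    model = nn.parallel.DistributedDataParallel(model)
    model_without_ddp = model.module  # still the same underlying nn.Linear

# The parameters are shared, so an optimizer built from model_without_ddp
# updates exactly the tensors that training through `model` produces
# gradients for.
optimizer = torch.optim.SGD(model_without_ddp.parameters(), lr=0.1)
print(model_without_ddp is model)  # True when not distributed
```

With `--distributed False`, the two names are literally the same object, so there is nothing to sync: training through one is training the other.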
Thank you very much for your quick reply!
Hi, thank you very much for your great work!

I have a question about `model` and `model_without_ddp` in the `main_*.py` scripts. (I'm new to PyTorch and distributed data parallel.) I'm trying to train your network on a new dataset, and my setting has `--resume some_path` and `--distributed False`. When I follow your `main_depth.py` script, I find that `model` is initialized and `model_without_ddp` loads the checkpoint from `--resume`, so something like the snippet below.
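A plausible sketch of the omitted snippet, following the usual layout of these `main_*.py` scripts (the toy model and the `"model"` checkpoint key are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for however the real network is built
model_without_ddp = model  # with --distributed False, no DDP wrapping happens

optimizer = torch.optim.AdamW(model_without_ddp.parameters(), lr=1e-4)

resume = "some_path"  # corresponds to --resume some_path
checkpoint = torch.load(resume, map_location="cpu")
model_without_ddp.load_state_dict(checkpoint["model"])  # "model" key is an assumption
```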
Then, it seems that during the epochs `model_without_ddp` is actually trained, because `optimizer` is initialized with `model_without_ddp`.

My question is: why do we have two models (`model` and `model_without_ddp`) in this case? How do they stay in sync with each other? Is the assignment actually a pointer?

Thanks!