ChenWu98 / cycle-diffusion

[ICCV 2023] A latent space for stochastic diffusion models

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 #7

Closed chenkai-666 closed 2 years ago

chenkai-666 commented 2 years ago

Thank you for your excellent work. I have encountered some problems. Can you tell me how to correct them?

[screenshot of the error log]

ChenWu98 commented 2 years ago

It's really weird... In trainer.py, we have these lines of code:

```python
self.model = nn.parallel.DistributedDataParallel(
    self.model,
    device_ids=[self.args.local_rank],
    output_device=self.args.local_rank,
    find_unused_parameters=self.args.ddp_find_unused_parameters,
)
```

So self.model should be an nn.parallel.DistributedDataParallel instance instead of an nn.Module instance; however, in your log, it is an nn.Module instance. I haven't encountered this bug; maybe it is related to CPU/GPU settings.

A quick solution is: in trainer.py there are two mentions of self.model.module, and you can change them to None because they are not used anyway. Also, as I mentioned in #6, a version based on the diffusers library will be available soon, which will be easier to use. Hope this is helpful!
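If you'd rather not hard-code `None`, a more defensive pattern is to unwrap the model only when it was actually wrapped. A minimal sketch (`unwrap_model` is a hypothetical helper, not part of this repo's code):

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the underlying module whether or not DDP wrapping happened.

    DistributedDataParallel stores the original network under .module;
    a plain nn.Module has no such attribute, so return it unchanged.
    """
    if isinstance(model, nn.parallel.DistributedDataParallel):
        return model.module
    return model
```

With this helper, replacing `self.model.module` with `unwrap_model(self.model)` works in both the single-process and DDP cases, so the code no longer crashes when DDP wrapping is skipped.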