XavierXiao / Dreambooth-Stable-Diffusion

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion
MIT License

[Windows]: RuntimeError: Distributed package doesn't have NCCL built in #65

Closed: Tuxius closed this issue 2 years ago

Tuxius commented 2 years ago

Under Windows I get the error message: RuntimeError: Distributed package doesn't have NCCL built in

Traceback (most recent call last):
  File "main.py", line 830, in <module>
    trainer.fit(model, data)
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1137, in _run
    self.accelerator.setup_environment()
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\accelerators\gpu.py", line 39, in setup_environment
    super().setup_environment()
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 83, in setup_environment
    self.training_type_plugin.setup_environment()
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 185, in setup_environment
    self.setup_distributed()
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 272, in setup_distributed
    init_dist_connection(self.cluster_environment, self.torch_distributed_backend)
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\pytorch_lightning\utilities\distributed.py", line 387, in init_dist_connection
    torch.distributed.init_process_group(
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\torch\distributed\distributed_c10d.py", line 583, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\torch\distributed\distributed_c10d.py", line 708, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

Googling for a solution, it seems that PyTorch on Windows does not support NCCL (see e.g. this post). The recommendation is to switch from NCCL to GLOO. However, I can't find the line in the code to do that. Any help is appreciated.
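
To confirm that the installed PyTorch build is the cause, it may help to ask PyTorch which distributed backends it was compiled with. The following is a minimal sketch, not part of this repository: `available_backends` is a hypothetical helper name, and the import is guarded so the script still runs where PyTorch is missing. On a Windows build the list would be expected to lack "nccl".

```python
# Sketch: report which torch.distributed backends this PyTorch build includes.
try:
    import torch.distributed as dist
except ImportError:  # PyTorch is not installed in this environment
    dist = None


def available_backends():
    """Return the names of the distributed backends compiled into PyTorch.

    Hypothetical helper for illustration; returns an empty list when PyTorch
    or its distributed package is unavailable.
    """
    if dist is None or not dist.is_available():
        return []
    backends = []
    if dist.is_nccl_available():
        backends.append("nccl")
    if dist.is_gloo_available():
        backends.append("gloo")
    return backends


if __name__ == "__main__":
    print(available_backends())
```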

Tuxius commented 2 years ago

I found a way to make it work under Windows by adding two lines after line 22 of main.py:

if sys.platform == "win32":
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

I will post a pull request to add these lines.
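
For anyone who wants to try the workaround outside main.py, here is a self-contained sketch of the same idea. The function name `configure_backend` and its parameters are hypothetical and exist only to make the logic testable. Unlike the two lines above, it uses `setdefault`, so a backend already chosen by the user is left alone; that is a variation, not what the pull request does. It has to run before `trainer.fit(...)` is called, since that is where the traceback shows the process group being created.

```python
import os
import sys


def configure_backend(environ=os.environ, platform=sys.platform):
    """Ask PyTorch Lightning for the Gloo backend when running on Windows.

    Hypothetical helper: `environ` and `platform` are parameters only so the
    behaviour can be checked without touching the real environment.
    Returns the configured backend name, or None if nothing is set.
    """
    if platform == "win32":
        # NCCL is not built into PyTorch on Windows, so fall back to Gloo.
        # setdefault keeps any value the user has already exported.
        environ.setdefault("PL_TORCH_DISTRIBUTED_BACKEND", "gloo")
    return environ.get("PL_TORCH_DISTRIBUTED_BACKEND")


if __name__ == "__main__":
    print(configure_backend())
```

On other platforms the function leaves the environment untouched, so PyTorch Lightning keeps its own default backend.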

Tuxius commented 2 years ago

done