Open Konerusudhir opened 1 year ago
I tried to run this on an M2 Mac. I get the error below when torch.distributed.all_gather is called.
(dire) skoneru@macbook-pro guided-diffusion % PYTORCH_ENABLE_MPS_FALLBACK=1 ./compute_dire.sh
Logging to /Users/skoneru/workspace/DIRE/recons_images/val/imagenet/real
Namespace(images_dir='/Users/skoneru/workspace/DIRE/images/val/imagenet/real', recons_dir='/Users/skoneru/workspace/DIRE/recons_images/val/imagenet/real', dire_dir='/Users/skoneru/workspace/DIRE/dire_images/val/imagenet/real', clip_denoised=True, num_samples=16, batch_size=4, use_ddim=True, model_path='models/256x256_diffusion_uncond.pt', real_step=0, continue_reverse=False, has_subfolder=True, image_size=256, num_channels=256, num_res_blocks=2, num_heads=4, num_heads_upsample=-1, num_head_channels=64, attention_resolutions='32,16,8', channel_mult='', dropout=0.1, class_cond=False, use_checkpoint=False, use_scale_shift_norm=True, resblock_updown=True, use_fp16=False, use_new_attention_order=False, learn_sigma=True, diffusion_steps=1000, noise_schedule='linear', timestep_respacing='ddim20', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False)
have created model and diffusion
have created data loader
computing recons & DIRE ...
dataset length: 5000
Traceback (most recent call last):
  File "/Users/skoneru/workspace/DIRE/guided-diffusion/compute_dire.py", line 172, in <module>
    main()
  File "/Users/skoneru/workspace/DIRE/guided-diffusion/compute_dire.py", line 121, in main
    dist.all_gather(gathered_samples, recons)  # gather not supported with NCCL
  File "/Users/skoneru/miniconda/envs/dire/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/Users/skoneru/miniconda/envs/dire/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2448, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: ProcessGroupGloo::allgather: invalid tensor type at index 0 (expected TensorOptions(dtype=unsigned char, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)), got TensorOptions(dtype=unsigned char, device=mps:0, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
Command used is
mpiexec -n 1 python compute_dire.py --model_path $MODEL_PATH $MODEL_FLAGS $SAVE_FLAGS $SAMPLE_FLAGS --has_subfolder True
Changes: Set device to "mps"
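The RuntimeError above says the Gloo process group expected a tensor on `cpu` but received one on `mps:0`. One possible workaround, sketched below, is to copy the tensor to the CPU before gathering. This is only a sketch, not the repository's own fix: `all_gather_on_cpu` is a hypothetical helper name, and it assumes the process group has already been initialized with the Gloo backend.

```python
from typing import List

import torch
import torch.distributed as dist


def all_gather_on_cpu(tensor: torch.Tensor) -> List[torch.Tensor]:
    """Gather `tensor` from every rank using CPU copies.

    Hypothetical helper: assumes dist.init_process_group(...) was already
    called with the Gloo backend, which works on CPU tensors.
    """
    # Copy off the accelerator (e.g. "mps") so the tensor matches what Gloo expects.
    cpu_tensor = tensor.detach().cpu().contiguous()
    # The output buffers must live on the CPU as well, one per rank.
    gathered = [torch.zeros_like(cpu_tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, cpu_tensor)
    return gathered
```

If this approach fits, the call at line 121 of compute_dire.py could presumably become something like `gathered_samples = all_gather_on_cpu(recons)`, with any later use of the gathered tensors done on the CPU. I have not verified this against the rest of the script.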