JuliaWolleb / Diffusion-based-Segmentation

This is the official Pytorch implementation of the paper "Diffusion Models for Implicit Image Segmentation Ensembles".
MIT License

error when training on multiple gpus #63

Open CNGaoWenbo opened 3 months ago

CNGaoWenbo commented 3 months ago

I launched multi-GPU training with torchrun, but it hangs at this point:

Setting up a new session... Setting up a new session... Setting up a new session...

Does anyone have an idea? Thanks.

CNGaoWenbo commented 3 months ago

I only changed `dist_util`:

```python
GPUS_PER_NODE = 4  # change to 4

SETUP_RETRY_COUNT = 3


def setup_dist():
    if dist.is_initialized():
        return
    os.environ["CUDA_VISIBLE_DEVICES"] = '6,7,8,9'  # change to '6,7,8,9'

    backend = "gloo" if not th.cuda.is_available() else "nccl"

    if backend == "gloo":
        hostname = "localhost"
    else:
        hostname = socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = '127.0.1.1'  # comm.bcast(hostname, root=0)
    os.environ["RANK"] = '0'  # str(comm.rank)
    os.environ["WORLD_SIZE"] = '4'  # change to 4

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    s.listen(1)
    port = s.getsockname()[1]
    s.close()
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend=backend, init_method="env://")
```
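
A note on the snippet above: when a job is started with torchrun, every worker process already receives its own `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` in the environment. Overwriting `RANK` with `'0'` and picking a fresh free port in every process is therefore a likely (though unconfirmed) reason the workers never find each other and the run hangs. The following is a minimal sketch of reading those variables instead of hardcoding them. It is not the repository's code; `resolve_dist_env` is a hypothetical helper name, and the sketch assumes the job is launched with torchrun.

```python
import os


def resolve_dist_env(env=None):
    """Collect the rendezvous settings that torchrun exports for each worker."""
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in env]
    if missing:
        # Outside torchrun these variables are normally absent.
        raise RuntimeError(f"not launched with torchrun? missing: {missing}")
    return {
        "rank": int(env["RANK"]),
        "local_rank": int(env["LOCAL_RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
    }


def setup_dist():
    """Hypothetical variant of setup_dist(); needs PyTorch and a torchrun launch."""
    # Imported lazily so resolve_dist_env() can be used without PyTorch installed.
    import torch as th
    import torch.distributed as dist

    if dist.is_initialized():
        return
    cfg = resolve_dist_env()
    backend = "nccl" if th.cuda.is_available() else "gloo"
    if backend == "nccl":
        # Bind this worker to its own GPU among the visible devices.
        th.cuda.set_device(cfg["local_rank"])
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE as set by torchrun.
    dist.init_process_group(backend=backend, init_method="env://")
```

With this approach the GPU selection would move to the launch command rather than the Python code, for example `CUDA_VISIBLE_DEVICES=6,7,8,9 torchrun --nproc_per_node=4 <training script>`, so that all four workers share one master address and port.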