This is the official PyTorch implementation of the paper "Diffusion Models for Implicit Image Segmentation Ensembles".
Error when training on multiple GPUs #63

Open CNGaoWenbo opened 3 months ago

CNGaoWenbo commented 3 months ago

I launched multi-GPU training with torchrun, but it hangs at this point:

Setting up a new session... Setting up a new session... Setting up a new session...

Does anyone have an idea what is going wrong? Thanks.

CNGaoWenbo commented 3 months ago

I only changed `dist_util`:

```python
GPUS_PER_NODE = 4  # changed to 4


def setup_dist():
    if dist.is_initialized():
        return
    os.environ["CUDA_VISIBLE_DEVICES"] = '6,7,8,9'  # changed to '6,7,8,9'

    backend = "gloo" if not th.cuda.is_available() else "nccl"

    if backend == "gloo":
        hostname = "localhost"
    else:
        hostname = socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = ''  # comm.bcast(hostname, root=0)
    os.environ["RANK"] = '0'  # str(comm.rank)
    os.environ["WORLD_SIZE"] = '4'  # changed to 4

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    port = s.getsockname()[1]
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend=backend, init_method="env://")
```
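
A possible cause, offered as a guess rather than a confirmed diagnosis: in the snippet above every process sets `RANK` to `'0'`, leaves `MASTER_ADDR` empty, and binds its own free port for `MASTER_PORT`. With `WORLD_SIZE` set to 4, the `env://` rendezvous waits for four distinct ranks at one shared address and port, so the workers may never find each other. torchrun already exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` for each worker, so one option is to stop overwriting them. The sketch below assumes that approach; `resolve_dist_env` and `setup_dist_from_torchrun` are hypothetical names and are not part of this repository.

```python
import os


def resolve_dist_env(env=None):
    """Read the per-worker settings that torchrun exports (hypothetical helper)."""
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(
            "missing or empty variables (was the script started with torchrun?): "
            + ", ".join(missing)
        )
    return {
        "rank": int(env["RANK"]),              # unique per process across all nodes
        "local_rank": int(env["LOCAL_RANK"]),  # index of the process on this node
        "world_size": int(env["WORLD_SIZE"]),  # total number of processes
        "master_addr": env["MASTER_ADDR"],     # same value in every process
        "master_port": int(env["MASTER_PORT"]),
    }


def setup_dist_from_torchrun():
    """Initialise the process group without overriding torchrun's variables."""
    # Imported here so resolve_dist_env can be checked without PyTorch installed.
    import torch as th
    import torch.distributed as dist

    if dist.is_initialized():
        return
    cfg = resolve_dist_env()
    use_cuda = th.cuda.is_available()
    if use_cuda:
        # Bind this worker to one of the GPUs visible to the job.
        th.cuda.set_device(cfg["local_rank"])
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo", init_method="env://"
    )
```

With this approach the GPU selection would move out of the Python code, for example by exporting `CUDA_VISIBLE_DEVICES=6,7,8,9` in the shell before calling torchrun with `--nproc_per_node=4`, so that local ranks 0 to 3 map onto those four devices.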