Open CNGaoWenbo opened 8 months ago
I only changed `dist_util`:

```python
GPUS_PER_NODE = 4  # change to 4
SETUP_RETRY_COUNT = 3


def setup_dist():
    if dist.is_initialized():
        return
    os.environ["CUDA_VISIBLE_DEVICES"] = '6,7,8,9'  # change to '6,7,8,9'
    backend = "gloo" if not th.cuda.is_available() else "nccl"
    if backend == "gloo":
        hostname = "localhost"
    else:
        hostname = socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = '127.0.1.1'  # comm.bcast(hostname, root=0)
    os.environ["RANK"] = '0'  # str(comm.rank)
    os.environ["WORLD_SIZE"] = '4'  # change to 4
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    s.listen(1)
    port = s.getsockname()[1]
    s.close()
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend=backend, init_method="env://")
```
I launched multi-GPU training with torchrun, but it gets stuck here:
Setting up a new session... Setting up a new session... Setting up a new session...
Does anyone have an idea? Thanks.
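
One thing that stands out in the snippet above: torchrun starts one process per GPU and exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` for each of them. The modified `setup_dist()` overwrites these, so every process claims rank 0 and picks its own random port. With `WORLD_SIZE` set to 4, each process then waits for three peers that never connect to its port, which would explain the hang. Below is a minimal sketch of a `setup_dist()` that reads the values torchrun provides instead of overriding them. The helper `rendezvous_env` and its single-process fallback values are hypothetical and not part of the original `dist_util`.

```python
import os


def rendezvous_env(env=None):
    """Read the per-process settings that torchrun exports.

    The fallbacks are assumptions that only matter when the script is
    started without torchrun as a single process.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }


def setup_dist():
    # Imported here so the helper above can be used without PyTorch.
    import torch as th
    import torch.distributed as dist

    if dist.is_initialized():
        return
    cfg = rendezvous_env()
    # Keep whatever torchrun already set; only fill in missing values.
    os.environ.setdefault("MASTER_ADDR", cfg["master_addr"])
    os.environ.setdefault("MASTER_PORT", str(cfg["master_port"]))

    backend = "nccl" if th.cuda.is_available() else "gloo"
    if backend == "nccl":
        # Bind this process to its own GPU among the visible devices.
        th.cuda.set_device(cfg["local_rank"])
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )
```

With a setup like this, the GPU selection would move out of the code and into the launch command, for example by setting `CUDA_VISIBLE_DEVICES` in the shell before calling `torchrun --nproc_per_node=4` on the training script, so that all four processes see the same device list and each one picks its GPU by `LOCAL_RANK`.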