OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0
405 stars 58 forks

fix(dist_utils): fix port conflict in setup_distribution #178

Closed gyt1145028706 closed 4 months ago

gyt1145028706 commented 4 months ago

If there is a port conflict, look for an unused port and update os.environ["MASTER_PORT"] accordingly.
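
A minimal sketch of the behavior described above, not the code from this PR. The helper names `port_is_bindable`, `find_free_port` and `ensure_master_port` are hypothetical, and the fallback port 29500 is an assumption for when `MASTER_PORT` is unset.

```python
import os
import socket


def port_is_bindable(host: str, port: int) -> bool:
    """Return True if a new TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            # Typically EADDRINUSE: another socket already holds the port.
            return False


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the kernel choose a free port
        return s.getsockname()[1]


def ensure_master_port(host: str = "127.0.0.1", default_port: int = 29500) -> int:
    """Keep MASTER_PORT if it is free, otherwise replace it with a free port."""
    port = int(os.environ.get("MASTER_PORT", default_port))
    if not port_is_bindable(host, port):
        port = find_free_port(host)
        os.environ["MASTER_PORT"] = str(port)
    return port
```

The port returned by `find_free_port` is released before the caller uses it, so another process could still grab it in between. The change also only affects the environment of the process that runs it.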

KaiLv69 commented 4 months ago

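With this approach, each rank may end up with a different master_port depending on the order in which the ranks run the check. We could instead do what torchrun does: raise an error, terminate the program, and tell the user to change the environment variable.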
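
The fail-fast alternative suggested here could look roughly like the sketch below. It is an illustration under assumptions, not torchrun's or CoLLiE's actual code: `MasterPortInUseError` and `check_master_port` are hypothetical names, and the check is assumed to run only on the process that will host the rendezvous (rank 0), before it starts listening.

```python
import os
import socket


class MasterPortInUseError(RuntimeError):
    """Raised when MASTER_PORT cannot be bound (hypothetical error type)."""


def check_master_port(host: str = "127.0.0.1") -> int:
    """Return MASTER_PORT if it can be bound, otherwise raise with a hint."""
    port = int(os.environ["MASTER_PORT"])
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))  # released again when the with-block exits
        except OSError as exc:
            raise MasterPortInUseError(
                f"MASTER_PORT={port} is already in use on {host}. "
                "Set a different port, e.g. `export MASTER_PORT=<free port>`, "
                "and relaunch."
            ) from exc
    return port
```

Because no rank rewrites `MASTER_PORT` on its own, every rank keeps reading the same value from the environment, which avoids the ordering problem described above.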

gyt1145028706 commented 4 months ago

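With bind, a port that is actually available can be judged unavailable.

For example, after exporting a new port as prompted, the next run still detects port used as False. Switching to connect avoids this problem.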


@KaiLv69 commented on this pull request.

In collie/utils/dist_utils.py:

> @@ -167,6 +168,24 @@ def _decompose_slurm_nodes(s):
>      return results
> +def port_used(host: str, port: int) -> bool:
> +    "Check whether the port is in use"
> +    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
> +        try:
> +            s.connect((host, port))  # try to bind to the local address and the given port

Why was bind changed to connect here?


KaiLv69 commented 4 months ago


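Exporting an environment variable does not by itself occupy a port. I don't think s.connect is appropriate here; take a look at the difference between connect and bind.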
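
To make the difference concrete, here is a small sketch with two hypothetical helpers, `can_bind` and `can_connect`. It is only an illustration of the two socket calls, not the `port_used` function from the PR.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """True if this process could claim (host, port) for itself right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            # Some other socket already holds the address.
            return False


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """True only if some process is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value on failure.
        return s.connect_ex((host, port)) == 0
```

`bind` asks "can I take this port?", which is the question a process about to host the master endpoint needs answered. `connect` asks "is someone already listening there?", so it also reports failure for a port that another socket has bound but is not listening on yet, and for a host that is unreachable.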