Open leye7755 opened 7 years ago
Hey, thanks for opening the issue.
I mentioned in our conversation earlier that we would have to make Gloo use predefined ports so that you can whitelist them in your environment. This is not a current feature. As I mentioned, the only way to make this work today is to remove any firewalling between the machines you intend to use for distributed training. A possible solution would use consecutive ports for all of its peers (e.g. 5000, 5001, etc., one for every peer). Then you would only have to whitelist a number of ports equal to the number of machines you intend to use. But this is not available today.
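To illustrate the consecutive-port idea, here is a minimal Python sketch of how a fixed base port could map to one listening port per peer. It is not Gloo code: `peer_ports` and its parameters are hypothetical names, and the base port 5000 is only the example value from the comment above.

```python
def peer_ports(base_port, num_peers):
    """Return one listening port per peer: base_port, base_port + 1, ..."""
    if num_peers < 1:
        raise ValueError("num_peers must be at least 1")
    last_port = base_port + num_peers - 1
    if base_port < 1 or last_port > 65535:
        raise ValueError("port range falls outside 1..65535")
    # The peer with rank r would listen on base_port + r (assumed convention).
    return [base_port + rank for rank in range(num_peers)]


if __name__ == "__main__":
    # Ports a firewall would need to allow for a 4-machine job.
    print(peer_ports(5000, 4))
```

With a scheme like this, the firewall rule only has to cover `num_peers` consecutive ports starting at the chosen base.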
Thank you for your help. Did you mean that I should modify the Gloo code? Can you tell me where I should modify it? It would be very useful if you could describe the process or architecture of Gloo. Thank you @pietern
You could start in Pair::listen, that's where the bind(2) function is called (link). You could choose to use, for example, 8 ports round robin if your context is not larger than 8 machines.
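As a rough illustration of what trying a fixed set of ports in turn could look like, here is a Python sketch using plain sockets. It is not the Gloo implementation (Gloo is C++ and this change does not exist there); `listen_in_range` is a hypothetical helper, and the port list is an assumption you would pick to match your firewall rules.

```python
import errno
import socket


def listen_in_range(host, ports, backlog=16):
    """Bind and listen on the first free port in `ports`.

    Returns (socket, port). Raises OSError with EADDRINUSE if every
    candidate port is already taken.
    """
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(backlog)
            return sock, port
        except OSError as exc:
            sock.close()
            # Only "address in use" means "try the next port";
            # any other error is re-raised to the caller.
            if exc.errno != errno.EADDRINUSE:
                raise
    raise OSError(errno.EADDRINUSE, "no free port in the configured range")


if __name__ == "__main__":
    # Example: 8 whitelisted ports for a context of up to 8 machines.
    listener, chosen = listen_in_range("0.0.0.0", range(5000, 5008))
    print("listening on port", chosen)
    listener.close()
```

Each listening socket takes one port from the set, so the set has to be at least as large as the number of listeners a machine opens at the same time.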
Hi @pietern, what is the range of ports Gloo uses for the tcp transport layer?
@erikwijmans Currently it still lets the OS pick a port to bind to. It is technically possible to force a range of ports on the listening side, as long as the number of ports in the range is equal to the number of participants in the context. This is not implemented today though.
For my curiosity: are you trying to use Gloo in a firewalled environment?
Yes, the firewalls on our cluster are fairly restrictive, but we’d like to be able to open up enough ports to use Gloo. Any suggestions? Or is there a range that our OS (Ubuntu 16.04) will tend to use and we can open that range?
It lets the operating system decide which port to use. AFAIK this means it picks an unused port from the ephemeral port range. You can get/set this range with sysctl or by editing procfs values directly. By default the range is rather large:
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 65534
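The same value can be read from procfs in a script, for example to check how many ports a firewall rule would have to cover. A small Python sketch, assuming a Linux host; the helper names are made up for illustration.

```python
from pathlib import Path

# procfs file behind the net.ipv4.ip_local_port_range sysctl (Linux only).
PROC_PATH = "/proc/sys/net/ipv4/ip_local_port_range"


def parse_port_range(text):
    """Parse the 'low<whitespace>high' format used by the procfs file."""
    low, high = (int(field) for field in text.split())
    if low > high:
        raise ValueError("low port is greater than high port")
    return low, high


def read_port_range(path=PROC_PATH):
    """Read the current ephemeral port range from procfs."""
    return parse_port_range(Path(path).read_text())


def range_size(port_range):
    """Number of ports in an inclusive (low, high) range."""
    low, high = port_range
    return high - low + 1


if __name__ == "__main__":
    current = read_port_range()
    print("ephemeral range:", current, "size:", range_size(current))
```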
Also stumbled on this: https://github.com/pytorch/pytorch/issues/44544
The comment on this made me realize I never followed up! Using sysctl to set net.ipv4.ip_local_port_range to some small(-ish) port range (we did a range of 3,000 and that seems to be more than adequate for our cluster size) and then opening that range in the firewall worked perfectly.
Is this feature implemented in Gloo? Can we specify the TCP ports now?
I found that Gloo uses a random port for every TCP connection, but that does not work in my environment. I need to open specific ports for it to work. @pietern suggested having Gloo pick from a predefined set of ports, but I am still confused about how to do this. Is there any way to solve this, or can you give more detail about it? By the way, I use it for distributed training with Caffe2. Many thanks!
@pietern @zpao @yfeldblum @achao @gfosco