NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL NAT #680

Open spyroot opened 2 years ago

spyroot commented 2 years ago

Hi Folks,

I am trying to understand why NCCL doesn't negotiate ports, i.e., why it isn't NAT-transparent. Say one instance runs inside a Docker container with port 54321 translated, i.e. -p 54321:54321. On another host, the mapping is symmetrical: 54321:54321.

If I understand correctly, the main problem is not the initial handshake; it is that, further down the line, the master doesn't see which port the remote host uses. So logically, why not scope a range, say 5400-5510, and do the translation on each host, i.e. -p

I'll try to patch it myself and allocate the source port deterministically. But can you please explain the logic behind the current decision?

Note that Docker does a 1:1 mapping, so if you have 3 worker nodes you should be able to create 3 pairs of mappings.

ty

sjeaugey commented 2 years ago

The current design assumes that different NCCL ranks can access each other directly through at least one IP interface. So when one rank opens a port, it can pass its IP/port information to another rank and the rank will be able to connect and communicate through sockets.
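To illustrate the assumption described above, here is a minimal Python sketch of the general pattern, not NCCL's actual code. One "rank" binds to an OS-chosen port and advertises its local (ip, port); the other connects to that address directly. The function names (`open_listener`, `serve_once`, `exchange`) are hypothetical, and both ranks run on localhost for simplicity.

```python
import socket
import threading

def open_listener(host="127.0.0.1"):
    """Open a listening socket on an OS-chosen port and return (socket, address)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0: the kernel picks a free ephemeral port
    srv.listen(1)
    return srv, srv.getsockname()  # (ip, port) as seen locally by this rank

def serve_once(srv, received):
    """Accept one peer connection and record what it sends."""
    conn, _ = srv.accept()
    with conn:
        received.append(conn.recv(1024))
    srv.close()

def exchange(payload=b"hello from rank 1"):
    # "Rank 0" opens a port and advertises its local (ip, port).
    srv, advertised = open_listener()
    received = []
    t = threading.Thread(target=serve_once, args=(srv, received))
    t.start()
    # "Rank 1" connects directly to the advertised address.
    with socket.create_connection(advertised, timeout=5) as peer:
        peer.sendall(payload)
    t.join(timeout=5)
    return advertised, received[0]
```

The advertised address is whatever the listening side sees locally. Behind NAT, that port is not necessarily the one reachable from outside, and since it is chosen at run time, a static mapping cannot be prepared for it in advance.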

There isn't much more than that: we just don't support other cases currently. Users should set up a private (secure) network between their nodes and tell NCCL to use that network through the NCCL_SOCKET_IFNAME environment variable.
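As a rough sketch of how a launcher script might set that variable from Python: the helper name `select_nccl_interface` is hypothetical, and the interface to pass in (for example the one attached to the private network) is an assumption that depends on the host. The variable has to be in the environment before NCCL initializes in the process.

```python
import os
import socket

def select_nccl_interface(preferred):
    """Point NCCL at `preferred` if that interface exists on this host."""
    # socket.if_nameindex() lists (index, name) pairs for local interfaces.
    available = [name for _, name in socket.if_nameindex()]
    if preferred not in available:
        raise ValueError(f"{preferred!r} not found; available: {available}")
    # Set the variable before any NCCL-based library is initialized.
    os.environ["NCCL_SOCKET_IFNAME"] = preferred
    return preferred
```

Checking the name against the local interface list first surfaces a typo before the job starts.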

We don't have a way to control which port to use. I know there is a way to configure Linux to restrict user ports to a given range, and it seemed okay for some users. I don't think it would be hard to add a feature to specify a range, but getting all the ports properly routed and permitting rank-to-rank communication through NAT is a substantial effort, and probably more work than setting up a private network.
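The Linux setting mentioned above is presumably the `net.ipv4.ip_local_port_range` sysctl, which bounds the ports the kernel hands out when a socket binds to port 0. For comparison, here is a minimal Python sketch of what an application-level port range could look like. This is purely illustrative: NCCL has no such option according to this thread, and `bind_in_range` and the range used in the test are hypothetical.

```python
import socket

def bind_in_range(lo, hi, host="127.0.0.1"):
    """Bind a listening socket to the first free port in [lo, hi]."""
    for port in range(lo, hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
        except OSError:
            # Port is in use or not permitted; try the next one.
            s.close()
            continue
        s.listen(1)
        return s, port
    raise OSError(f"no free port in {lo}-{hi}")
```

A fixed range like this only makes the listening ports predictable. Each port in the range would still need to be routed through the NAT on every host.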

spyroot commented 2 years ago

The main issue, I think, is that it doesn't do NAT traversal. In essence, if Docker is set up with static 1:1 NAT translation, it doesn't work, because the connection ends up half-open. If you could indicate master:port_x, worker:port_y, worker:port_z, then you could create a NAT rule that allows inbound connections from a different port, i.e., NAT translation.