NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

What's the relationship between NCCL protocols and inter-node communication? #1296

Open Alex-Wong opened 6 months ago

Alex-Wong commented 6 months ago

Hi, could you please explain how the NCCL protocols relate to inter-node communication? Are the NCCL protocols (Simple/LL/LL128) used between intra-node GPUs? It seems that the NCCL protocols are used to ensure data consistency between GPUs, but when we transfer data between GPUs on different nodes, a network protocol like RDMA checksums the data to guarantee consistency. So do we still need the NCCL protocols for inter-node communication? And how do the NCCL protocols work with RDMA?