Hi,
Could you please explain the NCCL protocols and how they relate to inter-node communication?
Are the NCCL protocols (Simple/LL/LL128) used between GPUs within a node? It seems that the NCCL protocols exist to ensure data consistency between GPUs. But when we transfer data between GPUs on different nodes, a network protocol like RDMA already checksums the data to guarantee consistency, so do we still need the NCCL protocols for inter-node communication? And how do the NCCL protocols work together with RDMA?