NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nccl hangs in clusters with mixed cpu vendors (Intel/AMD), due to inconsistent ALGO/PROTO selection in tuning #1136

Open huzhiwen93 opened 8 months ago

huzhiwen93 commented 8 months ago

We have a training task running on an A800 GPU cluster whose hosts have a mix of Intel and AMD CPUs. Reproducible hangs occurred at specific data sizes with the allReduce, allGather, and reduceScatter benchmarks of nccl-tests.

It took us some time to notice that the ranks printed inconsistent algo/proto combinations (via TRACE(NCCL_COLL, "%ld Bytes -> Algo %d proto %d time %f", info->nBytes, info->algorithm, info->protocol, minTime);). Different NCCL versions behave differently: 2.18.1 and 2.19.3 show the problem, while 2.14.3 and 2.17.1 do not.

We ran some experiments and finally located the cause of the inconsistency. There are two pieces of code in tuning.cc that consult the local CPU vendor. The first determines which row of llMaxBws is used; it exists all the way from 2.14.3 / 2.17.1 to 2.18.1 / 2.19.3 (it makes no difference in our 100G*4 environment because busBw/3.8 is smaller than any value in llMaxBws). The second uses the CPU vendor to determine the network overhead and was added in 2.18.1; it significantly changes the latency estimate, causing AMD nodes to switch proto/algo at different sizes than Intel nodes.
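To illustrate the mechanism, here is a toy, self-contained cost model (not NCCL's actual tuner; all latency and bandwidth numbers are made up) showing how a vendor-dependent network overhead shifts the crossover point between a low-latency protocol and a bandwidth protocol, so that at some sizes Intel-CPU ranks and AMD-CPU ranks pick different protocols:

```cpp
// Toy model, NOT NCCL code: time(us) = fixed latency + bytes / bandwidth.
#include <cstdio>

static double estTimeUs(double latencyUs, double bwBytesPerUs, double bytes) {
  return latencyUs + bytes / bwBytesPerUs;
}

int main() {
  const double llLatUs = 5.0,      llBw = 20000.0;      // "LL"-like: low latency, low bandwidth
  const double simpleLatUs = 10.0, simpleBw = 80000.0;  // "SIMPLE"-like: high latency, full bandwidth
  const double overheadIntelUs = 1.0, overheadAmdUs = 2.0; // vendor-dependent net overhead (illustrative)

  for (double bytes = 96 * 1024; bytes <= 256 * 1024; bytes += 32 * 1024) {
    bool intelPicksLL = estTimeUs(llLatUs, llBw, bytes) <
                        estTimeUs(simpleLatUs + overheadIntelUs, simpleBw, bytes);
    bool amdPicksLL   = estTimeUs(llLatUs, llBw, bytes) <
                        estTimeUs(simpleLatUs + overheadAmdUs, simpleBw, bytes);
    printf("%7.0f bytes: Intel rank -> %-6s AMD rank -> %-6s%s\n", bytes,
           intelPicksLL ? "LL" : "SIMPLE", amdPicksLL ? "LL" : "SIMPLE",
           intelPicksLL != amdPicksLL ? "  <-- mismatch: ranks run different code paths" : "");
  }
  return 0;
}
```

With these invented numbers the two crossover points differ, so one size in the range lands between them and the ranks disagree; with real collectives such a disagreement is what ends in a hang.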

Currently, our workaround is to set the environment variable NCCL_NET_OVERHEAD=1000 (the Intel value) to bypass the second piece of code. In the longer term, we might have to remove the first piece of code to avoid its potential influence, or provide fake CPU vendor info via NCCL_TOPO_FILE. Neither option is clean or simple.

A comment around allGather3Data in initTransportsRank says: "Make sure we align all ranks so that the tuning is consistent across ranks." It therefore seems strange that NCCL consults the local CPU type in tuning without synchronizing it. Is there (or will there be) a better way to avoid such problems in a heterogeneous environment?
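For reference, the alignment idea that comment describes can be sketched as follows (a conceptual, self-contained example, not NCCL code): every rank applies the same deterministic reduction to the same allgathered array, so the tuner input is identical everywhere.

```cpp
// Conceptual sketch of cross-rank alignment of a tuning input (not NCCL code).
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Pretend result of an allgather: each element is one rank's local view
  // (e.g. a vendor-dependent overhead, in arbitrary units).
  std::vector<int> gatheredOverhead = {1, 1, 2, 2};  // Intel, Intel, AMD, AMD

  // Every rank runs the same reduction over the same gathered data, so the
  // outcome is identical on all ranks (here: be conservative and take the max).
  int aligned = *std::max_element(gatheredOverhead.begin(), gatheredOverhead.end());

  // Feeding 'aligned' (instead of the local value) into the cost model keeps
  // the ALGO/PROTO decision consistent across ranks.
  printf("aligned overhead used by ALL ranks: %d\n", aligned);
  return 0;
}
```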

huzhiwen93 commented 2 weeks ago

@sjeaugey Thanks for your support. I've just reviewed a small part of NCCL v2.22.3 and noticed your adjustment to getNetOverhead() in tuning.cc: the local CPU query ncclTopoCpuType(comm->topo, &cpuArch, &cpuVendor, &cpuModel) was removed, so we no longer need to set NCCL_NET_OVERHEAD. However, other code in tuning.cc still calls ncclTopoCpuType(), namely ncclTopoTuneModel(), which uses the CPU type to decide latency and thus influences the choice of algo/proto. ncclTopoCpuType() cannot benefit from the CPU-info alignment performed in init.cc, so the problem may still exist (and is actually harder to work around, since there is no environment variable to set). In my view, ncclTopoTuneModel() should use comm->cpuVendor and comm->cpuArch instead, just like getNetOverhead() does.
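A minimal sketch of the suggested direction, using hypothetical stand-in types rather than the actual NCCL structures or the shipped patch: the tuner should read CPU info from fields that were already aligned across ranks at init time, instead of re-querying the local host.

```cpp
// Stand-in types only; field names mirror the comm->cpuVendor / comm->cpuArch
// suggestion from the discussion, not NCCL's real definitions.
#include <cstdio>

struct Comm {
  int cpuArch;    // assumed to be aligned across ranks during init
  int cpuVendor;  // assumed to be aligned across ranks during init
};

// What the issue argues against: a per-rank local query (may differ per host).
int localCpuVendorQuery() { return 2; }

int vendorForTuning_local(const Comm&)        { return localCpuVendorQuery(); }  // can diverge
int vendorForTuning_aligned(const Comm& comm) { return comm.cpuVendor; }         // identical everywhere

int main() {
  Comm comm{/*cpuArch=*/1, /*cpuVendor=*/1};  // value agreed on by all ranks
  printf("local query: %d   aligned field: %d\n",
         vendorForTuning_local(comm), vendorForTuning_aligned(comm));
  return 0;
}
```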

sjeaugey commented 2 weeks ago

Thanks for the feedback. Indeed it looks like an oversight. We'll look into it.

marksantesson commented 1 week ago

Hi @huzhiwen93, here is a patch which should fix the issue and which we'll apply to NCCL 2.24. Feel free to give it a try and let us know whether you see any remaining issue. Thanks!

0001-Architecture-mismatch-further-changes.patch.txt

huzhiwen93 commented 4 days ago

@sjeaugey @marksantesson Hi, I've just encountered another hang, this time caused by graphs[a]->pattern not being synchronized across ranks.

In tuning.cc, NCCL treats Balanced Tree / Split Tree / Tree differently: if (a == NCCL_ALGO_TREE && graphs[a]->pattern == NCCL_TOPO_PATTERN_TREE) busBw *= .85. But the patterns are not aligned across ranks by an operation such as graphs[a]->pattern = std::max(allGather3Data[i].graphInfo[a].pattern, graphs[a]->pattern) in init.cc.
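To make the effect concrete, here is a toy, self-contained illustration (not NCCL code; the pattern values and bandwidth number are invented) of how an unaligned pattern applies the .85 busBw factor on some ranks but not others, shifting their tuning decisions apart:

```cpp
// Toy illustration only: per-rank tree patterns that were never aligned.
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> gatheredPattern = {2, 1, 2, 1};  // pretend allgather result; 2 ~ plain tree, 1 ~ split tree
  const double baseBusBw = 100.0;                   // GB/s, made up

  for (size_t r = 0; r < gatheredPattern.size(); ++r) {
    // Each rank applies the tree bandwidth factor based on its own local pattern,
    // so ranks 0/2 and ranks 1/3 feed different busBw values into the tuner.
    double busBw = baseBusBw * (gatheredPattern[r] == 2 ? 0.85 : 1.0);
    printf("rank %zu: local pattern %d -> tree busBw %.1f GB/s\n",
           r, gatheredPattern[r], busBw);
  }
  // Reducing the gathered patterns deterministically (e.g. the std::max quoted
  // above) would make every rank apply the same factor.
  return 0;
}
```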