huzhiwen93 opened 8 months ago
@sjeaugey Thanks for your support. I've just reviewed a small part of NCCL v2.22.3 and noticed the adjustment you made to getNetOverhead() in tuning.cc: the local CPU query ncclTopoCpuType(comm->topo, &cpuArch, &cpuVendor, &cpuModel) has been removed, which means we no longer need to set NCCL_NET_OVERHEAD.

But there is still other code in tuning.cc that uses ncclTopoCpuType(), namely ncclTopoTuneModel(), which uses the CPU type to decide latency and thus influences the choice of algo/proto. That call to ncclTopoCpuType() cannot benefit from the cpu-info alignment done in init.cc, so the problem may still exist (and is actually much tougher here, since there is no environment variable to override it). In my view, we should use comm->cpuVendor and comm->cpuArch there instead, just as getNetOverhead() now does, for example:
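A sketch with simplified stand-in types, not the literal NCCL source; pickLatencyTable is a hypothetical name for the latency-selection step inside ncclTopoTuneModel():

```cpp
// Sketch only: stand-in struct, not NCCL's real ncclComm.
struct ncclComm {
  int cpuArch;    // aligned across ranks in init.cc since 2.22.3
  int cpuVendor;  // already consumed by getNetOverhead()
};

// Before (2.22.3, simplified): each rank queries its OWN CPU, so a mixed
// Intel/AMD job derives different latency tables on different ranks:
//   int cpuArch, cpuVendor, cpuModel;
//   NCCLCHECK(ncclTopoCpuType(comm->topo, &cpuArch, &cpuVendor, &cpuModel));

// Suggested: read the already-aligned values, exactly like getNetOverhead().
void pickLatencyTable(struct ncclComm* comm) {
  int cpuArch = comm->cpuArch;
  int cpuVendor = comm->cpuVendor;
  // ... choose the hwLat/latency constants from cpuArch/cpuVendor as before;
  // now every rank agrees because the inputs were synchronized at init.
  (void)cpuArch; (void)cpuVendor;
}
```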
Thanks for the feedback. Indeed it looks like an oversight. We'll look into it.
Hi @huzhiwen93, here is a patch which should fix the issue and which we'll apply to NCCL 2.24. Feel free to give it a try and let us know whether you see any remaining issue. Thanks!
@sjeaugey @marksantesson Hi, I've just encountered another hang issue, this time caused by graphs[a]->pattern not being synchronized among ranks.

In tuning.cc, NCCL treats Balanced-Tree/Split-Tree/Tree differently: if (a == NCCL_ALGO_TREE && graphs[a]->pattern == NCCL_TOPO_PATTERN_TREE) busBw *= .85. But graphs[a]->pattern is never aligned across ranks in init.cc by an operation such as graphs[a]->pattern = std::max(allGather3Data[i].graphInfo[a].pattern, graphs[a]->pattern), the way other graph fields are, so some ranks apply the .85 factor while others do not. Something like the following would close the gap:
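A sketch with stand-in types (the real loop lives in initTransportsRank() in init.cc, where other graph fields are already reduced across ranks; std::max is only a placeholder, any reduction that is deterministic and identical on all ranks would do):

```cpp
#include <algorithm>

enum { NCCL_NUM_ALGORITHMS = 6 };  // stand-in; matches our 2.19.x headers
struct GraphInfo { int pattern; };
struct PeerData { GraphInfo graphInfo[NCCL_NUM_ALGORITHMS]; };  // stand-in for allGather3Data[i]
struct TopoGraph { int pattern; };

// Hypothetical alignment step: agree on one pattern per algorithm, so the
// "busBw *= .85" branch in tuning.cc fires on all ranks or on none.
void alignPatterns(const PeerData* allGather3Data, TopoGraph** graphs, int nranks) {
  for (int i = 0; i < nranks; i++) {
    for (int a = 0; a < NCCL_NUM_ALGORITHMS; a++) {
      graphs[a]->pattern = std::max(allGather3Data[i].graphInfo[a].pattern, graphs[a]->pattern);
    }
  }
}
```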
We have a training task running on an A800 GPU cluster, but with Intel CPUs on some hosts and AMD CPUs on others. Reproducible hangs happened at some specific data sizes with the allReduce, allGather, and reduceScatter tests of nccl-tests.

It took us some time to notice that the ranks output inconsistent algo/proto combinations (TRACE(NCCL_COLL, "%ld Bytes -> Algo %d proto %d time %f", info->nBytes, info->algorithm, info->protocol, minTime);). Different versions of NCCL behave differently: 2.18.1 and 2.19.3 have this problem while 2.14.3 and 2.17.1 do not.

We did some experiments and finally located the cause of the inconsistency. There are two places in tuning.cc that consider the local CPU vendor. The first determines which row of llMaxBws is used; it exists all the way from 2.14.3 / 2.17.1 to 2.18.1 / 2.19.3 (it makes no difference in our 100G*4 environment because busBw/3.8 is smaller than every value in llMaxBws). The second uses the CPU vendor to determine the network overhead and was added in 2.18.1 (it makes a significant difference in latency evaluation, causing the AMD nodes to switch proto/algo at different sizes than the Intel nodes); see the sketch below.

Currently, our workaround is to set the environment variable NCCL_NET_OVERHEAD=1000 (the Intel value) to bypass the second code path. Longer term, we might have to remove the first code path to avoid its potential influence, or provide fake CPU vendor info via NCCL_TOPO_FILE. Neither way is clean or simple.
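For reference, a sketch of our mental model of the 2.18/2.19 net-overhead behavior (a paraphrase of what we observed in traces, not the literal NCCL source; netOverheadUs and the vendor enum are stand-ins, and we read NCCL_NET_OVERHEAD as nanoseconds):

```cpp
#include <cstdlib>

enum { VENDOR_INTEL = 1, VENDOR_AMD = 2 };  // stand-ins for NCCL_TOPO_CPU_VENDOR_*

// Hypothetical paraphrase of the vendor-dependent overhead added in 2.18.1.
float netOverheadUs(int localCpuVendor) {
  // NCCL_NET_OVERHEAD (nanoseconds) overrides the CPU query on every rank,
  // which is why exporting NCCL_NET_OVERHEAD=1000 restores consistency.
  const char* env = std::getenv("NCCL_NET_OVERHEAD");
  if (env) return std::atol(env) * 1e-3f;
  // Otherwise each rank queries its OWN CPU: ~1 us on Intel, ~2 us on AMD in
  // our measurements. Mixed clusters then disagree on latency, hence on the
  // sizes at which algo/proto switch -- and hang when ranks pick different
  // combinations.
  return (localCpuVendor == VENDOR_AMD) ? 2.0f : 1.0f;
}
```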
A comment around allGather3Data in the function initTransportsRank says: "Make sure we align all ranks so that the tuning is consistent across ranks". It therefore feels strange to us that NCCL considers the local CPU type in tuning without synchronization. Is there (or will there be) a better way to avoid such problems in heterogeneous environments? For example, the CPU info could ride along with the data that initTransportsRank already allgathers and be reduced to a common value before tuning runs:
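A purely hypothetical sketch of what we mean by "synchronize" (the cpuArch/cpuVendor fields and alignCpuInfo are invented names, not NCCL's):

```cpp
#include <algorithm>

// Hypothetical: extend the struct that initTransportsRank() already
// exchanges with fields for each rank's locally detected CPU info.
struct AllGather3Data {
  // ...existing graph/tuning fields...
  int cpuArch;
  int cpuVendor;
};

// After the allgather, reduce to one agreed CPU type before tuning runs.
void alignCpuInfo(const AllGather3Data* all, int nranks, int* archOut, int* vendorOut) {
  int arch = all[0].cpuArch;
  int vendor = all[0].cpuVendor;
  for (int i = 1; i < nranks; i++) {
    // Any reduction works as long as it is deterministic and identical on
    // every rank, so that tuning sees the same inputs everywhere.
    arch = std::max(arch, all[i].cpuArch);
    vendor = std::max(vendor, all[i].cpuVendor);
  }
  *archOut = arch;
  *vendorOut = vendor;
}
```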