DeepRec-AI / HybridBackend

A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters

Error in multi-GPU single-machine mode #154

dixingxing0 closed this issue 1 year ago

dixingxing0 commented 1 year ago

Current behavior

We tried to train with multiple GPUs on a single machine and hit a crash. Here is the stack from the core dump:

#0  __memset_avx2_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:151
#1  0x00007f4b56654d7c in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#2  0x00007f4b566573ae in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#3  0x00007f4b5663c683 in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#4  0x00007f4b5662d7b4 in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#5  0x00007f4b5662eb3d in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#6  0x00007f4b5662f221 in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#7  0x00007f4b5662f34b in ncclCommInitRank () from /lib/x86_64-linux-gnu/libnccl.so.2
#8  0x00007f4b62ffa982 in tensorflow::hybridbackend::NcclCollective::Create (this=0x7f497c02d1b0, id=...) at ./hybridbackend/tensorflow/distribute/collective.h:78
#9  0x00007f4b630b358f in tensorflow::hybridbackend::CreateNcclCollectiveOp::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>)::{lambda()#1}::operator()() const (
    __closure=0x7f461400d3c0) at hybridbackend/tensorflow/distribute/nccl/nccl_create.cc:98
#10 0x00007f4b630dcc6c in std::function<void ()>::operator()() const (this=0x7f498ebe9938) at /usr/include/c++/9/bits/std_function.h:683
#11 tensorflow::hybridbackend::Stream::<lambda()>::operator() (__closure=0x7f498ebe9920) at hybridbackend/tensorflow/common/stream.cc:106
#12 std::_Function_handler<void(), tensorflow::hybridbackend::Stream::LaunchUntilComputeDone(tensorflow::OpKernelContext*, std::function<void()>)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#13 0x00007f4be7c6978d in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#14 0x00007f4be7c6474c in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#15 0x00007f4be66a9de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007f4ca0272609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007f4ca03ac133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
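
Since the crash happens inside ncclCommInitRank, NCCL's own debug logging may show where initialization goes wrong. A diagnostic run (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; the rest matches our launch command below):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT CUDA_VISIBLE_DEVICES=0,1,2,3 COLLECTIVE_STRATEGY=hb python3 -m tensorflow.python.distribute.launch python3 run.py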

Expected behavior

Training runs without error.

System information

GPU memory: 16 GB; container memory: 100 GB

Code to reproduce


import tensorflow as tf
from tensorflow.python.distribute.group_embedding_collective_strategy import CollectiveStrategy

strategy = CollectiveStrategy()
with strategy.scope(), tf.Graph().as_default():
    ...  # model definition and training loop elided

# Run with: COLLECTIVE_STRATEGY=hb


Launch command:

CUDA_VISIBLE_DEVICES=0,1,2,3 COLLECTIVE_STRATEGY=hb python3 -m tensorflow.python.distribute.launch python3 run.py 
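
To rule out a broken NCCL installation independent of HybridBackend, a minimal sanity check can call ncclCommInitAll directly. This is a sketch only, assuming libnccl.so.2 is on the loader path (as it is in the stack above) and 4 visible GPUs:

import ctypes

# Load the same NCCL library that appears in the stack trace.
nccl = ctypes.CDLL("libnccl.so.2")
nccl.ncclGetErrorString.restype = ctypes.c_char_p
nccl.ncclCommDestroy.argtypes = [ctypes.c_void_p]

ndev = 4  # matches CUDA_VISIBLE_DEVICES=0,1,2,3
comms = (ctypes.c_void_p * ndev)()

# ncclCommInitAll(comms, ndev, NULL) initializes one communicator
# per device; NULL means use devices 0..ndev-1.
rc = nccl.ncclCommInitAll(comms, ndev, None)
if rc != 0:
    print("NCCL init failed:", nccl.ncclGetErrorString(rc).decode())
else:
    print("NCCL initialized on", ndev, "devices")
    for c in comms:
        nccl.ncclCommDestroy(c)

If this script also crashes or fails, the problem is in the NCCL/driver setup rather than in HybridBackend's NcclCollective::Create.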

Willing to contribute

Yes