Current behavior
We tried to train with multiple GPUs on a single machine and the process crashed. Here is the stack from the core dump:
#0 __memset_avx2_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:151
#1 0x00007f4b56654d7c in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#2 0x00007f4b566573ae in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#3 0x00007f4b5663c683 in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#4 0x00007f4b5662d7b4 in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#5 0x00007f4b5662eb3d in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#6 0x00007f4b5662f221 in ?? () from /lib/x86_64-linux-gnu/libnccl.so.2
#7 0x00007f4b5662f34b in ncclCommInitRank () from /lib/x86_64-linux-gnu/libnccl.so.2
#8 0x00007f4b62ffa982 in tensorflow::hybridbackend::NcclCollective::Create (this=0x7f497c02d1b0, id=...) at ./hybridbackend/tensorflow/distribute/collective.h:78
#9 0x00007f4b630b358f in tensorflow::hybridbackend::CreateNcclCollectiveOp::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>)::{lambda()#1}::operator()() const (
__closure=0x7f461400d3c0) at hybridbackend/tensorflow/distribute/nccl/nccl_create.cc:98
#10 0x00007f4b630dcc6c in std::function<void ()>::operator()() const (this=0x7f498ebe9938) at /usr/include/c++/9/bits/std_function.h:683
#11 tensorflow::hybridbackend::Stream::<lambda()>::operator() (__closure=0x7f498ebe9920) at hybridbackend/tensorflow/common/stream.cc:106
#12 std::_Function_handler<void(), tensorflow::hybridbackend::Stream::LaunchUntilComputeDone(tensorflow::OpKernelContext*, std::function<void()>)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#13 0x00007f4be7c6978d in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
from /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#14 0x00007f4be7c6474c in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#15 0x00007f4be66a9de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007f4ca0272609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007f4ca03ac133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
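The crash happens inside ncclCommInitRank, i.e. while NcclCollective::Create is building the NCCL communicator, before any collective actually runs. As a hedged diagnostic sketch (our suggestion, not part of the original failing job), NCCL's standard NCCL_DEBUG / NCCL_DEBUG_SUBSYS environment variables can be set before the graph is built to see how far initialization gets:

# Diagnostic sketch: turn on NCCL's own logging before any collective is
# created, so the failing initialization step shows up on stderr.
# NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT"

The same variables can equally be exported in the launch command instead of being set in Python.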
Expected behavior
Training completes without error.
System information
GPU model and memory: 16GB GPU memory, 100GB container memory
OS Platform:
Docker version:
GCC/CUDA/cuDNN version: CUDA 11.6
Python/conda version:
TensorFlow + DeepRec version: 1.15.5+deeprec2306
HybridBackend version: hybridbackend-deeprec-cu116 1.0.0
Code to reproduce
# Run under: COLLECTIVE_STRATEGY=hb
import tensorflow as tf
from tensorflow.python.distribute.group_embedding_collective_strategy import CollectiveStrategy

strategy = CollectiveStrategy()
with strategy.scope(), tf.Graph().as_default():
    ...  # model definition and training loop elided
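For reference, a self-contained sketch of what the elided body could look like; the embedding table, sizes, and optimizer below are illustrative placeholders, not the actual training graph from our job:

# Hypothetical minimal graph under the strategy scope (illustrative only).
import tensorflow as tf
from tensorflow.python.distribute.group_embedding_collective_strategy import CollectiveStrategy

strategy = CollectiveStrategy()
with strategy.scope(), tf.Graph().as_default():
    ids = tf.random.uniform([1024], maxval=100000, dtype=tf.int64)
    emb = tf.get_variable("emb", shape=[100000, 64], dtype=tf.float32)
    loss = tf.reduce_mean(tf.nn.embedding_lookup(emb, ids))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    # In our run the crash fired during session execution, when the NCCL
    # create op invoked ncclCommInitRank (see the stack trace above).
    with tf.train.MonitoredTrainingSession() as sess:
        sess.run(train_op)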
Willing to contribute
Yes