NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL error running distributed tensorflow with slurm + pyxis: NCCL WARN Call to ibv_create_qp failed #562

Open andrew-johnson-melb opened 3 years ago

andrew-johnson-melb commented 3 years ago

System information

Distributed training fails when it is launched via Slurm (using srun).

The code runs inside an enroot container. Because the job is launched through Slurm, the container has a number of Slurm-specific environment variables set.

Using MirroredStrategy to distribute training fails with NCCL errors, even on a simple example.

Note that several other distribution options do work, as highlighted in the code.

Note: this code works fine outside of the Slurm environment (in the exact same container). The Slurm environment variables seem to be causing a problem for NCCL.
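
One way to test the environment-variable hypothesis is to capture the same snapshot once inside the srun-launched container and once in the container started directly, then compare the two. The sketch below is a diagnostic aid and not part of the original report: the prefixes it filters on and the helper names `snapshot` and `diff_env` are illustrative assumptions. It also records the locked-memory limit (RLIMIT_MEMLOCK), because InfiniBand verbs calls pin memory and are commonly reported to fail when that limit is small; whether that applies to this system is unconfirmed.

```python
import json
import os
import resource

# Assumed prefixes of interest; adjust for the launcher and plugins in use.
PREFIXES = ("SLURM_", "NCCL_")


def snapshot(environ=None):
    """Collect launcher-related environment variables and the memlock limit."""
    environ = os.environ if environ is None else environ
    selected = {k: v for k, v in environ.items() if k.startswith(PREFIXES)}
    # (soft, hard) limits in bytes; resource.RLIM_INFINITY means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {"env": selected, "memlock": [soft, hard]}


def diff_env(inside, outside):
    """Return {name: (value_inside, value_outside)} for variables that differ."""
    keys = set(inside) | set(outside)
    return {
        k: (inside.get(k), outside.get(k))
        for k in sorted(keys)
        if inside.get(k) != outside.get(k)
    }


if __name__ == "__main__":
    # Run once under srun and once without it, and save each output to a file.
    print(json.dumps(snapshot(), indent=2, sort_keys=True))
```

Saving the JSON from both runs and passing the two `env` sections to `diff_env` shows which variables exist only under srun; unsetting those one at a time before the script starts would narrow down whether any of them affects NCCL.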

The srun command looks like this:

srun \
        --container-image $container_path \
        -N1 \
        --nodelist=$node_name \
        --gpus-per-node=$gpus_per_node \
        --cpus-per-task=$cpus_per_task \
        --mem=$DEFAULT_RAM \

#!/usr/bin/env python
# coding: utf-8

# Can't use tf.distribute.MirroredStrategy in the srun (Slurm) environment

# Tried with tf 2.5 and tf nightly.

import tensorflow as tf

# Force dynamic memory growth
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

print(tf.__version__)

# Option 1. NCCL error in the Slurm environment. Works fine inside the enroot container (not submitted via srun).
strategy = tf.distribute.MirroredStrategy()

# Option 2. Not using NCCL. Works.
#strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

# Option 3. Works in the Slurm environment. Needs to be optimized.
#slurm_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
#strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=slurm_resolver)

# Option 4. Works in the Slurm environment.
#strategy = tf.distribute.MultiWorkerMirroredStrategy()

from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

with strategy.scope():

    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))

    model.add(layers.Dense(10))
    # ADD sync bn..
    model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)

Error

. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
2021-08-31 15:13:58.031219: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-31 15:13:58.050668: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2245715000 Hz
Epoch 1/10
2021-08-31 15:14:03.739139: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-31 15:14:04.423163: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:05.254400: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:05.933033: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-31 15:14:06.133786: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:07.302772: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:07.895167: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-31 15:14:08.600313: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:09.692392: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:10.554939: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:11.510503: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:12.416170: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
hai-a100-3:3778851:3779455 [6] NCCL INFO Bootstrap : Using enp226s0:10.16.2.21<0>
hai-a100-3:3778851:3779455 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
hai-a100-3:3778851:3779455 [6] NCCL INFO P2P plugin IBext
hai-a100-3:3778851:3779455 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE [5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>
hai-a100-3:3778851:3779455 [6] NCCL INFO Using network IBext
NCCL version 2.8.3+cudaCUDA_MAJOR.CUDA_MINOR

hai-a100-3:3778851:3779883 [4] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779883 [4] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779884 [5] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed

hai-a100-3:3778851:3779885 [6] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779884 [5] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779880 [1] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:3778851:3779880 [1] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2

hai-a100-3:3778851:3779886 [7] ibvwrap.c:118 NCCL WARN Call to ibv_create_cq failed
hai-a100-3:3778851:3779886 [7] NCCL INFO ib_plugin.c:174 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:3778851:3779879 [0] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779881 [2] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:3778851:3779879 [0] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO ib_plugin.c:322 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:22 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:52 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:3778851:3779881 [2] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779882 [3] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779882 [3] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
2021-08-31 15:14:17.616013: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616262: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616307: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616347: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616383: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616413: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616444: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616475: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
Traceback (most recent call last):
  File "basic_distributed_minst_v4.py", line 102, in <module>
    history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
  (0) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
  (1) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_291]]
  (2) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_295]]
  (3) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_299]]
  (4) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_303]]
  (5) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_307]]
  (6) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_311]]
  (7) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_315]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_9445]
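
The NCCL warnings in this log are interleaved across threads, so it can help to tally which verbs call failed on which device. The following is a minimal sketch, assuming the line format shown above, where the bracketed number after the host:pid:tid prefix is the device index; `summarize_failures` is a hypothetical helper name, not part of NCCL or TensorFlow.

```python
import re
from collections import defaultdict

# Matches lines such as:
# hai-a100-3:3778851:3779883 [4] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
WARN_RE = re.compile(r"\[(\d+)\] \S+ NCCL WARN Call to (\w+) failed")


def summarize_failures(log_text):
    """Map each failed call name to the sorted device indices that reported it."""
    failures = defaultdict(set)
    for line in log_text.splitlines():
        match = WARN_RE.search(line)
        if match:
            failures[match.group(2)].add(int(match.group(1)))
    return {call: sorted(devices) for call, devices in failures.items()}


if __name__ == "__main__":
    # Three lines copied from the log above; pass the full log text in practice.
    sample = "\n".join([
        "hai-a100-3:3778851:3779883 [4] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed",
        "hai-a100-3:3778851:3779880 [1] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed",
        "hai-a100-3:3778851:3779886 [7] ibvwrap.c:118 NCCL WARN Call to ibv_create_cq failed",
    ])
    print(summarize_failures(sample))
```

Applied to the complete log above, the tally is ibv_create_qp on devices 0, 3, 4, 5 and 6, ibv_reg_mr on devices 1 and 2, and ibv_create_cq on device 7: every device fails, but not on the same call.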

Error when using 21.07 container and tf-nightly

 Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
Epoch 1/10
2021-09-01 09:42:11.001143: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:12.030493: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:13.145586: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:14.594286: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:15.853778: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:17.049535: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2021-09-01 09:42:17.237328: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:18.567093: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:19.610511: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
hai-a100-3:376586:376986 [6] NCCL INFO Bootstrap : Using enp226s0:10.16.2.21<0>
hai-a100-3:376586:376986 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
hai-a100-3:376586:376986 [6] NCCL INFO P2P plugin IBext
hai-a100-3:376586:376986 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE [5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>
hai-a100-3:376586:376986 [6] NCCL INFO Using network IBext
NCCL version 2.8.3+cudaCUDA_MAJOR.CUDA_MINOR

hai-a100-3:376586:377446 [7] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377446 [7] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:376586:377440 [1] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:376586:377440 [1] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:376586:377444 [5] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377444 [5] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:376586:377439 [0] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:376586:377439 [0] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:376586:377445 [6] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377445 [6] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:376586:377443 [4] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed

hai-a100-3:376586:377442 [3] ibvwrap.c:118 NCCL WARN Call to ibv_create_cq failed
hai-a100-3:376586:377442 [3] NCCL INFO ib_plugin.c:174 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO ib_plugin.c:322 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:22 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:52 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:376586:377441 [2] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377441 [2] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
2021-09-01 09:42:23.237419: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237581: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237606: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237628: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237649: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237670: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237689: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237711: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
Traceback (most recent call last):
  File "basic_distributed_minst_v5.py", line 97, in <module>
    history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
  (0) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
  (1) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_307]]
  (2) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_311]]
  (3) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_315]]
  (4) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_291]]
  (5) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_295]]
  (6) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_299]]
  (7) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_303]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_9449]

Errors may have originated from an input operation.
Input Source operations connected to node Adam/NcclAllReduce:
In[0] Adam/split:

Operation defined at: (most recent call last)
>>>   File "basic_distributed_minst_v5.py", line 97, in <module>
>>>     history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>>     return fn(*args, **kwargs)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1185, in fit
>>>     tmp_logs = self.train_function(iterator)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 851, in train_function
>>>     return step_function(self, iterator)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 840, in step_function
>>>     outputs = model.distribute_strategy.run(run_step, args=(data,))
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
>>>     return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
>>>

Input Source operations connected to node Adam/NcclAllReduce:
In[0] Adam/split:

Operation defined at: (most recent call last)
>>>   File "basic_distributed_minst_v5.py", line 97, in <module>
>>>     history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>>     return fn(*args, **kwargs)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1185, in fit
>>>     tmp_logs = self.train_function(iterator)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 851, in train_function
>>>     return step_function(self, iterator)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 840, in step_function
>>>     outputs = model.distribute_strategy.run(run_step, args=(data,))
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
>>>     return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
>>>

Function call stack:
train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function


sjeaugey commented 3 years ago

This is probably because the Slurm daemon was launched with a low limit on locked memory. InfiniBand needs to lock a certain amount of memory for NIC communication, so the ibv_create_qp call can fail when that limit is too small. Running srun bash -c "ulimit -l" would confirm that. To fix the issue, set a higher limit in /etc/security/limits.conf and restart the Slurm daemon on the compute nodes.
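The same check can be done from inside the training process, which is useful because the limit seen by a job step may differ from the one in an interactive shell. The following is a minimal sketch using Python's standard `resource` module; the helper names `memlock_limit_report` and `format_limit` are illustrative and not part of NCCL, TensorFlow, or Slurm.

```python
import resource


def memlock_limit_report():
    """Return (soft, hard, is_unlimited) for this process's locked-memory limit.

    Values are in bytes, whereas `ulimit -l` reports KiB.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    is_unlimited = soft == resource.RLIM_INFINITY
    return soft, hard, is_unlimited


def format_limit(value):
    """Render a limit in the same units as `ulimit -l` (KiB or 'unlimited')."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


if __name__ == "__main__":
    soft, hard, is_unlimited = memlock_limit_report()
    print(f"memlock soft limit: {format_limit(soft)}")
    print(f"memlock hard limit: {format_limit(hard)}")
    if not is_unlimited:
        # Assumption: a finite limit is only a hint that this may be the cause;
        # the threshold InfiniBand actually needs depends on the setup.
        print("Locked memory is limited; this may explain the ibv_create_qp failure.")
```

Running this script once through srun and once directly in the container should show whether the two environments see different limits.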

andrew-johnson-melb commented 3 years ago

Great, thanks @sjeaugey. That did fix the error, much appreciated!

However, now it seems to be training about 40% slower than the local version. Any suggestions?

sjeaugey commented 3 years ago

The best approach would probably be to run the NCCL performance tests (nccl-tests) to see whether the performance difference comes from NCCL or from something else (e.g. CPU affinity).
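For the CPU affinity part, a quick way to compare the two environments is to print the set of CPUs the process is allowed to run on. This is a rough sketch rather than a diagnosis: `affinity_report` is a hypothetical helper, `os.sched_getaffinity` is only available on Linux, and `SLURM_CPUS_PER_TASK` is assumed to be set only when the job was submitted with `--cpus-per-task`.

```python
import os


def affinity_report():
    """Collect the CPUs this process may run on, plus the Slurm CPU request."""
    total = os.cpu_count() or 0
    if hasattr(os, "sched_getaffinity"):
        # Linux only: the CPU set the scheduler allows for the current process.
        allowed = sorted(os.sched_getaffinity(0))
    else:
        # Fallback for other platforms: assume no restriction.
        allowed = list(range(total))
    return {
        "allowed_cpus": allowed,
        "total_cpus": total,
        "slurm_cpus_per_task": os.environ.get("SLURM_CPUS_PER_TASK"),
    }


if __name__ == "__main__":
    report = affinity_report()
    print(f"CPUs visible on the node: {report['total_cpus']}")
    print(f"CPUs this process may use: {len(report['allowed_cpus'])}")
    print(f"SLURM_CPUS_PER_TASK: {report['slurm_cpus_per_task']}")
```

If the run launched through srun reports far fewer usable CPUs than the local run, the input pipeline rather than NCCL may be the bottleneck.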

andrew-johnson-melb commented 3 years ago

Solved. Thanks!

pGit1 commented 3 years ago

Solved. Thanks!

@andrew-johnson-melb How did you solve this? It would be helpful to post your solution. I'm getting this error EVERYWHERE in my code when training a model with MirroredStrategy on A100 GPUs, even though the same code works just fine on P100 cards.