NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

tf.distribute.MirroredStrategy fails when run with pyxis + slurm: causes NCCL error #51760 #60

Closed: andrew-johnson-melb closed this issue 2 years ago

andrew-johnson-melb commented 2 years ago

This is really a combined problem with Slurm + TensorFlow + pyxis. I've yet to hear anything from the TF team, so I was hoping you might have an idea @flx42 (any suggestions would be very much appreciated).

Something weird happens with NCCL inside an enroot container submitted via Slurm with pyxis. Essentially, tf.distribute.MirroredStrategy fails due to NCCL errors. I've provided all the information below.

Again, I understand this is not entirely appropriate for this repo, but any help would be great. This issue sits at the boundary between a few things.

System information

The distributed training run fails when training via Slurm (using srun).

The code is run inside an enroot container. Because it is launched through Slurm, this container has a number of Slurm-specific environment variables set.

Using MirroredStrategy to distribute training fails with NCCL errors on a simple example.

Note that a number of other distributed options do work, as highlighted in the code.

NOTE: this code works fine outside of the Slurm environment (in the exact same container). The Slurm environment variables seem to be creating an issue with NCCL.
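One way to pin down which variables matter is to dump the environment in both setups and diff them (a sketch; $container_name and the output filenames are placeholders, not from my actual setup):

$ srun --container-image $container_path -N1 bash -c 'env | sort' > env_slurm.txt
$ enroot start $container_name bash -c 'env | sort' > env_plain.txt
$ diff env_plain.txt env_slurm.txt | grep -Ei 'slurm|nccl|pmi'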

The srun command looks like

srun \
        --container-image $container_path \
        -N1 \
        --nodelist=$node_name \
        --gpus-per-node=$gpus_per_node \
        --cpus-per-task=$cpus_per_task \
        --mem=$DEFAULT_RAM \

The Python script:

#!/usr/bin/env python
# coding: utf-8

# Can't use tf.distribute.MirroredStrategy in an srun (Slurm) environment.

# Tried with TF 2.5 and tf-nightly.

import tensorflow as tf

# Force dynamic memory growth
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

print(tf.__version__)

# Option 1: NCCL error in the Slurm environment. Works fine inside the enroot container (when not submitted via srun).
strategy = tf.distribute.MirroredStrategy()

# Option 2: avoids NCCL. Works.
#strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

# Option 3: works in the Slurm environment. Needs to be optimized.
#slurm_resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
#strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=slurm_resolver)

# Option 4: works in the Slurm environment.
#strategy = tf.distribute.MultiWorkerMirroredStrategy()

from tensorflow.keras import datasets, layers, models

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0

with strategy.scope():

    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))

    model.add(layers.Dense(10))
    # TODO: add synchronized batch norm
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)

Error

. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
2021-08-31 15:13:58.031219: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-31 15:13:58.050668: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2245715000 Hz
Epoch 1/10
2021-08-31 15:14:03.739139: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-31 15:14:04.423163: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:05.254400: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:05.933033: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-31 15:14:06.133786: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:07.302772: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:07.895167: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-31 15:14:08.600313: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:09.692392: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:10.554939: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:11.510503: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-08-31 15:14:12.416170: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
hai-a100-3:3778851:3779455 [6] NCCL INFO Bootstrap : Using enp226s0:10.16.2.21<0>
hai-a100-3:3778851:3779455 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
hai-a100-3:3778851:3779455 [6] NCCL INFO P2P plugin IBext
hai-a100-3:3778851:3779455 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE [5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>
hai-a100-3:3778851:3779455 [6] NCCL INFO Using network IBext
NCCL version 2.8.3+cudaCUDA_MAJOR.CUDA_MINOR

hai-a100-3:3778851:3779883 [4] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779883 [4] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779883 [4] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779884 [5] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed

hai-a100-3:3778851:3779885 [6] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779884 [5] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779885 [6] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:3778851:3779884 [5] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779880 [1] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:3778851:3779880 [1] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2

hai-a100-3:3778851:3779886 [7] ibvwrap.c:118 NCCL WARN Call to ibv_create_cq failed
hai-a100-3:3778851:3779886 [7] NCCL INFO ib_plugin.c:174 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:3778851:3779879 [0] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779880 [1] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779881 [2] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:3778851:3779879 [0] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO ib_plugin.c:322 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:22 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:52 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779886 [7] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:3778851:3779881 [2] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779879 [0] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:3778851:3779881 [2] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:3778851:3779882 [3] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:3778851:3779882 [3] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:3778851:3779882 [3] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
2021-08-31 15:14:17.616013: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616262: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616307: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616347: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616383: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616413: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616444: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-08-31 15:14:17.616475: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at nccl_ops.cc:104 : Internal: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
Traceback (most recent call last):
  File "basic_distributed_minst_v4.py", line 102, in <module>
    history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
  (0) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
  (1) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_291]]
  (2) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_295]]
  (3) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_299]]
  (4) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_303]]
  (5) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_307]]
  (6) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_311]]
  (7) Internal:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce (defined at basic_distributed_minst_v4.py:102) ]]
         [[Adam/Adam/group_deps/_315]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_9445]

Error when using the 21.07 container and tf-nightly

 Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
Epoch 1/10
2021-09-01 09:42:11.001143: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:12.030493: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:13.145586: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:14.594286: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:15.853778: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:17.049535: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2021-09-01 09:42:17.237328: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:18.567093: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
2021-09-01 09:42:19.610511: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8202
hai-a100-3:376586:376986 [6] NCCL INFO Bootstrap : Using enp226s0:10.16.2.21<0>
hai-a100-3:376586:376986 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
hai-a100-3:376586:376986 [6] NCCL INFO P2P plugin IBext
hai-a100-3:376586:376986 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE [5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>
hai-a100-3:376586:376986 [6] NCCL INFO Using network IBext
NCCL version 2.8.3+cudaCUDA_MAJOR.CUDA_MINOR

hai-a100-3:376586:377446 [7] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377446 [7] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:376586:377440 [1] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:376586:377440 [1] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:376586:377444 [5] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377444 [5] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2

hai-a100-3:376586:377439 [0] ibvwrap.c:106 NCCL WARN Call to ibv_reg_mr failed
hai-a100-3:376586:377439 [0] NCCL INFO ib_plugin.c:284 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377440 [1] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377446 [7] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377444 [5] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:376586:377445 [6] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377445 [6] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377445 [6] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377439 [0] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:376586:377443 [4] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed

hai-a100-3:376586:377442 [3] ibvwrap.c:118 NCCL WARN Call to ibv_create_cq failed
hai-a100-3:376586:377442 [3] NCCL INFO ib_plugin.c:174 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO ib_plugin.c:322 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:22 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:52 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377442 [3] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
hai-a100-3:376586:377443 [4] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]

hai-a100-3:376586:377441 [2] ibvwrap.c:130 NCCL WARN Call to ibv_create_qp failed
hai-a100-3:376586:377441 [2] NCCL INFO ib_plugin.c:196 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO ib_plugin.c:273 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:21 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:51 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/init.cc:310 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/init.cc:577 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/init.cc:878 -> 2
hai-a100-3:376586:377441 [2] NCCL INFO external/nccl_archive/src/group.cc:72 -> 2 [Async thread]
2021-09-01 09:42:23.237419: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237581: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237606: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237628: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237649: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237670: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237689: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
2021-09-01 09:42:23.237711: W tensorflow/core/framework/op_kernel.cc:1694] OP_REQUIRES failed at nccl_ops.cc:104 : INTERNAL: NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
Traceback (most recent call last):
  File "basic_distributed_minst_v5.py", line 97, in <module>
    history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: 8 root error(s) found.
  (0) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
  (1) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_307]]
  (2) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_311]]
  (3) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_315]]
  (4) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_291]]
  (5) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_295]]
  (6) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_299]]
  (7) INTERNAL:  NCCL: unhandled system error. Set NCCL_DEBUG=WARN for detail.
         [[node Adam/NcclAllReduce
 (defined at /usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py:151)
]]
         [[Adam/Adam/group_deps/_303]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_9449]

Errors may have originated from an input operation.
Input Source operations connected to node Adam/NcclAllReduce:
In[0] Adam/split:

Operation defined at: (most recent call last)
>>>   File "basic_distributed_minst_v5.py", line 97, in <module>
>>>     history = model.fit(train_images, train_labels, epochs=10, steps_per_epoch=100)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>>     return fn(*args, **kwargs)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1185, in fit
>>>     tmp_logs = self.train_function(iterator)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 851, in train_function
>>>     return step_function(self, iterator)
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 840, in step_function
>>>     outputs = model.distribute_strategy.run(run_step, args=(data,))
>>>
>>>   File "/usr/local/lib/python3.8/dist-packages/keras/optimizer_v2/utils.py", line 151, in _all_reduce_sum_fn
>>>     return distribution.extended.batch_reduce_to(tf.distribute.ReduceOp.SUM,
>>>

[The same "Input Source operations connected to node Adam/NcclAllReduce" block is repeated verbatim for each of the 8 root errors; duplicates omitted.]

Function call stack:
train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function

flx42 commented 2 years ago

This might be a bit tricky to debug, but we can try :)

MOFED kernel version

What's the version of MLNX OFED on the system? (kernel side). A command like this one might help:

$ dpkg -l | grep mlnx-ofed-kernel
ii  mlnx-ofed-kernel-dkms                  5.3-OFED.5.3.1.0.0.1                    all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                 5.3-OFED.5.3.1.0.0.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules

InfiniBand devices

 NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE [5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>

I'm not sure if that's the root cause, but it might be a problem to have the mlx5_4 and mlx5_10 devices exposed here, particularly in RoCE mode.

If you have ENROOT_RESTRICT_DEV y in enroot.conf, you can exclude those devices by running your container with MELLANOX_VISIBLE_DEVICES=1,3,6,8. Verify that the NET/IB line above only shows IB devices after this change.
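For example, exporting the variable in the job environment so that srun passes it through to the container (a sketch; the trailing arguments are placeholders):

$ export MELLANOX_VISIBLE_DEVICES=1,3,6,8
$ srun --container-image $container_path <other flags> <command>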

nccl-tests

Can you build https://github.com/NVIDIA/nccl-tests.git inside your Dockerfile and try the all_reduce_perf binary? Here is my Dockerfile recipe:

RUN cd /usr/local/src && \
    NCCL_TESTS_VERSION="1f8f5416863a3082975b10eaa05fecee6fe870c8" && \
    curl --proto '=https' -fSsL https://github.com/NVIDIA/nccl-tests/archive/${NCCL_TESTS_VERSION}.tar.gz | tar xz && \
    cd nccl-tests-${NCCL_TESTS_VERSION} && \
    make MPI=1 && \
    install -m 755 build/all_* build/broadcast_* build/reduce_* /usr/local/bin

Then to launch it:

$ srun --container-image=<image> --mpi=pmix --ntasks-per-node=8 all_reduce_perf -b 4 -e 4G -f 2 -c 1 -n 100

I suspect this will also fail, and that would be a good data point since TF would be out of the picture.
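If it does fail the same way, one extra data point (standard NCCL environment variables, nothing pyxis-specific) would be to rerun with the InfiniBand transport disabled so NCCL falls back to plain sockets:

$ NCCL_IB_DISABLE=1 srun --container-image=<image> --mpi=pmix --ntasks-per-node=8 all_reduce_perf -b 4 -e 4G -f 2 -c 1 -n 100

If that run passes while the default run fails, the problem is isolated to the IB/verbs path.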

andrew-johnson-melb commented 2 years ago

Hey Felix, thanks for the fast response!

  1. Version of MLNX OFED:
ii  mlnx-ofed-kernel-dkms                                       5.1-OFED.5.1.2.5.8.1                    all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                                      5.1-OFED.5.1.2.5.8.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules

Ah, this may be relevant: we currently don't have the pmix plugin.

srun: MPI types are...
srun: none
srun: pmi2
srun: cray_shasta

But using pmi2, the all_reduce test seems to work.

Sorry, where/when would I set MELLANOX_VISIBLE_DEVICES=1,3,6,8? Thanks.

andrew_johnson@mgmt01:~/git/ct_brain$ srun --nodelist hai-a100-1 --container-image=/mnt/shared/sqsh_files/ctb_nv_test.sqsh --mpi=pmi2 --ntasks-per-node=8 all_reduce_perf -b 4 -e 4G -f 2 -c 1 -n 10
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1 
#
# Using devices
#   Rank  0 Pid 3022974 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022976 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022975 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022969 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022970 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022973 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022971 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#   Rank  0 Pid 3022972 on hai-a100-1 device  0 [0x07] A100-SXM-80GB
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           4             1     float     sum     5.08    0.00    0.00  0e+00     0.56    0.01    0.00  0e+00
hai-a100-1: Test CUDA failure common.cu:763 'out of memory'
 .. hai-a100-1 pid 3022973: Test failure common.cu:1007
 .. hai-a100-1 pid 3022973: Test failure common.cu:925
           8             2     float     sum     5.06    0.00    0.00  0e+00     0.68    0.01    0.00  0e+00
          16             4     float     sum     5.12    0.00    0.00  0e+00     0.69    0.02    0.00  0e+00
          32             8     float     sum     4.85    0.01    0.00  0e+00     0.62    0.05    0.00  0e+00
          64            16     float     sum     4.91    0.01    0.00  0e+00     0.65    0.10    0.00  0e+00
         128            32     float     sum     4.99    0.03    0.00  0e+00     0.60    0.21    0.00  0e+00
         256            64     float     sum     4.81    0.05    0.00  0e+00     0.68    0.38    0.00  0e+00
         512           128     float     sum     4.89    0.10    0.00  0e+00     0.65    0.79    0.00  0e+00
        1024           256     float     sum     5.14    0.20    0.00  0e+00     0.70    1.47    0.00  0e+00
        2048           512     float     sum     5.10    0.40    0.00  0e+00     0.65    3.14    0.00  0e+00
        4096          1024     float     sum     5.01    0.82    0.00  0e+00     0.67    6.10    0.00  0e+00
        8192          2048     float     sum     4.85    1.69    0.00  0e+00     0.55   14.77    0.00  0e+00
       16384          4096     float     sum     4.53    3.62    0.00  0e+00     0.57   28.89    0.00  0e+00
       32768          8192     float     sum     4.45    7.37    0.00  0e+00     0.53   61.30    0.00  0e+00
       65536         16384     float     sum     4.49   14.59    0.00  0e+00     0.53  122.66    0.00  0e+00
      131072         32768     float     sum     4.45   29.48    0.00  0e+00     0.54  242.58    0.00  0e+00
      262144         65536     float     sum     4.46   58.84    0.00  0e+00     0.50  529.11    0.00  0e+00
      524288        131072     float     sum     4.41  118.76    0.00  0e+00     0.50  1056.71    0.00  0e+00
     1048576        262144     float     sum     6.62  158.41    0.00  0e+00     0.50  2105.79    0.00  0e+00
     2097152        524288     float     sum     7.42  282.76    0.00  0e+00     0.50  4226.94    0.00  0e+00
     4194304       1048576     float     sum    10.47  400.67    0.00  0e+00     0.50  8436.87    0.00  0e+00
     8388608       2097152     float     sum    15.52  540.55    0.00  0e+00     0.50  16755.43    0.00  0e+00
hai-a100-1: Test CUDA failure common.cu:762 'out of memory'
 .. hai-a100-1 pid 3022972: Test failure common.cu:1007
 .. hai-a100-1 pid 3022972: Test failure common.cu:925
hai-a100-1: Test CUDA failure common.cu:764 'out of memory'
 .. hai-a100-1 pid 3022971: Test failure common.cu:1007
 .. hai-a100-1 pid 3022971: Test failure common.cu:925
    16777216       4194304     float     sum    25.78  650.77    0.00  0e+00     0.50  33693.25    0.00  0e+00
    33554432       8388608     float     sum    47.40  707.97    0.00  0e+00     0.50  67332.41    0.00  0e+00
    67108864      16777216     float     sum    89.94  746.18    0.00  0e+00     0.49  137735.49    0.00  0e+00
   134217728      33554432     float     sum    173.5  773.45    0.00  0e+00     0.54  247118.97    0.00  0e+00
   268435456      67108864     float     sum    341.0  787.12    0.00  0e+00     0.51  523909.39    0.00  0e+00
   536870912     134217728     float     sum    682.8  786.28    0.00  0e+00     0.49  1092889.24    0.00  0e+00
  1073741824     268435456     float     sum   1368.5  784.62    0.00  0e+00     0.49  2197903.56    0.00  0e+00
srun: error: hai-a100-1: tasks 2-4: Exited with exit code 2
  2147483648     536870912     float     sum   3031.1  708.49    0.00  0e+00     0.48  4445951.82    0.00  0e+00
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           4             1     float     sum    28.50    0.00    0.00  0e+00     0.70    0.01    0.00  0e+00
           8             2     float     sum    30.09    0.00    0.00  0e+00     0.68    0.01    0.00  0e+00
          16             4     float     sum    30.09    0.00    0.00  0e+00     0.69    0.02    0.00  0e+00
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
          32             8     float     sum    28.73    0.00    0.00  0e+00     0.79    0.04    0.00  0e+00
          64            16     float     sum    51.91    0.00    0.00  0e+00     0.79    0.08    0.00  0e+00
         128            32     float     sum    31.73    0.00    0.00  0e+00     0.70    0.18    0.00  0e+00
           4             1     float     sum    33.91    0.00    0.00  0e+00     0.70    0.01    0.00  0e+00
         256            64     float     sum    30.22    0.01    0.00  0e+00     0.71    0.36    0.00  0e+00
           8             2     float     sum    32.45    0.00    0.00  0e+00     0.69    0.01    0.00  0e+00
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         512           128     float     sum    53.34    0.01    0.00  0e+00     0.70    0.73    0.00  0e+00
          16             4     float     sum    55.57    0.00    0.00  0e+00     0.69    0.02    0.00  0e+00
        1024           256     float     sum    74.91    0.01    0.00  0e+00     0.70    1.46    0.00  0e+00
          32             8     float     sum    77.15    0.00    0.00  0e+00     0.69    0.05    0.00  0e+00
           4             1     float     sum    52.78    0.00    0.00  0e+00     0.79    0.01    0.00  0e+00
        2048           512     float     sum    53.42    0.04    0.00  0e+00     0.70    2.92    0.00  0e+00
          64            16     float     sum    55.87    0.00    0.00  0e+00     0.69    0.09    0.00  0e+00
           4             1     float     sum    31.59    0.00    0.00  0e+00     0.70    0.01    0.00  0e+00
           8             2     float     sum    31.14    0.00    0.00  0e+00     0.71    0.01    0.00  0e+00
        4096          1024     float     sum    31.80    0.13    0.00  0e+00     0.70    5.86    0.00  0e+00
         128            32     float     sum    34.26    0.00    0.00  0e+00     0.69    0.19    0.00  0e+00
           8             2     float     sum    31.62    0.00    0.00  0e+00     0.70    0.01    0.00  0e+00
          16             4     float     sum    31.27    0.00    0.00  0e+00     0.69    0.02    0.00  0e+00
        8192          2048     float     sum    31.86    0.26    0.00  0e+00     0.70   11.70    0.00  0e+00
         256            64     float     sum    34.36    0.01    0.00  0e+00     0.69    0.37    0.00  0e+00
          16             4     float     sum    31.67    0.00    0.00  0e+00     0.71    0.02    0.00  0e+00
          32             8     float     sum    31.30    0.00    0.00  0e+00     0.69    0.05    0.00  0e+00
       16384          4096     float     sum    31.94    0.51    0.00  0e+00     0.71   23.24    0.00  0e+00
         512           128     float     sum    34.46    0.01    0.00  0e+00     0.70    0.73    0.00  0e+00
          32             8     float     sum    31.63    0.00    0.00  0e+00     0.70    0.05    0.00  0e+00
          64            16     float     sum    31.45    0.00    0.00  0e+00     0.69    0.09    0.00  0e+00
       32768          8192     float     sum    32.10    1.02    0.00  0e+00     0.70   46.97    0.00  0e+00
        1024           256     float     sum    34.68    0.03    0.00  0e+00     0.69    1.49    0.00  0e+00
          64            16     float     sum    31.67    0.00    0.00  0e+00     0.72    0.09    0.00  0e+00
         128            32     float     sum    31.51    0.00    0.00  0e+00     0.69    0.19    0.00  0e+00
       65536         16384     float     sum    32.15    2.04    0.00  0e+00     0.70   93.34    0.00  0e+00
        2048           512     float     sum    34.97    0.06    0.00  0e+00     0.69    2.98    0.00  0e+00
         128            32     float     sum    31.77    0.00    0.00  0e+00     0.70    0.18    0.00  0e+00
         256            64     float     sum    31.55    0.01    0.00  0e+00     0.68    0.37    0.00  0e+00
      131072         32768     float     sum    32.29    4.06    0.00  0e+00     0.70  185.96    0.00  0e+00
        4096          1024     float     sum    35.16    0.12    0.00  0e+00     0.69    5.92    0.00  0e+00
         256            64     float     sum    31.74    0.01    0.00  0e+00     0.70    0.36    0.00  0e+00
         512           128     float     sum    31.56    0.02    0.00  0e+00     0.69    0.75    0.00  0e+00
      262144         65536     float     sum    32.43    8.08    0.00  0e+00     0.70  375.77    0.00  0e+00
        8192          2048     float     sum    35.31    0.23    0.00  0e+00     0.69   11.89    0.00  0e+00
         512           128     float     sum    31.83    0.02    0.00  0e+00     0.76    0.67    0.00  0e+00
        1024           256     float     sum    31.64    0.03    0.00  0e+00     0.69    1.49    0.00  0e+00
      524288        131072     float     sum    32.56   16.10    0.00  0e+00     0.70  744.16    0.00  0e+00
       16384          4096     float     sum    35.56    0.46    0.00  0e+00     0.70   23.56    0.00  0e+00
        1024           256     float     sum    31.86    0.03    0.00  0e+00     0.70    1.46    0.00  0e+00
        2048           512     float     sum    31.29    0.07    0.00  0e+00     0.68    3.01    0.00  0e+00
     1048576        262144     float     sum    34.51   30.38    0.00  0e+00     0.70  1487.68    0.00  0e+00
       32768          8192     float     sum    37.60    0.87    0.00  0e+00     0.69   47.28    0.00  0e+00
        2048           512     float     sum    31.93    0.06    0.00  0e+00     0.70    2.93    0.00  0e+00
        4096          1024     float     sum    32.20    0.13    0.00  0e+00     0.70    5.87    0.00  0e+00
     2097152        524288     float     sum    35.37   59.29    0.00  0e+00     0.70  2999.27    0.00  0e+00
       65536         16384     float     sum    38.56    1.70    0.00  0e+00     0.69   94.57    0.00  0e+00
        4096          1024     float     sum    32.04    0.13    0.00  0e+00     0.70    5.85    0.00  0e+00
        8192          2048     float     sum    32.34    0.25    0.00  0e+00     0.69   11.96    0.00  0e+00
     4194304       1048576     float     sum    38.50  108.93    0.00  0e+00     0.70  6002.84    0.00  0e+00
      131072         32768     float     sum    41.79    3.14    0.00  0e+00     0.69  189.02    0.00  0e+00
        8192          2048     float     sum    32.32    0.25    0.00  0e+00     0.70   11.64    0.00  0e+00
       16384          4096     float     sum    32.83    0.50    0.00  0e+00     0.69   23.79    0.00  0e+00
     8388608       2097152     float     sum    43.36  193.45    0.00  0e+00     0.70  12043.60    0.00  0e+00
      262144         65536     float     sum    46.80    5.60    0.00  0e+00     0.69  381.02    0.00  0e+00
       16384          4096     float     sum    32.74    0.50    0.00  0e+00     0.71   23.11    0.00  0e+00
       32768          8192     float     sum    33.62    0.97    0.00  0e+00     0.69   47.62    0.00  0e+00
      524288        131072     float     sum    54.32    9.65    0.00  0e+00     0.69  760.49    0.00  0e+00
    16777216       4194304     float     sum    88.02  190.61    0.00  0e+00     0.70  24049.22    0.00  0e+00
       32768          8192     float     sum    33.60    0.98    0.00  0e+00     0.70   46.82    0.00  0e+00
       65536         16384     float     sum    33.88    1.93    0.00  0e+00     0.70   93.74    0.00  0e+00
     1048576        262144     float     sum    35.11   29.86    0.00  0e+00     0.80  1318.12    0.00  0e+00
    33554432       8388608     float     sum    141.2  237.71    0.00  0e+00     0.70  48008.97    0.00  0e+00
       65536         16384     float     sum    32.65    2.01    0.00  0e+00     0.71   92.55    0.00  0e+00
      131072         32768     float     sum    35.45    3.70    0.00  0e+00     0.69  189.11    0.00  0e+00
     2097152        524288     float     sum    38.59   54.35    0.00  0e+00     0.79  2648.92    0.00  0e+00
      131072         32768     float     sum    54.13    2.42    0.00  0e+00     0.70  186.57    0.00  0e+00
    67108864      16777216     float     sum    242.2  277.10    0.00  0e+00     0.78  86105.45    0.00  0e+00
      262144         65536     float     sum    38.86    6.75    0.00  0e+00     0.69  380.24    0.00  0e+00
     4194304       1048576     float     sum    39.40  106.45    0.00  0e+00     0.77  5455.30    0.00  0e+00
      262144         65536     float     sum    43.79    5.99    0.00  0e+00     0.70  373.94    0.00  0e+00
      524288        131072     float     sum    55.07    9.52    0.00  0e+00     0.69  760.61    0.00  0e+00
     8388608       2097152     float     sum    64.96  129.13    0.00  0e+00     0.80  10542.29    0.00  0e+00
      524288        131072     float     sum    54.96    9.54    0.00  0e+00     0.70  748.00    0.00  0e+00
   134217728      33554432     float     sum    469.6  285.80    0.00  0e+00     0.78  171067.34    0.00  0e+00
     1048576        262144     float     sum    47.17   22.23    0.00  0e+00     0.69  1525.62    0.00  0e+00
    16777216       4194304     float     sum    106.8  157.03    0.00  0e+00     0.79  21215.50    0.00  0e+00
     1048576        262144     float     sum    37.19   28.20    0.00  0e+00     0.70  1501.33    0.00  0e+00
     2097152        524288     float     sum    58.30   35.97    0.00  0e+00     0.69  3060.20    0.00  0e+00
    33554432       8388608     float     sum    199.9  167.89    0.00  0e+00     0.79  42479.34    0.00  0e+00
     2097152        524288     float     sum    57.25   36.63    0.00  0e+00     0.70  2987.27    0.00  0e+00
     4194304       1048576     float     sum    62.54   67.06    0.00  0e+00     0.69  6072.45    0.00  0e+00
     4194304       1048576     float     sum    73.25   57.26    0.00  0e+00     0.70  6005.42    0.00  0e+00
    67108864      16777216     float     sum    317.7  211.25    0.00  0e+00     0.69  97004.76    0.00  0e+00
     8388608       2097152     float     sum    50.53  166.00    0.00  0e+00     0.69  12222.95    0.00  0e+00
   268435456      67108864     float     sum    992.2  270.55    0.00  0e+00     0.70  383134.40    0.00  0e+00
     8388608       2097152     float     sum    56.47  148.56    0.00  0e+00     0.79  10605.07    0.00  0e+00
    16777216       4194304     float     sum    149.5  112.22    0.00  0e+00     0.72  23312.70    0.00  0e+00
   134217728      33554432     float     sum    638.1  210.35    0.00  0e+00     0.78  171198.27    0.00  0e+00
    16777216       4194304     float     sum    160.0  104.83    0.00  0e+00     0.70  23973.27    0.00  0e+00
    33554432       8388608     float     sum    225.6  148.71    0.00  0e+00     0.79  42469.13    0.00  0e+00
    33554432       8388608     float     sum    255.1  131.52    0.00  0e+00     0.79  42323.42    0.00  0e+00
    67108864      16777216     float     sum    463.2  144.87    0.00  0e+00     0.78  85664.69    0.00  0e+00
    67108864      16777216     float     sum    485.8  138.15    0.00  0e+00     0.80  84401.99    0.00  0e+00
   268435456      67108864     float     sum   1524.6  176.07    0.00  0e+00     0.78  342838.20    0.00  0e+00
   134217728      33554432     float     sum    747.2  179.62    0.00  0e+00     0.79  170936.62    0.00  0e+00
   536870912     134217728     float     sum   2759.9  194.52    0.00  0e+00     0.70  770569.11    0.00  0e+00
   134217728      33554432     float     sum    747.7  179.50    0.00  0e+00     0.81  166310.70    0.00  0e+00
  4294967296    1073741824     float     sum   9375.2  458.12    0.00  0e+00     0.50  8517366.63    0.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
   268435456      67108864     float     sum   1676.5  160.11    0.00  0e+00     0.79  339276.36    0.00  0e+00
   268435456      67108864     float     sum   1715.2  156.50    0.00  0e+00     0.76  353185.96    0.00  0e+00
   536870912     134217728     float     sum   3070.5  174.85    0.00  0e+00     0.79  681057.62    0.00  0e+00
   536870912     134217728     float     sum   2909.6  184.51    0.00  0e+00     0.68  784669.56    0.00  0e+00
   536870912     134217728     float     sum   2978.0  180.28    0.00  0e+00     0.69  773466.61    0.00  0e+00
  1073741824     268435456     float     sum   5435.8  197.53    0.00  0e+00     0.80  1350261.97    0.00  0e+00
  1073741824     268435456     float     sum   5673.4  189.26    0.00  0e+00     0.78  1368538.76    0.00  0e+00
  1073741824     268435456     float     sum   5891.1  182.27    0.00  0e+00     0.69  1550507.32    0.00  0e+00
  1073741824     268435456     float     sum   5898.3  182.04    0.00  0e+00     0.69  1552300.57    0.00  0e+00
  2147483648     536870912     float     sum    11819  181.69    0.00  0e+00     0.80  2670235.69    0.00  0e+00
  2147483648     536870912     float     sum    11850  181.22    0.00  0e+00     0.80  2690037.26    0.00  0e+00
  2147483648     536870912     float     sum    11910  180.30    0.00  0e+00     0.71  3045471.32    0.00  0e+00
  2147483648     536870912     float     sum    11914  180.26    0.00  0e+00     0.70  3088526.91    0.00  0e+00
  4294967296    1073741824     float     sum    23910  179.63    0.00  0e+00     0.70  6145940.07    0.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
  4294967296    1073741824     float     sum    24052  178.57    0.00  0e+00     0.73  5844367.59    0.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
  4294967296    1073741824     float     sum    24076  178.39    0.00  0e+00     0.72  5973861.27    0.00  0e+00
  4294967296    1073741824     float     sum    24080  178.37    0.00  0e+00     0.70  6167119.88    0.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
flx42 commented 2 years ago

> Sorry, where/when would I set MELLANOX_VISIBLE_DEVICES=1,3,6,8? Thanks.

Just export MELLANOX_VISIBLE_DEVICES=1,3,6,8 before the srun --container-image command; but again, your enroot config needs to have ENROOT_RESTRICT_DEV y.
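
For reference, a minimal sketch of what that submission could look like (the device indices come from this thread, and the final python command is a placeholder, not a verified setup):

# Hedged sketch: expose the chosen Mellanox HCAs to the container.
# Only takes effect if the enroot config sets ENROOT_RESTRICT_DEV y.
export MELLANOX_VISIBLE_DEVICES=1,3,6,8

srun \
        --container-image $container_path \
        -N1 \
        --gpus-per-node=$gpus_per_node \
        python train.py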

> But, using pmi2 the all-reduce test seems to work

The ranks couldn't find each other, so each rank believes it is global rank 0; you are not supposed to see the output duplicated 8 times like this. Did you compile the NCCL tests with MPI=1?

andrew-johnson-melb commented 2 years ago

Hey,

Ah right, yes: the make step failed with the MPI=1 flag, so I removed it.


root@hai-a100-3:/home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8# make MPI=1
make -C src build
make[1]: Entering directory '/home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8/src'
Compiling  all_reduce.cu                       > ../build/all_reduce.o
In file included from all_reduce.cu:8:
common.h:15:10: fatal error: mpi.h: No such file or directory
   15 | #include "mpi.h"
      |          ^~~~~~~
compilation terminated.
make[1]: *** [Makefile:84: ../build/all_reduce.o] Error 1
make[1]: Leaving directory '/home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8/src'
make: *** [Makefile:17: src.build] Error 2

I tried updating MPI_HOME to /usr/lib/openmpi (as indicated by mpicc -showme), but it still did not work.

Also, one possibly related issue:

Running srun bash -c "ulimit -l" I get 64, so it seems the maximum size that can be locked in memory is quite low; locally it is unlimited.

flx42 commented 2 years ago

You are using the TF 21.05 container, right? In that case OpenMPI should be in /usr/local/mpi, so try with make MPI=1 MPI_HOME=/usr/local/mpi.
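
In case it helps later readers, a sketch of that build under the stated assumption that OpenMPI lives in /usr/local/mpi inside the container (the source directory is the one from the error log above):

# Hypothetical build of the NCCL tests with MPI support enabled, so all
# ranks join a single NCCL communicator instead of 8 independent ones.
cd /home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8
make MPI=1 MPI_HOME=/usr/local/mpi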

flx42 commented 2 years ago

> Running srun bash -c "ulimit -l" I get 64, so it seems the maximum size that can be locked in memory is quite low; locally it is unlimited.

You could try running with just enroot after ssh'ing to the node. You won't be able to use PMI2 or PMIx support in that case, but for a single-node run, mpirun should be fine.
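
A rough sketch of that workflow, assuming the container image was already imported and the tests were rebuilt with MPI=1 (the container name and paths below are placeholders):

# Hypothetical single-node check with plain enroot, no Slurm involved.
enroot create --name tf2105 nvidia+tensorflow+21.05-tf2-py3.sqsh
enroot start tf2105 \
        mpirun --allow-run-as-root -np 8 \
        ./nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

With MPI support compiled in, the expected output is a single table aggregated across the 8 ranks rather than 8 interleaved copies like the log above.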

andrew-johnson-melb commented 2 years ago

Hey, updating the resource limit actually fixed that issue; now ulimit -l is unlimited. Thanks a lot for your help.
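
For anyone hitting the same wall: one common way to lift the memlock limit, under the assumption that slurmd is managed by systemd (the right fix depends on how your site propagates resource limits), is a drop-in override on each compute node:

# Hypothetical systemd drop-in raising the locked-memory limit for Slurm jobs.
mkdir -p /etc/systemd/system/slurmd.service.d
printf '[Service]\nLimitMEMLOCK=infinity\n' > /etc/systemd/system/slurmd.service.d/memlock.conf
systemctl daemon-reload && systemctl restart slurmd

# Verify from inside a job: should now print "unlimited".
srun bash -c "ulimit -l"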

flx42 commented 2 years ago

Glad to know it's solved!