Closed andrew-johnson-melb closed 2 years ago
This might be a bit tricky to debug, but we can try :)
What's the version of MLNX OFED on the system? (kernel side). A command like this one might help:
$ dpkg -l | grep mlnx-ofed-kernel
ii mlnx-ofed-kernel-dkms 5.3-OFED.5.3.1.0.0.1 all DKMS support for mlnx-ofed kernel modules
ii mlnx-ofed-kernel-utils 5.3-OFED.5.3.1.0.0.1 amd64 Userspace tools to restart and tune mlnx-ofed kernel modules
NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP [2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE [5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>
I'm not sure if that's the root cause, but it might be a problem to have the mlx5_4 and mlx5_10 devices exposed here, particularly in RoCE mode.
If you have ENROOT_RESTRICT_DEV y in enroot.conf, you can exclude those devices by running your container with MELLANOX_VISIBLE_DEVICES=1,3,6,8. Verify that the NCCL INFO NET/IB line only shows IB devices after this change.
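One way to check that line without reading it by eye is to parse it. The sketch below is only an illustration: the helper names (parse_net_ib, roce_devices) are made up for this example, and the regular expression assumes the entry format seen in the log line quoted above.

```python
import re

# Matches entries such as "[4]mlx5_4:1/RoCE" in an NCCL "NET/IB : Using" line.
# The pattern is an assumption based on the single log line quoted in this thread.
_ENTRY = re.compile(r"\[(\d+)\](\w+):(\d+)/(\S+)")


def parse_net_ib(line):
    """Return (index, device, port, link) tuples from an NCCL NET/IB line."""
    # Drop the part after ';', which describes the out-of-band interface.
    devices_part = line.split(";")[0]
    return [(int(i), dev, int(port), link)
            for i, dev, port, link in _ENTRY.findall(devices_part)]


def roce_devices(line):
    """Names of the devices that NCCL reports in RoCE mode."""
    return [dev for _, dev, _, link in parse_net_ib(line) if link.startswith("RoCE")]


line = ("NCCL INFO NET/IB : Using [0]mlx5_1:1/IB/SHARP [1]mlx5_3:1/IB/SHARP "
        "[2]mlx5_6:1/IB/SHARP [3]mlx5_8:1/IB/SHARP [4]mlx5_4:1/RoCE "
        "[5]mlx5_10:1/RoCE ; OOB enp226s0:10.16.2.21<0>")
print(roce_devices(line))  # ['mlx5_4', 'mlx5_10']
```

After restricting the devices, the same call on the new log line should return an empty list.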
Can you install https://github.com/NVIDIA/nccl-tests.git inside the Dockerfile and try with the all_reduce_perf binary?
Here is my Dockerfile recipe:
RUN cd /usr/local/src && \
    NCCL_TESTS_VERSION="1f8f5416863a3082975b10eaa05fecee6fe870c8" && \
    curl --proto '=https' -fSsL https://github.com/NVIDIA/nccl-tests/archive/${NCCL_TESTS_VERSION}.tar.gz | tar xz && \
    cd nccl-tests-${NCCL_TESTS_VERSION} && \
    make MPI=1 && \
    install -m 755 build/all_* build/broadcast_* build/reduce_* /usr/local/bin
Then to launch it:
$ srun --container-image=<image> --mpi=pmix --ntasks-per-node=8 all_reduce_perf -b 4 -e 4G -f 2 -c 1 -n 100
I suspect this will also fail, and that would be a good step since TF would be out of the picture.
Hey Felix, thanks for the fast response!
ii mlnx-ofed-kernel-dkms 5.1-OFED.5.1.2.5.8.1 all DKMS support for mlnx-ofed kernel modules
ii mlnx-ofed-kernel-utils 5.1-OFED.5.1.2.5.8.1 amd64 Userspace tools to restart and tune mlnx-ofed kernel modules
Ah, this may be relevant: we currently don't have the pmix plugin.
srun: MPI types are...
srun: none
srun: pmi2
srun: cray_shasta
But, using pmi2 the all reduce test seems to work
Sorry, where/when would I set MELLANOX_VISIBLE_DEVICES=1,3,6,8? Thanks.
andrew_johnson@mgmt01:~/git/ct_brain$ srun --nodelist hai-a100-1 --container-image=/mnt/shared/sqsh_files/ctb_nv_test.sqsh --mpi=pmi2 --ntasks-per-node=8 all_reduce_perf -b 4 -e 4G -f 2 -c 1 -n 10
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# Rank 0 Pid 3022974 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022976 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022975 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022969 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022970 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022973 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022971 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022972 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
4 1 float sum 5.08 0.00 0.00 0e+00 0.56 0.01 0.00 0e+00
hai-a100-1: Test CUDA failure common.cu:763 'out of memory'
.. hai-a100-1 pid 3022973: Test failure common.cu:1007
.. hai-a100-1 pid 3022973: Test failure common.cu:925
8 2 float sum 5.06 0.00 0.00 0e+00 0.68 0.01 0.00 0e+00
16 4 float sum 5.12 0.00 0.00 0e+00 0.69 0.02 0.00 0e+00
32 8 float sum 4.85 0.01 0.00 0e+00 0.62 0.05 0.00 0e+00
64 16 float sum 4.91 0.01 0.00 0e+00 0.65 0.10 0.00 0e+00
128 32 float sum 4.99 0.03 0.00 0e+00 0.60 0.21 0.00 0e+00
256 64 float sum 4.81 0.05 0.00 0e+00 0.68 0.38 0.00 0e+00
512 128 float sum 4.89 0.10 0.00 0e+00 0.65 0.79 0.00 0e+00
1024 256 float sum 5.14 0.20 0.00 0e+00 0.70 1.47 0.00 0e+00
2048 512 float sum 5.10 0.40 0.00 0e+00 0.65 3.14 0.00 0e+00
4096 1024 float sum 5.01 0.82 0.00 0e+00 0.67 6.10 0.00 0e+00
8192 2048 float sum 4.85 1.69 0.00 0e+00 0.55 14.77 0.00 0e+00
16384 4096 float sum 4.53 3.62 0.00 0e+00 0.57 28.89 0.00 0e+00
32768 8192 float sum 4.45 7.37 0.00 0e+00 0.53 61.30 0.00 0e+00
65536 16384 float sum 4.49 14.59 0.00 0e+00 0.53 122.66 0.00 0e+00
131072 32768 float sum 4.45 29.48 0.00 0e+00 0.54 242.58 0.00 0e+00
262144 65536 float sum 4.46 58.84 0.00 0e+00 0.50 529.11 0.00 0e+00
524288 131072 float sum 4.41 118.76 0.00 0e+00 0.50 1056.71 0.00 0e+00
1048576 262144 float sum 6.62 158.41 0.00 0e+00 0.50 2105.79 0.00 0e+00
2097152 524288 float sum 7.42 282.76 0.00 0e+00 0.50 4226.94 0.00 0e+00
4194304 1048576 float sum 10.47 400.67 0.00 0e+00 0.50 8436.87 0.00 0e+00
8388608 2097152 float sum 15.52 540.55 0.00 0e+00 0.50 16755.43 0.00 0e+00
hai-a100-1: Test CUDA failure common.cu:762 'out of memory'
.. hai-a100-1 pid 3022972: Test failure common.cu:1007
.. hai-a100-1 pid 3022972: Test failure common.cu:925
hai-a100-1: Test CUDA failure common.cu:764 'out of memory'
.. hai-a100-1 pid 3022971: Test failure common.cu:1007
.. hai-a100-1 pid 3022971: Test failure common.cu:925
16777216 4194304 float sum 25.78 650.77 0.00 0e+00 0.50 33693.25 0.00 0e+00
33554432 8388608 float sum 47.40 707.97 0.00 0e+00 0.50 67332.41 0.00 0e+00
67108864 16777216 float sum 89.94 746.18 0.00 0e+00 0.49 137735.49 0.00 0e+00
134217728 33554432 float sum 173.5 773.45 0.00 0e+00 0.54 247118.97 0.00 0e+00
268435456 67108864 float sum 341.0 787.12 0.00 0e+00 0.51 523909.39 0.00 0e+00
536870912 134217728 float sum 682.8 786.28 0.00 0e+00 0.49 1092889.24 0.00 0e+00
1073741824 268435456 float sum 1368.5 784.62 0.00 0e+00 0.49 2197903.56 0.00 0e+00
srun: error: hai-a100-1: tasks 2-4: Exited with exit code 2
2147483648 536870912 float sum 3031.1 708.49 0.00 0e+00 0.48 4445951.82 0.00 0e+00
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
4 1 float sum 28.50 0.00 0.00 0e+00 0.70 0.01 0.00 0e+00
8 2 float sum 30.09 0.00 0.00 0e+00 0.68 0.01 0.00 0e+00
16 4 float sum 30.09 0.00 0.00 0e+00 0.69 0.02 0.00 0e+00
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
32 8 float sum 28.73 0.00 0.00 0e+00 0.79 0.04 0.00 0e+00
64 16 float sum 51.91 0.00 0.00 0e+00 0.79 0.08 0.00 0e+00
128 32 float sum 31.73 0.00 0.00 0e+00 0.70 0.18 0.00 0e+00
4 1 float sum 33.91 0.00 0.00 0e+00 0.70 0.01 0.00 0e+00
256 64 float sum 30.22 0.01 0.00 0e+00 0.71 0.36 0.00 0e+00
8 2 float sum 32.45 0.00 0.00 0e+00 0.69 0.01 0.00 0e+00
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
512 128 float sum 53.34 0.01 0.00 0e+00 0.70 0.73 0.00 0e+00
16 4 float sum 55.57 0.00 0.00 0e+00 0.69 0.02 0.00 0e+00
1024 256 float sum 74.91 0.01 0.00 0e+00 0.70 1.46 0.00 0e+00
32 8 float sum 77.15 0.00 0.00 0e+00 0.69 0.05 0.00 0e+00
4 1 float sum 52.78 0.00 0.00 0e+00 0.79 0.01 0.00 0e+00
2048 512 float sum 53.42 0.04 0.00 0e+00 0.70 2.92 0.00 0e+00
64 16 float sum 55.87 0.00 0.00 0e+00 0.69 0.09 0.00 0e+00
4 1 float sum 31.59 0.00 0.00 0e+00 0.70 0.01 0.00 0e+00
8 2 float sum 31.14 0.00 0.00 0e+00 0.71 0.01 0.00 0e+00
4096 1024 float sum 31.80 0.13 0.00 0e+00 0.70 5.86 0.00 0e+00
128 32 float sum 34.26 0.00 0.00 0e+00 0.69 0.19 0.00 0e+00
8 2 float sum 31.62 0.00 0.00 0e+00 0.70 0.01 0.00 0e+00
16 4 float sum 31.27 0.00 0.00 0e+00 0.69 0.02 0.00 0e+00
8192 2048 float sum 31.86 0.26 0.00 0e+00 0.70 11.70 0.00 0e+00
256 64 float sum 34.36 0.01 0.00 0e+00 0.69 0.37 0.00 0e+00
16 4 float sum 31.67 0.00 0.00 0e+00 0.71 0.02 0.00 0e+00
32 8 float sum 31.30 0.00 0.00 0e+00 0.69 0.05 0.00 0e+00
16384 4096 float sum 31.94 0.51 0.00 0e+00 0.71 23.24 0.00 0e+00
512 128 float sum 34.46 0.01 0.00 0e+00 0.70 0.73 0.00 0e+00
32 8 float sum 31.63 0.00 0.00 0e+00 0.70 0.05 0.00 0e+00
64 16 float sum 31.45 0.00 0.00 0e+00 0.69 0.09 0.00 0e+00
32768 8192 float sum 32.10 1.02 0.00 0e+00 0.70 46.97 0.00 0e+00
1024 256 float sum 34.68 0.03 0.00 0e+00 0.69 1.49 0.00 0e+00
64 16 float sum 31.67 0.00 0.00 0e+00 0.72 0.09 0.00 0e+00
128 32 float sum 31.51 0.00 0.00 0e+00 0.69 0.19 0.00 0e+00
65536 16384 float sum 32.15 2.04 0.00 0e+00 0.70 93.34 0.00 0e+00
2048 512 float sum 34.97 0.06 0.00 0e+00 0.69 2.98 0.00 0e+00
128 32 float sum 31.77 0.00 0.00 0e+00 0.70 0.18 0.00 0e+00
256 64 float sum 31.55 0.01 0.00 0e+00 0.68 0.37 0.00 0e+00
131072 32768 float sum 32.29 4.06 0.00 0e+00 0.70 185.96 0.00 0e+00
4096 1024 float sum 35.16 0.12 0.00 0e+00 0.69 5.92 0.00 0e+00
256 64 float sum 31.74 0.01 0.00 0e+00 0.70 0.36 0.00 0e+00
512 128 float sum 31.56 0.02 0.00 0e+00 0.69 0.75 0.00 0e+00
262144 65536 float sum 32.43 8.08 0.00 0e+00 0.70 375.77 0.00 0e+00
8192 2048 float sum 35.31 0.23 0.00 0e+00 0.69 11.89 0.00 0e+00
512 128 float sum 31.83 0.02 0.00 0e+00 0.76 0.67 0.00 0e+00
1024 256 float sum 31.64 0.03 0.00 0e+00 0.69 1.49 0.00 0e+00
524288 131072 float sum 32.56 16.10 0.00 0e+00 0.70 744.16 0.00 0e+00
16384 4096 float sum 35.56 0.46 0.00 0e+00 0.70 23.56 0.00 0e+00
1024 256 float sum 31.86 0.03 0.00 0e+00 0.70 1.46 0.00 0e+00
2048 512 float sum 31.29 0.07 0.00 0e+00 0.68 3.01 0.00 0e+00
1048576 262144 float sum 34.51 30.38 0.00 0e+00 0.70 1487.68 0.00 0e+00
32768 8192 float sum 37.60 0.87 0.00 0e+00 0.69 47.28 0.00 0e+00
2048 512 float sum 31.93 0.06 0.00 0e+00 0.70 2.93 0.00 0e+00
4096 1024 float sum 32.20 0.13 0.00 0e+00 0.70 5.87 0.00 0e+00
2097152 524288 float sum 35.37 59.29 0.00 0e+00 0.70 2999.27 0.00 0e+00
65536 16384 float sum 38.56 1.70 0.00 0e+00 0.69 94.57 0.00 0e+00
4096 1024 float sum 32.04 0.13 0.00 0e+00 0.70 5.85 0.00 0e+00
8192 2048 float sum 32.34 0.25 0.00 0e+00 0.69 11.96 0.00 0e+00
4194304 1048576 float sum 38.50 108.93 0.00 0e+00 0.70 6002.84 0.00 0e+00
131072 32768 float sum 41.79 3.14 0.00 0e+00 0.69 189.02 0.00 0e+00
8192 2048 float sum 32.32 0.25 0.00 0e+00 0.70 11.64 0.00 0e+00
16384 4096 float sum 32.83 0.50 0.00 0e+00 0.69 23.79 0.00 0e+00
8388608 2097152 float sum 43.36 193.45 0.00 0e+00 0.70 12043.60 0.00 0e+00
262144 65536 float sum 46.80 5.60 0.00 0e+00 0.69 381.02 0.00 0e+00
16384 4096 float sum 32.74 0.50 0.00 0e+00 0.71 23.11 0.00 0e+00
32768 8192 float sum 33.62 0.97 0.00 0e+00 0.69 47.62 0.00 0e+00
524288 131072 float sum 54.32 9.65 0.00 0e+00 0.69 760.49 0.00 0e+00
16777216 4194304 float sum 88.02 190.61 0.00 0e+00 0.70 24049.22 0.00 0e+00
32768 8192 float sum 33.60 0.98 0.00 0e+00 0.70 46.82 0.00 0e+00
65536 16384 float sum 33.88 1.93 0.00 0e+00 0.70 93.74 0.00 0e+00
1048576 262144 float sum 35.11 29.86 0.00 0e+00 0.80 1318.12 0.00 0e+00
33554432 8388608 float sum 141.2 237.71 0.00 0e+00 0.70 48008.97 0.00 0e+00
65536 16384 float sum 32.65 2.01 0.00 0e+00 0.71 92.55 0.00 0e+00
131072 32768 float sum 35.45 3.70 0.00 0e+00 0.69 189.11 0.00 0e+00
2097152 524288 float sum 38.59 54.35 0.00 0e+00 0.79 2648.92 0.00 0e+00
131072 32768 float sum 54.13 2.42 0.00 0e+00 0.70 186.57 0.00 0e+00
67108864 16777216 float sum 242.2 277.10 0.00 0e+00 0.78 86105.45 0.00 0e+00
262144 65536 float sum 38.86 6.75 0.00 0e+00 0.69 380.24 0.00 0e+00
4194304 1048576 float sum 39.40 106.45 0.00 0e+00 0.77 5455.30 0.00 0e+00
262144 65536 float sum 43.79 5.99 0.00 0e+00 0.70 373.94 0.00 0e+00
524288 131072 float sum 55.07 9.52 0.00 0e+00 0.69 760.61 0.00 0e+00
8388608 2097152 float sum 64.96 129.13 0.00 0e+00 0.80 10542.29 0.00 0e+00
524288 131072 float sum 54.96 9.54 0.00 0e+00 0.70 748.00 0.00 0e+00
134217728 33554432 float sum 469.6 285.80 0.00 0e+00 0.78 171067.34 0.00 0e+00
1048576 262144 float sum 47.17 22.23 0.00 0e+00 0.69 1525.62 0.00 0e+00
16777216 4194304 float sum 106.8 157.03 0.00 0e+00 0.79 21215.50 0.00 0e+00
1048576 262144 float sum 37.19 28.20 0.00 0e+00 0.70 1501.33 0.00 0e+00
2097152 524288 float sum 58.30 35.97 0.00 0e+00 0.69 3060.20 0.00 0e+00
33554432 8388608 float sum 199.9 167.89 0.00 0e+00 0.79 42479.34 0.00 0e+00
2097152 524288 float sum 57.25 36.63 0.00 0e+00 0.70 2987.27 0.00 0e+00
4194304 1048576 float sum 62.54 67.06 0.00 0e+00 0.69 6072.45 0.00 0e+00
4194304 1048576 float sum 73.25 57.26 0.00 0e+00 0.70 6005.42 0.00 0e+00
67108864 16777216 float sum 317.7 211.25 0.00 0e+00 0.69 97004.76 0.00 0e+00
8388608 2097152 float sum 50.53 166.00 0.00 0e+00 0.69 12222.95 0.00 0e+00
268435456 67108864 float sum 992.2 270.55 0.00 0e+00 0.70 383134.40 0.00 0e+00
8388608 2097152 float sum 56.47 148.56 0.00 0e+00 0.79 10605.07 0.00 0e+00
16777216 4194304 float sum 149.5 112.22 0.00 0e+00 0.72 23312.70 0.00 0e+00
134217728 33554432 float sum 638.1 210.35 0.00 0e+00 0.78 171198.27 0.00 0e+00
16777216 4194304 float sum 160.0 104.83 0.00 0e+00 0.70 23973.27 0.00 0e+00
33554432 8388608 float sum 225.6 148.71 0.00 0e+00 0.79 42469.13 0.00 0e+00
33554432 8388608 float sum 255.1 131.52 0.00 0e+00 0.79 42323.42 0.00 0e+00
67108864 16777216 float sum 463.2 144.87 0.00 0e+00 0.78 85664.69 0.00 0e+00
67108864 16777216 float sum 485.8 138.15 0.00 0e+00 0.80 84401.99 0.00 0e+00
268435456 67108864 float sum 1524.6 176.07 0.00 0e+00 0.78 342838.20 0.00 0e+00
134217728 33554432 float sum 747.2 179.62 0.00 0e+00 0.79 170936.62 0.00 0e+00
536870912 134217728 float sum 2759.9 194.52 0.00 0e+00 0.70 770569.11 0.00 0e+00
134217728 33554432 float sum 747.7 179.50 0.00 0e+00 0.81 166310.70 0.00 0e+00
4294967296 1073741824 float sum 9375.2 458.12 0.00 0e+00 0.50 8517366.63 0.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
268435456 67108864 float sum 1676.5 160.11 0.00 0e+00 0.79 339276.36 0.00 0e+00
268435456 67108864 float sum 1715.2 156.50 0.00 0e+00 0.76 353185.96 0.00 0e+00
536870912 134217728 float sum 3070.5 174.85 0.00 0e+00 0.79 681057.62 0.00 0e+00
536870912 134217728 float sum 2909.6 184.51 0.00 0e+00 0.68 784669.56 0.00 0e+00
536870912 134217728 float sum 2978.0 180.28 0.00 0e+00 0.69 773466.61 0.00 0e+00
1073741824 268435456 float sum 5435.8 197.53 0.00 0e+00 0.80 1350261.97 0.00 0e+00
1073741824 268435456 float sum 5673.4 189.26 0.00 0e+00 0.78 1368538.76 0.00 0e+00
1073741824 268435456 float sum 5891.1 182.27 0.00 0e+00 0.69 1550507.32 0.00 0e+00
1073741824 268435456 float sum 5898.3 182.04 0.00 0e+00 0.69 1552300.57 0.00 0e+00
2147483648 536870912 float sum 11819 181.69 0.00 0e+00 0.80 2670235.69 0.00 0e+00
2147483648 536870912 float sum 11850 181.22 0.00 0e+00 0.80 2690037.26 0.00 0e+00
2147483648 536870912 float sum 11910 180.30 0.00 0e+00 0.71 3045471.32 0.00 0e+00
2147483648 536870912 float sum 11914 180.26 0.00 0e+00 0.70 3088526.91 0.00 0e+00
4294967296 1073741824 float sum 23910 179.63 0.00 0e+00 0.70 6145940.07 0.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
4294967296 1073741824 float sum 24052 178.57 0.00 0e+00 0.73 5844367.59 0.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
4294967296 1073741824 float sum 24076 178.39 0.00 0e+00 0.72 5973861.27 0.00 0e+00
4294967296 1073741824 float sum 24080 178.37 0.00 0e+00 0.70 6167119.88 0.00 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
Sorry, where/when would I set MELLANOX_VISIBLE_DEVICES=1,3,6,8? Thanks.
Just export MELLANOX_VISIBLE_DEVICES=1,3,6,8 before the srun --container-image, but again your enroot config needs to have ENROOT_RESTRICT_DEV y.
But, using pmi2 the all reduce test seems to work
The ranks couldn't find each other, so each rank believes it is global rank 0. You are not supposed to see the output duplicated 8 times like this. Did you compile the NCCL tests with MPI=1?
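The symptom described here can be checked mechanically. The sketch below counts how many processes claim rank 0 in the nccl-tests header. It also shows why the busbw column reads 0.00 in the output above: the nccl-tests performance notes define all-reduce bus bandwidth as algbw * 2*(n-1)/n, which is zero when a process believes it is the only rank. The function names are hypothetical, and the regular expression assumes the "Rank ... Pid ... on ..." line format shown in the log.

```python
import re

# Matches nccl-tests header lines such as
# "# Rank 0 Pid 3022974 on hai-a100-1 device 0 [0x07] A100-SXM-80GB".
_RANK_LINE = re.compile(r"^#\s*Rank\s+(\d+)\s+Pid\s+(\d+)\s+on\s+(\S+)")


def ranks_by_pid(output):
    """Map each reported PID to the global rank it claims."""
    ranks = {}
    for line in output.splitlines():
        match = _RANK_LINE.match(line.strip())
        if match:
            ranks[int(match.group(2))] = int(match.group(1))
    return ranks


def looks_disconnected(output):
    """True when more than one process claims to be global rank 0."""
    return list(ranks_by_pid(output).values()).count(0) > 1


def allreduce_bus_factor(n_ranks):
    """busbw / algbw ratio for all-reduce: 2*(n-1)/n."""
    return 2 * (n_ranks - 1) / n_ranks


sample = """\
# Rank 0 Pid 3022974 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022976 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
# Rank 0 Pid 3022975 on hai-a100-1 device 0 [0x07] A100-SXM-80GB
"""
print(looks_disconnected(sample))   # True
print(allreduce_bus_factor(1))      # 0.0
print(allreduce_bus_factor(8))      # 1.75
```

In a healthy 8-task run, the header would list ranks 0 through 7 once each, and looks_disconnected would return False.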
Hey,
Ah right, yes the make step failed with the MPI=1 flag so I removed it.
root@hai-a100-3:/home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8# make MPI=1
make -C src build
make[1]: Entering directory '/home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8/src'
Compiling all_reduce.cu > ../build/all_reduce.o
In file included from all_reduce.cu:8:
common.h:15:10: fatal error: mpi.h: No such file or directory
15 | #include "mpi.h"
| ^~~~~~~
compilation terminated.
make[1]: *** [Makefile:84: ../build/all_reduce.o] Error 1
make[1]: Leaving directory '/home/python/app/nccl-tests-1f8f5416863a3082975b10eaa05fecee6fe870c8/src'
make: *** [Makefile:17: src.build] Error 2
I tried updating MPI_HOME to /usr/lib/openmpi (as indicated by mpicc -showme) but it still did not work.
Also, one possibly related issue: running srun bash -c "ulimit -l" I get 64. So it seems the max size that can be locked in memory is quite low. Locally it's unlimited.
You are using the TF 21.05 container, right? In this case OpenMPI should be in /usr/local/mpi, so try with make MPI=1 MPI_HOME=/usr/local/mpi
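If it is unclear which prefix to pass, a small probe can help. This is a sketch under assumptions: as far as I can tell, the nccl-tests Makefile looks for headers under $(MPI_HOME)/include, so the probe checks for include/mpi.h. The candidate list is illustrative only and find_mpi_home is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import os


def find_mpi_home(candidates):
    """Return the first directory that contains include/mpi.h, or None."""
    for root in candidates:
        if os.path.isfile(os.path.join(root, "include", "mpi.h")):
            return root
    return None


# Candidate prefixes are examples only; adjust them for the image in use.
candidates = [
    "/usr/local/mpi",
    "/usr/lib/openmpi",
    "/usr/lib/x86_64-linux-gnu/openmpi",
]
home = find_mpi_home(candidates)
if home:
    print(f"make MPI=1 MPI_HOME={home}")
else:
    print("mpi.h not found under the candidate prefixes")
```

Run it inside the container, since the host and the image may have different MPI installations.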
srun bash -c "ulimit -l" I get 64. So it seems the max size that can be locked in mem is quite low. locally it's unlimited.
You could try running with just enroot after ssh'ing to the node. You won't be able to use PMI2 or PMIx support in this case, but for a single node run, mpirun should be fine.
Hey, updating the resource limit actually fixed that issue. Now, ulimit -l is unlimited. Thanks a lot for your help.
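For anyone hitting the same thing: the locked-memory limit can also be read from inside the job itself, which is where it matters, since RDMA memory registration pins pages and counts against this limit. The sketch below uses Python's standard resource module. The helper names are made up for illustration, and the division by 1024 assumes the usual convention that ulimit -l prints KiB while getrlimit returns bytes.

```python
import resource


def memlock_limit_kib():
    """Soft RLIMIT_MEMLOCK in KiB (the unit `ulimit -l` prints), or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft // 1024


def describe(limit_kib):
    """Format the limit the way `ulimit -l` would describe it."""
    return "unlimited" if limit_kib is None else f"{limit_kib} KiB"


print("max locked memory:", describe(memlock_limit_kib()))
```

Launching this through srun inside the container, rather than from a login shell, shows the limit the training processes actually inherit.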
Glad to know it's solved!
This is really a combined problem with slurm + tensorflow + pyxis. I'm yet to hear anything from the TF team, so I was hoping you might have an idea @flx42 (any suggestions would be very much appreciated).
Something weird happens with NCCL when inside an enroot container submitted via slurm with pyxis. Essentially, the tf.distribute.MirroredStrategy strategy fails, due to NCCL errors. I've provided all the information below.
Again, I understand this is not entirely appropriate for this repo. But any help would be great. This issue is at the boundary between a few things.
System information
The distributed training run fails when launched via slurm (using srun).
The code is run inside an enroot container. Because of slurm, this container has a number of slurm-specific environment variables set.
So, using MirroredStrategy to distribute training fails due to NCCL errors on a simple example. Note that a number of other distributed options work, as highlighted in the code.
NOTE, this code works fine outside of the slurm environment (in the exact same container). The slurm environment variables seem to be creating an issue with NCCL.
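To compare the two environments, it can help to dump the slurm-related variables in each and diff the results. A minimal sketch, assuming the variables of interest share the SLURM_ prefix; vars_with_prefix is a hypothetical helper name.

```python
import os


def vars_with_prefix(prefix="SLURM_", environ=None):
    """Return the environment variables whose names start with `prefix`, sorted by name."""
    env = os.environ if environ is None else environ
    return {key: env[key] for key in sorted(env) if key.startswith(prefix)}


# Print one NAME=value pair per line so two runs can be compared with diff.
for name, value in vars_with_prefix().items():
    print(f"{name}={value}")
```

Running it once under srun and once in the same container started by hand gives two listings that can be compared line by line.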
The srun command looks like
Error
Error when using 21.07 container and tf-nightly
Function call stack: train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function -> train_function