Closed rgandikota closed 6 months ago
@minseokl Could you help investigate this issue with running NVIDIA's MLPerf Training workload?
Hi @rgandikota, which configuration are you using? Is it this one? If so, could you turn off CUDA graph and overlap to see whether there is a more specific error message? CUDA graph and overlap can be turned off here.
Hi @shijieliu. We ran the same training after turning off both cuda graphs and overlap. The error is still the same. Please find the full stack trace below.
Wanted to highlight a Warning we are seeing. Not sure if this can cause issues.
[1701758842.513479] [735c14efcde7:878 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1701758842.525161] [735c14efcde7:878 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
HugeCTR Version: 23.8
=====================================================ModelFit=====================================================
[HCTR][06:48:08.684][INFO][RK0][main]: Use non-epoch mode with number of iterations: 512110
[HCTR][06:48:08.684][INFO][RK0][main]: Training batchsize: 8192, evaluation batchsize: 16384
[HCTR][06:48:08.684][INFO][RK0][main]: Evaluation interval: 25605, snapshot interval: 2000000
[HCTR][06:48:08.684][INFO][RK0][main]: Dense network trainable: True
[HCTR][06:48:08.684][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: False
[HCTR][06:48:08.684][INFO][RK0][main]: lr: 0.005000, warmup_steps: 0, end_lr: 0.000000
[HCTR][06:48:08.684][INFO][RK0][main]: decay_start: 0, decay_steps: 0, decay_power: 2.000000
[HCTR][06:48:08.684][INFO][RK0][main]: Training source file: /data/train_data.bin
[HCTR][06:48:08.684][INFO][RK0][main]: Evaluation source file: /data_val/val_data.bin
:::MLLOG {"namespace": "", "time_ms": 1701758888684, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/dlrm/mlperf_logger/callbacks.py", "lineno": 50}}
:::MLLOG {"namespace": "", "time_ms": 1701758888685, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/workspace/dlrm/mlperf_logger/callbacks.py", "lineno": 50}}
:::MLLOG {"namespace": "", "time_ms": 1701758888685, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "/workspace/dlrm/mlperf_logger/callbacks.py", "lineno": 51, "epoch_num": 0}}
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
what(): Runtime error: CUDNN_STATUS_MAPPING_ERROR cudnnSetStream(cudnn_handle_, current_stream) (set_stream @ /workspace/dlrm/hugectr/HugeCTR/include/gpu_resource.hpp:80)
[735c14efcde7:00878] *** Process received signal ***
[735c14efcde7:00878] Signal: Aborted (6)
[735c14efcde7:00878] Signal code: (-6)
[735c14efcde7:00878] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ffba5754520]
[735c14efcde7:00878] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ffba57a8a7c]
[735c14efcde7:00878] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ffba5754476]
[735c14efcde7:00878] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ffba573a7f3]
[735c14efcde7:00878] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7ffb9c807b9e]
[735c14efcde7:00878] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7ffb9c81320c]
[735c14efcde7:00878] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7ffb9c8121e9]
[735c14efcde7:00878] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7ffb9c812959]
[735c14efcde7:00878] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7ffba29f6884]
[735c14efcde7:00878] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7ffba29f6f41]
[735c14efcde7:00878] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7ffb9c8134cb]
[735c14efcde7:00878] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR11GPUResource10set_streamERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x345)[0x7ffaeec42895]
[735c14efcde7:00878] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR25StreamContextScheduleable3runESt10shared_ptrINS_11GPUResourceEEb+0x46d)[0x7ffaeec41f8d]
[735c14efcde7:00878] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR8Pipeline3runEv+0x10b)[0x7ffaeec40beb]
[735c14efcde7:00878] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xe0383c)[0x7ffaeece483c]
[735c14efcde7:00878] [15] /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e)[0x7ffb9c738c0e]
[735c14efcde7:00878] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7ffba57a6b43]
[735c14efcde7:00878] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7ffba5838a00]
[735c14efcde7:00878] *** End of error message ***
It would help if you could make the UCP version compatible.
What command-line arguments are you passing to the training script?
@shijieliu - Here are the details (both runs were on a single node).
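For anyone scanning longer logs for this, the two versions can be pulled out of the warning line and compared. This is only a minimal sketch that assumes the exact warning format shown above; `parse_ucp_warning` is a hypothetical helper, not part of UCX or HugeCTR.

```python
import re

# Pattern for the UCX warning format seen in the log above (an assumption:
# other UCX releases may word the message differently).
UCP_WARN = re.compile(
    r"UCP version is incompatible, required: (\d+)\.(\d+), actual: (\d+)\.(\d+)"
)


def parse_ucp_warning(line):
    """Return ((req_major, req_minor), (act_major, act_minor)), or None if no match."""
    m = UCP_WARN.search(line)
    if m is None:
        return None
    req_major, req_minor, act_major, act_minor = map(int, m.groups())
    return (req_major, req_minor), (act_major, act_minor)


line = (
    "[1701758842.513479] [735c14efcde7:878 :0] ucp_context.c:1774 UCX WARN "
    "UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)"
)
required, actual = parse_ucp_warning(line)
# Tuples compare element-wise, so (1, 12) < (1, 15) means the loaded library is older.
print("required", required, "actual", actual, "older than required:", actual < required)
```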
Test run 1: We set the environment as configured here
root@5795011ad9d8:/workspace/dlrm# source config_DGXH100_1x8x6912.sh
root@5795011ad9d8:/workspace/dlrm# python train.py
docker run --shm-size=1g --ulimit memlock=-1 --cap-add=sys_nice --security-opt seccomp=unconfined --runtime=nvidia --rm -it -v /mnt/weka/mlperf/data/dlrm/dataset/criteo_binary/:/data -v /mnt/weka/mlperf/data/dlrm/dataset/criteo_binary/:/data_val -it dlrm-mlperf:1
Test run 2: with default values as defined in code. code location
docker run --shm-size=1g --ulimit memlock=-1 --cap-add=sys_nice --security-opt seccomp=unconfined --runtime=nvidia --rm -it -v /dlrm/dataset/criteo_binary/:/data -v /dlrm/dataset/criteo_binary/:/data_val -it dlrm-mlperf:
root@5795011ad9d8:/workspace/dlrm# python train.py
I ran with the default values and it works well. Here is my log.txt.
I suspect the data preprocessing may be wrong, which would lead to an illegal memory access because of overflow in the embedding input. Could you use the following code to check your dataset?
import os
import struct

import numpy as np
from tqdm import tqdm

output_filename = '/data/train_data.bin'

# (num_table, hotness, vocabulary_size) for each of the 26 categorical features
dataset_info = [
    (1, 3, 40000000),
    (1, 2, 39060),
    (1, 1, 17295),
    (1, 2, 7424),
    (1, 6, 20265),
    (1, 1, 3),
    (1, 1, 7122),
    (1, 1, 1543),
    (1, 1, 63),
    (1, 7, 40000000),
    (1, 3, 3067956),
    (1, 8, 405282),
    (1, 1, 10),
    (1, 6, 2209),
    (1, 9, 11938),
    (1, 5, 155),
    (1, 1, 4),
    (1, 1, 976),
    (1, 1, 14),
    (1, 12, 40000000),
    (1, 100, 40000000),
    (1, 27, 40000000),
    (1, 10, 590152),
    (1, 3, 12973),
    (1, 1, 108),
    (1, 1, 36),
]
# Column offsets of each feature inside the flattened categorical block
offset = np.cumsum([0] + [v[1] for v in dataset_info])
num_dense_features = 13
num_label = 1
total_samples = 4195197692
n = 1024  # samples read per iteration

num_cate_features = sum([num_table * hotness for (num_table, hotness, _) in dataset_info])
# Per sample: 1 unsigned int label, 13 float dense values, then unsigned int category ids
sample_format = r"1I" + str(num_dense_features) + "f" + str(num_cate_features) + "I"
sample_size_in_bytes = 1 * 4 + num_dense_features * 4 + num_cate_features * 4

# Running minimum and maximum category id seen for each feature
min_vocabulary_size = np.asarray([np.iinfo(np.int64).max for _ in range(26)])
max_vocabulary_size = np.asarray([np.iinfo(np.int64).min for _ in range(26)])

# The file must contain exactly total_samples multi-hot samples
assert os.path.getsize(output_filename) == total_samples * sample_size_in_bytes

with open(output_filename, "rb") as file:
    # Scan the first 100000 * n samples
    for i in tqdm(range(100000)):
        samples_bytes = file.read(sample_size_in_bytes * n)
        samples = struct.unpack(sample_format * n, samples_bytes)
        samples = np.asarray(samples).reshape(n, -1)
        # Drop the label and dense columns, keep the category ids
        cate_features = samples[:, num_dense_features + num_label:].astype(np.int64)
        min_cate_features = np.asarray([np.min(cate_features[:, start: end].reshape(-1)) for start, end in zip(offset, offset[1:])])
        min_vocabulary_size = np.minimum(min_cate_features, min_vocabulary_size)
        max_cate_features = np.asarray([np.max(cate_features[:, start: end].reshape(-1)) for start, end in zip(offset, offset[1:])])
        max_vocabulary_size = np.maximum(max_cate_features, max_vocabulary_size)

print('min_vocabulary_size', min_vocabulary_size)
print('max_vocabulary_size', max_vocabulary_size)
The script prints the range of each categorical input. On my side the output is:
min_vocabulary_size [0 1 2 1 0 0 2 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2]
max_vocabulary_size [39999999 39059 17176 7421 20264 2 7085 1534 62 39999996 3067955 405281 9 2208 11937 154 3 973 13 39999999 39999999 39999999 590151 12972 107 34]
You can share your output so I can help check.
@shijieliu The script for validating our data fails on a size assertion by the looks of it. Please find the output below:
python test_preprocessing.py
Traceback (most recent call last):
  File "test_preprocessing.py", line 55, in <module>
    assert os.path.getsize(output_filename) == total_samples * sample_size_in_bytes
AssertionError
Here is the process we used to preprocess our data using NVTabular to speed up the preprocessing: https://github.com/pytorch/torchrec/tree/main/torchrec/datasets/scripts/nvt
Is the script you shared specific to the output from this method of preprocessing mentioned in this ReadMe using CPU only? https://github.com/mlcommons/training_results_v3.1/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr
Is the script you shared specific to the output from this method of preprocessing mentioned in this ReadMe using CPU only?
Yes. This ReadMe should match our training scripts. At least one difference I can see between the NVTabular path and our ReadMe is that the NVTabular path does not convert the one-hot dataset to a multi-hot dataset, which is step 1.5 in our ReadMe.
Thanks much @shijieliu. Is there any way we can download the pre-processed data? That would be a great help, and it would save a lot of time and confusion compared with repeating steps that may go wrong.
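To make the size assertion concrete: with the hotness values from the check script, each multi-hot sample holds 1 label, 13 dense values and 214 category ids, 4 bytes each. Below is a rough sketch for telling which layout a file size is consistent with. The one-hot layout used here (one 4-byte id per table) is only an assumption about what the NVTabular path writes, and `guess_layout` is a made-up helper.

```python
# Hotness per categorical feature, copied from dataset_info in the check script.
HOTNESS = [3, 2, 1, 2, 6, 1, 1, 1, 1, 7, 3, 8, 1,
           6, 9, 5, 1, 1, 1, 12, 100, 27, 10, 3, 1, 1]
NUM_LABEL = 1
NUM_DENSE = 13
BYTES_PER_FIELD = 4  # every field is a 4-byte value in the check script's format


def bytes_per_sample(num_categorical_values):
    """Size of one sample given how many category ids it stores."""
    return (NUM_LABEL + NUM_DENSE + num_categorical_values) * BYTES_PER_FIELD


# Multi-hot: sum of hotness ids per sample. One-hot (assumed): one id per table.
MULTI_HOT_BYTES = bytes_per_sample(sum(HOTNESS))
ONE_HOT_BYTES = bytes_per_sample(len(HOTNESS))


def guess_layout(file_size, total_samples):
    """Report which layout, if any, the file size is consistent with."""
    if file_size == total_samples * MULTI_HOT_BYTES:
        return "multi-hot"
    if file_size == total_samples * ONE_HOT_BYTES:
        return "one-hot (assumed layout)"
    return "unknown"


print(MULTI_HOT_BYTES, ONE_HOT_BYTES)
```

It could be called as `guess_layout(os.path.getsize('/data/train_data.bin'), 4195197692)` to see whether the multi-hot conversion step is likely missing.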
@shijieliu - Thanks for providing the snippet. It was helpful in verifying the dataset.
We have executed and tested the code you provided on the dataset. You were right in pointing out that the conversion from a one-hot dataset to a multi-hot dataset was missing in a previous step.
However, we are currently facing another issue on the A100 (multi-node) platform with HUGECTR, specifically regarding communication.
fyi - we were able to successfully run BERT.
Any input or information would be greatly appreciated and will assist us in moving forward.
A100-02:2358430:2358556 [1] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
A100-02:2358430:2358556 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 15, retcode 3
A100-02:2358430:2358556 [1] NCCL INFO socketProgressOpt: abort called
A100-02:2358430:2358556 [1] NCCL INFO misc/socket.cc:805 -> 3
A100-02:2358430:2358556 [1] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
A100-02:2358430:2358556 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 14, retcode 3
Traceback (most recent call last):
File "/workspace/dlrm/train.py", line 368, in
/workspace/dlrm# python validate_dataset.py
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [28:09<00:00, 59.19it/s]
min_vocabulary_size [2 1 2 1 0 0 2 2 2 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2]
max_vocabulary_size [39999999 39059 15213 7421 20264 2 6776 1348 62 39999999 3067955 405281 9 2208 11937 154 3 973 13 39999999 39999999 39999999 590151 12972 98 34]
Hi @rgandikota glad to see the dataset seems right!
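The check being applied here is that every observed id stays below the vocabulary size of its table; the printed minimums are all non-negative. A small sketch of that comparison, using the vocabulary sizes from dataset_info and the maximums from the output above (`out_of_range_tables` is a hypothetical helper name):

```python
# Vocabulary size per table, taken from dataset_info in the check script.
vocabulary_sizes = [
    40000000, 39060, 17295, 7424, 20265, 3, 7122, 1543, 63, 40000000,
    3067956, 405282, 10, 2209, 11938, 155, 4, 976, 14, 40000000,
    40000000, 40000000, 590152, 12973, 108, 36,
]

# max_vocabulary_size as printed by validate_dataset.py above.
observed_max = [
    39999999, 39059, 15213, 7421, 20264, 2, 6776, 1348, 62, 39999999,
    3067955, 405281, 9, 2208, 11937, 154, 3, 973, 13, 39999999,
    39999999, 39999999, 590151, 12972, 98, 34,
]


def out_of_range_tables(max_ids, vocab_sizes):
    """Return indices of tables whose largest id would index past the table."""
    return [i for i, (m, v) in enumerate(zip(max_ids, vocab_sizes)) if m >= v]


print(out_of_range_tables(observed_max, vocabulary_sizes))
```

An empty list means no table saw an id at or beyond its vocabulary size in the scanned samples.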
As for your question, it seems that NCCL init failed, which happens before training starts. Could you try setting NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT and see if there is more specific information? Thanks!
Please find the log attached with the suggested configuration. (Anything more than 2 nodes results in an error.)
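If the training is launched from a Python wrapper, the two variables can be added to the child environment as below. This is just a sketch: `nccl_debug_env` and `run_with_nccl_debug` are hypothetical helper names, and under Slurm or mpirun the variables may need to be exported through the launcher instead.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None):
    """Copy an environment and enable NCCL init-time debug logging in it."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT"
    return env


def run_with_nccl_debug(command):
    """Run a command with the NCCL debug variables set; returns CompletedProcess."""
    return subprocess.run(command, env=nccl_debug_env(), check=False)


if __name__ == "__main__":
    # Harmless demonstration: the child process just echoes the variable back.
    run_with_nccl_debug(
        [sys.executable, "-c", "import os; print(os.environ['NCCL_DEBUG'])"]
    )
```

For the real run the command would be something like `[sys.executable, "train.py"]`.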
Example run on 4 nodes: slurm-938.log
Error Snippet:
A100-08:3103687:3103857 [2] ibvwrap.c:115 NCCL WARN Call to ibv_reg_mr_iova2 failed
A100-08:3103687:3103857 [2] NCCL INFO ib_plugin.c:634 -> 2
A100-08:3103687:3103857 [2] NCCL INFO transport/net.cc:680 -> 2
A100-08:3103687:3103857 [2] NCCL INFO proxy.cc:1306 -> 2
A100-08:3103687:3103857 [2] NCCL INFO proxy.cc:1377 -> 2
Note: It works with 2 nodes [slurm-935_2nodes.log]
A100-02:56101:56272 [3] ibvwrap.c:115 NCCL WARN Call to ibv_reg_mr_iova2 failed
A100-02:56101:56272 [3] NCCL INFO ib_plugin.c:634 -> 2
A100-02:56101:56272 [3] NCCL INFO transport/net.cc:680 -> 2
A100-02:56101:56272 [3] NCCL INFO proxy.cc:1306 -> 2
A100-02:56101:56272 [3] NCCL INFO proxy.cc:1377 -> 2
A100-02:56101:56272 [3] proxy.cc:1519 NCCL WARN [Proxy Service 11] Failed to execute operation Connect from rank 11, retcode 2
From the 4-node log, it seems there is a problem with the IB connection on 4 nodes, so the IB setup failed on ibv_reg_mr_iova2. I would suggest using nccl-tests to double-check whether NCCL works properly; if the problem still exists, the IB configuration needs to be checked.
Alternatively, you can set the environment variable NCCL_IB_DISABLE=1 to disable IB in NCCL, but that will hurt performance a lot.
@shijieliu - Thank you. Just to clarify, it works if we choose any two servers from the list. However, we encounter issues when selecting more than two servers.
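When the Slurm logs get long, it can help to pull out only the NCCL WARN lines and count them per host. The sketch below assumes the line format seen in the snippets above (`host:pid:tid [device] file:line NCCL WARN message`); `count_warnings` is a hypothetical helper.

```python
import re
from collections import Counter

# Line format assumed from the log snippets in this thread.
NCCL_WARN = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def count_warnings(lines):
    """Count NCCL WARN messages keyed by (host, message); other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = NCCL_WARN.match(line.strip())
        if m:
            counts[(m.group("host"), m.group("msg"))] += 1
    return counts


sample = [
    "A100-02:56101:56272 [3] ibvwrap.c:115 NCCL WARN Call to ibv_reg_mr_iova2 failed",
    "A100-02:56101:56272 [3] NCCL INFO ib_plugin.c:634 -> 2",
    "A100-08:3103687:3103857 [2] ibvwrap.c:115 NCCL WARN Call to ibv_reg_mr_iova2 failed",
]
for (host, msg), n in sorted(count_warnings(sample).items()):
    print(host, n, msg)
```

On a real log it would be fed the file object directly, for example `count_warnings(open("slurm-938.log"))`.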
Please find the mpirun results
mpirun -x NCCL_IB_GID_INDEX=3 -x LD_LIBRARY_PATH -np 8 -host A100-01:1,A100-02:1,A100-03:1,A100-04:1,A100-05:1,A100-06:1,A100-07:1,A100-08:1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 668913 on A100-01 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 668913 on A100-01 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 668913 on A100-01 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 668913 on A100-01 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 668913 on A100-01 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 668913 on A100-01 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 668913 on A100-01 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 668913 on A100-01 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 8 Group 0 Pid 629050 on A100-02 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 9 Group 0 Pid 629050 on A100-02 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 10 Group 0 Pid 629050 on A100-02 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 11 Group 0 Pid 629050 on A100-02 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 12 Group 0 Pid 629050 on A100-02 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 13 Group 0 Pid 629050 on A100-02 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 14 Group 0 Pid 629050 on A100-02 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 15 Group 0 Pid 629050 on A100-02 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 16 Group 0 Pid 627688 on A100-03 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 17 Group 0 Pid 627688 on A100-03 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 18 Group 0 Pid 627688 on A100-03 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 19 Group 0 Pid 627688 on A100-03 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 20 Group 0 Pid 627688 on A100-03 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 21 Group 0 Pid 627688 on A100-03 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 22 Group 0 Pid 627688 on A100-03 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 23 Group 0 Pid 627688 on A100-03 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 24 Group 0 Pid 634319 on A100-04 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 25 Group 0 Pid 634319 on A100-04 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 26 Group 0 Pid 634319 on A100-04 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 27 Group 0 Pid 634319 on A100-04 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 28 Group 0 Pid 634319 on A100-04 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 29 Group 0 Pid 634319 on A100-04 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 30 Group 0 Pid 634319 on A100-04 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 31 Group 0 Pid 634319 on A100-04 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 32 Group 0 Pid 641750 on A100-05 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 33 Group 0 Pid 641750 on A100-05 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 34 Group 0 Pid 641750 on A100-05 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 35 Group 0 Pid 641750 on A100-05 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 36 Group 0 Pid 641750 on A100-05 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 37 Group 0 Pid 641750 on A100-05 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 38 Group 0 Pid 641750 on A100-05 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 39 Group 0 Pid 641750 on A100-05 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 40 Group 0 Pid 1009189 on A100-06 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 41 Group 0 Pid 1009189 on A100-06 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 42 Group 0 Pid 1009189 on A100-06 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 43 Group 0 Pid 1009189 on A100-06 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 44 Group 0 Pid 1009189 on A100-06 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 45 Group 0 Pid 1009189 on A100-06 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 46 Group 0 Pid 1009189 on A100-06 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 47 Group 0 Pid 1009189 on A100-06 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 48 Group 0 Pid 369375 on A100-07 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 49 Group 0 Pid 369375 on A100-07 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 50 Group 0 Pid 369375 on A100-07 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 51 Group 0 Pid 369375 on A100-07 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 52 Group 0 Pid 369375 on A100-07 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 53 Group 0 Pid 369375 on A100-07 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 54 Group 0 Pid 369375 on A100-07 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 55 Group 0 Pid 369375 on A100-07 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 56 Group 0 Pid 389932 on A100-08 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 57 Group 0 Pid 389932 on A100-08 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 58 Group 0 Pid 389932 on A100-08 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 59 Group 0 Pid 389932 on A100-08 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 60 Group 0 Pid 389932 on A100-08 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 61 Group 0 Pid 389932 on A100-08 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 62 Group 0 Pid 389932 on A100-08 device 6 [0xc0] NVIDIA A100-SXM4-80GB
# Rank 63 Group 0 Pid 389932 on A100-08 device 7 [0xc3] NVIDIA A100-SXM4-80GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 4570.6 0.00 0.00 0 4465.4 0.00 0.00 0
16 4 float sum -1 4570.8 0.00 0.00 0 2533.6 0.00 0.00 0
32 8 float sum -1 4560.4 0.00 0.00 0 4465.0 0.00 0.00 0
64 16 float sum -1 4570.5 0.00 0.00 0 4209.5 0.00 0.00 0
128 32 float sum -1 4570.5 0.00 0.00 0 4571.8 0.00 0.00 0
256 64 float sum -1 200.5 0.00 0.00 0 4798.2 0.00 0.00 0
512 128 float sum -1 4570.5 0.00 0.00 0 4798.4 0.00 0.00 0
1024 256 float sum -1 4124.7 0.00 0.00 0 4563.5 0.00 0.00 0
2048 512 float sum -1 4796.4 0.00 0.00 0 4569.9 0.00 0.00 0
4096 1024 float sum -1 4576.6 0.00 0.00 0 4741.6 0.00 0.00 0
8192 2048 float sum -1 420.2 0.02 0.04 0 2659.5 0.00 0.01 0
16384 4096 float sum -1 1702.5 0.01 0.02 0 6618.2 0.00 0.00 0
32768 8192 float sum -1 12000 0.00 0.01 0 11217 0.00 0.01 0
65536 16384 float sum -1 4908.4 0.01 0.03 0 8330.8 0.01 0.02 0
131072 32768 float sum -1 9713.9 0.01 0.03 0 8160.6 0.02 0.03 0
262144 65536 float sum -1 11682 0.02 0.04 0 12827 0.02 0.04 0
524288 131072 float sum -1 10289 0.05 0.10 0 11678 0.04 0.09 0
1048576 262144 float sum -1 12367 0.08 0.17 0 5438.9 0.19 0.38 0
2097152 524288 float sum -1 11242 0.19 0.37 0 9516.6 0.22 0.43 0
4194304 1048576 float sum -1 7046.3 0.60 1.17 0 16278 0.26 0.51 0
8388608 2097152 float sum -1 8269.7 1.01 2.00 0 11725 0.72 1.41 0
16777216 4194304 float sum -1 14935 1.12 2.21 0 18339 0.91 1.80 0
33554432 8388608 float sum -1 26406 1.27 2.50 0 16494 2.03 4.01 0
67108864 16777216 float sum -1 69719 0.96 1.90 0 63498 1.06 2.08 0
134217728 33554432 float sum -1 27128 4.95 9.74 0 54305 2.47 4.87 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.719921
#
mpirun -x NCCL_IB_GID_INDEX=3 -x LD_LIBRARY_PATH -np 8 -host A100-01:1,A100-02:1,A100-03:1,A100-04:1,A100-05:1,A100-06:1,A100-07:1,A100-08:1 ./build/alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 669187 on A100-01 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 669187 on A100-01 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 669187 on A100-01 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 669187 on A100-01 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 669187 on A100-01 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 669187 on A100-01 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 669187 on A100-01 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 669187 on A100-01 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 8 Group 0 Pid 629384 on A100-02 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 9 Group 0 Pid 629384 on A100-02 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 10 Group 0 Pid 629384 on A100-02 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 11 Group 0 Pid 629384 on A100-02 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 12 Group 0 Pid 629384 on A100-02 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 13 Group 0 Pid 629384 on A100-02 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 14 Group 0 Pid 629384 on A100-02 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 15 Group 0 Pid 629384 on A100-02 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 16 Group 0 Pid 628018 on A100-03 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 17 Group 0 Pid 628018 on A100-03 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 18 Group 0 Pid 628018 on A100-03 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 19 Group 0 Pid 628018 on A100-03 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 20 Group 0 Pid 628018 on A100-03 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 21 Group 0 Pid 628018 on A100-03 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 22 Group 0 Pid 628018 on A100-03 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 23 Group 0 Pid 628018 on A100-03 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 24 Group 0 Pid 634691 on A100-04 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 25 Group 0 Pid 634691 on A100-04 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 26 Group 0 Pid 634691 on A100-04 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 27 Group 0 Pid 634691 on A100-04 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 28 Group 0 Pid 634691 on A100-04 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 29 Group 0 Pid 634691 on A100-04 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 30 Group 0 Pid 634691 on A100-04 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 31 Group 0 Pid 634691 on A100-04 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 32 Group 0 Pid 642079 on A100-05 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 33 Group 0 Pid 642079 on A100-05 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 34 Group 0 Pid 642079 on A100-05 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 35 Group 0 Pid 642079 on A100-05 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 36 Group 0 Pid 642079 on A100-05 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 37 Group 0 Pid 642079 on A100-05 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 38 Group 0 Pid 642079 on A100-05 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 39 Group 0 Pid 642079 on A100-05 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 40 Group 0 Pid 1009533 on A100-06 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 41 Group 0 Pid 1009533 on A100-06 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 42 Group 0 Pid 1009533 on A100-06 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 43 Group 0 Pid 1009533 on A100-06 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 44 Group 0 Pid 1009533 on A100-06 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 45 Group 0 Pid 1009533 on A100-06 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 46 Group 0 Pid 1009533 on A100-06 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 47 Group 0 Pid 1009533 on A100-06 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 48 Group 0 Pid 369712 on A100-07 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 49 Group 0 Pid 369712 on A100-07 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 50 Group 0 Pid 369712 on A100-07 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 51 Group 0 Pid 369712 on A100-07 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 52 Group 0 Pid 369712 on A100-07 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 53 Group 0 Pid 369712 on A100-07 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 54 Group 0 Pid 369712 on A100-07 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 55 Group 0 Pid 369712 on A100-07 device 7 [0xc4] NVIDIA A100-SXM4-80GB
# Rank 56 Group 0 Pid 390293 on A100-08 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 57 Group 0 Pid 390293 on A100-08 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 58 Group 0 Pid 390293 on A100-08 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 59 Group 0 Pid 390293 on A100-08 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 60 Group 0 Pid 390293 on A100-08 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 61 Group 0 Pid 390293 on A100-08 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 62 Group 0 Pid 390293 on A100-08 device 6 [0xc0] NVIDIA A100-SXM4-80GB
# Rank 63 Group 0 Pid 390293 on A100-08 device 7 [0xc3] NVIDIA A100-SXM4-80GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
0 0 float none -1 3534.3 0.00 0.00 0 2704.1 0.00 0.00 N/A
0 0 float none -1 3099.7 0.00 0.00 0 2125.0 0.00 0.00 N/A
0 0 float none -1 3002.5 0.00 0.00 0 3008.2 0.00 0.00 N/A
0 0 float none -1 4632.9 0.00 0.00 0 2210.1 0.00 0.00 N/A
0 0 float none -1 2660.9 0.00 0.00 0 2651.2 0.00 0.00 N/A
256 1 float none -1 3907.2 0.00 0.00 0 3129.9 0.00 0.00 N/A
512 2 float none -1 2462.0 0.00 0.00 0 2431.6 0.00 0.00 N/A
1024 4 float none -1 2811.1 0.00 0.00 0 3486.8 0.00 0.00 N/A
2048 8 float none -1 2897.2 0.00 0.00 0 5039.0 0.00 0.00 N/A
4096 16 float none -1 3429.5 0.00 0.00 0 3816.3 0.00 0.00 N/A
8192 32 float none -1 2505.1 0.00 0.00 0 2878.3 0.00 0.00 N/A
16384 64 float none -1 3662.6 0.00 0.00 0 3895.9 0.00 0.00 N/A
32768 128 float none -1 2944.6 0.01 0.01 0 4222.1 0.01 0.01 N/A
65536 256 float none -1 2668.4 0.02 0.02 0 3918.0 0.02 0.02 N/A
131072 512 float none -1 4911.2 0.03 0.03 0 2682.6 0.05 0.05 N/A
262144 1024 float none -1 3467.1 0.08 0.07 0 4439.0 0.06 0.06 N/A
524288 2048 float none -1 3285.0 0.16 0.16 0 3533.2 0.15 0.15 N/A
1048576 4096 float none -1 3316.6 0.32 0.31 0 3901.3 0.27 0.26 N/A
2097152 8192 float none -1 6127.9 0.34 0.34 0 6246.9 0.34 0.33 N/A
4194304 16384 float none -1 9784.5 0.43 0.42 0 6545.6 0.64 0.63 N/A
8388608 32768 float none -1 8146.6 1.03 1.01 0 6597.8 1.27 1.25 N/A
16777216 65536 float none -1 15541 1.08 1.06 0 10896 1.54 1.52 N/A
33554432 131072 float none -1 56507 0.59 0.58 0 51233 0.65 0.64 N/A
67108864 262144 float none -1 53106 1.26 1.24 0 49295 1.36 1.34 N/A
134217728 524288 float none -1 76311 1.76 1.73 0 76799 1.75 1.72 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.299857
#
Thanks @jndinesh
The allreduce and all2all performance in NCCL is very poor compared with what is expected over IB. This is a performance issue, but it may point to a functional issue as well, for example that nccl-tests is not using IB. Could you try setting NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT with nccl-tests and share the log? Thanks!
According to NCCL, could you also try removing the user limits on registering pinned memory? For example, if you are using Docker, try adding --shm-size=1g --ulimit memlock=-1 when starting the container.
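For reference, the bandwidth columns in the tables above can be reproduced from the size and time columns. The sketch below uses the correction factors as I understand nccl-tests defines them: 2(n-1)/n for allreduce and (n-1)/n for alltoall, with n the number of ranks. The function names are made up for this example.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw(size_bytes, time_us, n_ranks):
    """Bus bandwidth for allreduce: algbw scaled by 2(n-1)/n."""
    return algbw_gbps(size_bytes, time_us) * 2 * (n_ranks - 1) / n_ranks


def alltoall_busbw(size_bytes, time_us, n_ranks):
    """Bus bandwidth for alltoall: algbw scaled by (n-1)/n."""
    return algbw_gbps(size_bytes, time_us) * (n_ranks - 1) / n_ranks


# Largest message (134217728 bytes), out-of-place times from the 64-rank runs above.
print("allreduce busbw:", round(allreduce_busbw(134217728, 27128, 64), 2))
print("alltoall busbw:", round(alltoall_busbw(134217728, 76311, 64), 2))
```

Both values land on the busbw reported in the corresponding table rows, so the low numbers come from the measured times themselves and not from how the tool reports them.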
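A quick way to see the locked-memory limit that a process actually inherits (inside the container or inside the job step) is Python's standard `resource` module. A minimal sketch for Linux hosts only; `memlock_limits` and `describe` are hypothetical helper names.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values for the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(value):
    """Render a limit as 'unlimited' or as a byte count."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


soft, hard = memlock_limits()
print("memlock soft:", describe(soft), "hard:", describe(hard))
```

Running it in the same environment as the training processes matters, since the limit seen in an interactive shell can differ from the one a launcher hands to its children.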
@shijieliu - Thank you for your quick response and suggestion. Your attention to detail is much appreciated. Please find attached the log results.txt, which was produced with the environment variables you suggested set while executing mpirun.
mpirun was executed on a host machine (Bare metal server) and not on a container environment. However, I did observe RDMA statistics when mpirun was running.
Please find the config on host machine: /etc/security/limits.conf
Thanks, Dinesh
@jndinesh The reason for the poor nccl-tests performance may be a wrong NUMA configuration in the test. Could you try running nccl-tests with mpirun -np 64 --bind-to numa --host A100-01:8,... (notice the :8) and remove -g 8 from the all_reduce_perf arguments? Let's see if this solves the performance issue.
Another thing I want to check is your card type. Could you share the output of lspci? Thanks!
Please find below the attached lspci.txt log and mpirun output.
Please note that instead of using 64, it was configured to use 8 in mpirun -np 8.
Configuring with 64 instead of 8 will result in an error. For your reference, the log with that configuration is also attached.
mpirun -x LD_LIBRARY_PATH -np 8 --bind-to numa --host A100-01:8 ./build/all_reduce_perf -b 8 -e 128M -f 2
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 932421 on A100-01 device 0 [0x07] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 932422 on A100-01 device 1 [0x0a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 932423 on A100-01 device 2 [0x44] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 932424 on A100-01 device 3 [0x4a] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 932425 on A100-01 device 4 [0x84] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 932426 on A100-01 device 5 [0x8a] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 932427 on A100-01 device 6 [0xc1] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 932429 on A100-01 device 7 [0xc4] NVIDIA A100-SXM4-80GB
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 4450.3 0.00 0.00 0 137.8 0.00 0.00 0
16 4 float sum -1 137.8 0.00 0.00 0 137.7 0.00 0.00 0
32 8 float sum -1 137.9 0.00 0.00 0 138.0 0.00 0.00 0
64 16 float sum -1 137.9 0.00 0.00 0 137.8 0.00 0.00 0
128 32 float sum -1 138.3 0.00 0.00 0 138.3 0.00 0.00 0
256 64 float sum -1 139.4 0.00 0.00 0 139.6 0.00 0.00 0
512 128 float sum -1 139.5 0.00 0.01 0 139.8 0.00 0.01 0
1024 256 float sum -1 140.9 0.01 0.01 0 140.6 0.01 0.01 0
2048 512 float sum -1 141.0 0.01 0.03 0 141.4 0.01 0.03 0
4096 1024 float sum -1 143.2 0.03 0.05 0 143.1 0.03 0.05 0
8192 2048 float sum -1 146.1 0.06 0.10 0 144.5 0.06 0.10 0
16384 4096 float sum -1 145.6 0.11 0.20 0 144.9 0.11 0.20 0
32768 8192 float sum -1 150.8 0.22 0.38 0 149.7 0.22 0.38 0
65536 16384 float sum -1 151.2 0.43 0.76 0 148.7 0.44 0.77 0
131072 32768 float sum -1 149.8 0.87 1.53 0 147.8 0.89 1.55 0
262144 65536 float sum -1 149.7 1.75 3.06 0 148.9 1.76 3.08 0
524288 131072 float sum -1 154.5 3.39 5.94 0 154.8 3.39 5.93 0
1048576 262144 float sum -1 158.6 6.61 11.57 0 3987.5 0.26 0.46 0
2097152 524288 float sum -1 186.6 11.24 19.67 0 195.0 10.75 18.82 0
4194304 1048576 float sum -1 207.8 20.19 35.33 0 207.6 20.21 35.36 0
8388608 2097152 float sum -1 7664.4 1.09 1.92 0 376.6 22.27 38.98 0
16777216 4194304 float sum -1 1514.5 11.08 19.39 0 591.8 28.35 49.61 0
33554432 8388608 float sum -1 1302.6 25.76 45.08 0 2197.5 15.27 26.72 0
67108864 16777216 float sum -1 3326.5 20.17 35.30 0 6675.0 10.05 17.59 0
134217728 33554432 float sum -1 10766 12.47 21.82 0 16400 8.18 14.32 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.32216
#
mpirun -x LD_LIBRARY_PATH -np 64 --bind-to numa --host A100-01:8 ./build/all_reduce_perf -b 8 -e 128M -f 2
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 64
slots that were requested by the application:
./build/all_reduce_perf
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
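The error above follows from the slot count: `--host A100-01:8` provides 8 slots, while `-np 64` asks for 64. Going by the `:N` rule quoted in the Open MPI message (N defaults to 1 when omitted), a small sketch for checking a host list before launching; `total_slots` and `host_spec` are hypothetical helpers, and the host names are the ones used in this thread.

```python
def total_slots(host_spec_string):
    """Sum the slots in an mpirun --host value such as 'A100-01:8,A100-02:8'."""
    total = 0
    for entry in host_spec_string.split(","):
        _name, sep, slots = entry.partition(":")
        # Per the Open MPI message above, a missing ':N' suffix means 1 slot.
        total += int(slots) if sep else 1
    return total


def host_spec(hosts, slots_per_host):
    """Build a --host value that gives every host the same number of slots."""
    return ",".join(f"{h}:{slots_per_host}" for h in hosts)


hosts = [f"A100-{i:02d}" for i in range(1, 9)]
spec = host_spec(hosts, 8)
print(spec)
print("slots:", total_slots(spec), "vs single host:", total_slots("A100-01:8"))
```

With all eight hosts listed at 8 slots each, the total matches `-np 64`, which is what the suggested command needed.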
After checking with @jndinesh and @rgandikota, the issue was solved by adding --propagate=STACK when launching srun.
Describe the bug
Facing a CUDNN_STATUS_MAPPING_ERROR with cudnnSetStream while running the MLCommons Training benchmark: https://github.com/mlcommons/training_results_v3.1/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr
To Reproduce
Expected behavior
The DLRM reference implementation should start training on the cluster.
Environment (please complete the following information):
Additional context