NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[BUG] CUDNN_STATUS_MAPPING_ERROR with cudnnSetStream #433

Closed rgandikota closed 6 months ago

rgandikota commented 7 months ago

Describe the bug: Facing a CUDNN_STATUS_MAPPING_ERROR from cudnnSetStream while running the MLCommons Training benchmark: https://github.com/mlcommons/training_results_v3.1/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

To Reproduce

  1. Followed the steps in this README.md for data preprocessing followed by training execution: https://github.com/mlcommons/training_results_v3.1/blob/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr/README.md
  2. We ran this both on a single node and on an 8-node cluster, with each server containing 8 GPUs, sourcing the appropriate config for each setup.

Expected behavior: The DLRM reference implementation should start training on the cluster.

Environment (please complete the following information):

Additional context

  1. Able to reliably reproduce this issue when executing this code both on a single node and 8 node cluster
  2. Stack trace:
     terminate called recursively
     what(): Runtime error: CUDNN_STATUS_MAPPING_ERROR cudnnSetStream(cudnnhandle, current_stream) (set_stream @ /workspace/dlrm/hugectr/HugeCTR/include/gpu_resource.hpp:80)
     terminate called recursively
     terminate called recursively
     terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
     [A100-06:286040] Process received signal
nv-ananjappa commented 7 months ago

@minseokl Could you help investigate this issue with running NVIDIA's MLPerf Training workload?

shijieliu commented 7 months ago

hi @rgandikota what's the configuration you are using? Is it this one? If yes, could you turn off cuda graph and overlap to see if there is a more specific error message? The cuda graph and overlap can be turned off here.
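For reference, a minimal sketch of what turning these off can look like in the solver configuration; use_cuda_graph is a documented CreateSolver argument, while the overlap flag names below are assumptions based on recent HugeCTR releases and may be exposed differently in the MLPerf train.py:

import hugectr

# Minimal sketch, not the full MLPerf configuration: only the flags relevant
# to the suggestion above. Batch sizes are taken from the log in this thread.
solver = hugectr.CreateSolver(
    batchsize=8192,
    batchsize_eval=16384,
    vvgpu=[[0, 1, 2, 3, 4, 5, 6, 7]],     # one entry per node; assumes a single 8-GPU node
    use_mixed_precision=False,
    use_cuda_graph=False,                 # disable CUDA graphs
    train_intra_iteration_overlap=False,  # assumed flag name: disable overlap
    train_inter_iteration_overlap=False,  # assumed flag name
    eval_intra_iteration_overlap=False,   # assumed flag name
    eval_inter_iteration_overlap=False,   # assumed flag name
)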

rgandikota commented 7 months ago

hi @rgandikota what's the configuration you are using? Is it this one? If yes, could you turn off cuda graph and overlap to see if there is a more specific error message? The cuda graph and overlap can be turned off here.

Hi @shijieliu. We ran the same training after turning off both cuda graphs and overlap. The error is still the same. Please find the full stack trace below.

Wanted to highlight a warning we are seeing; not sure if this can cause issues:

[1701758842.513479] [735c14efcde7:878 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1701758842.525161] [735c14efcde7:878 :0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)

HugeCTR Version: 23.8

Logs

=====================================================ModelFit=====================================================
[HCTR][06:48:08.684][INFO][RK0][main]: Use non-epoch mode with number of iterations: 512110
[HCTR][06:48:08.684][INFO][RK0][main]: Training batchsize: 8192, evaluation batchsize: 16384
[HCTR][06:48:08.684][INFO][RK0][main]: Evaluation interval: 25605, snapshot interval: 2000000
[HCTR][06:48:08.684][INFO][RK0][main]: Dense network trainable: True
[HCTR][06:48:08.684][INFO][RK0][main]: Use mixed precision: False, scaler: 1.000000, use cuda graph: False
[HCTR][06:48:08.684][INFO][RK0][main]: lr: 0.005000, warmup_steps: 0, end_lr: 0.000000
[HCTR][06:48:08.684][INFO][RK0][main]: decay_start: 0, decay_steps: 0, decay_power: 2.000000
[HCTR][06:48:08.684][INFO][RK0][main]: Training source file: /data/train_data.bin
[HCTR][06:48:08.684][INFO][RK0][main]: Evaluation source file: /data_val/val_data.bin
:::MLLOG {"namespace": "", "time_ms": 1701758888684, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/dlrm/mlperf_logger/callbacks.py", "lineno": 50}}
:::MLLOG {"namespace": "", "time_ms": 1701758888685, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/workspace/dlrm/mlperf_logger/callbacks.py", "lineno": 50}}
:::MLLOG {"namespace": "", "time_ms": 1701758888685, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "/workspace/dlrm/mlperf_logger/callbacks.py", "lineno": 51, "epoch_num": 0}}
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
what(): Runtime error: CUDNN_STATUS_MAPPING_ERROR cudnnSetStream(cudnn_handle_, current_stream) (set_stream @ /workspace/dlrm/hugectr/HugeCTR/include/gpu_resource.hpp:80)
[735c14efcde7:00878] *** Process received signal ***
[735c14efcde7:00878] Signal: Aborted (6)
[735c14efcde7:00878] Signal code: (-6)
[735c14efcde7:00878] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ffba5754520]
[735c14efcde7:00878] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ffba57a8a7c]
[735c14efcde7:00878] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ffba5754476]
[735c14efcde7:00878] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ffba573a7f3]
[735c14efcde7:00878] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7ffb9c807b9e]
[735c14efcde7:00878] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7ffb9c81320c]
[735c14efcde7:00878] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7ffb9c8121e9]
[735c14efcde7:00878] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7ffb9c812959]
[735c14efcde7:00878] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7ffba29f6884]
[735c14efcde7:00878] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7ffba29f6f41]
[735c14efcde7:00878] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7ffb9c8134cb]
[735c14efcde7:00878] [11] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR11GPUResource10set_streamERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x345)[0x7ffaeec42895]
[735c14efcde7:00878] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR25StreamContextScheduleable3runESt10shared_ptrINS_11GPUResourceEEb+0x46d)[0x7ffaeec41f8d]
[735c14efcde7:00878] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR8Pipeline3runEv+0x10b)[0x7ffaeec40beb]
[735c14efcde7:00878] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xe0383c)[0x7ffaeece483c]
[735c14efcde7:00878] [15] /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e)[0x7ffb9c738c0e]
[735c14efcde7:00878] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7ffba57a6b43]
[735c14efcde7:00878] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7ffba5838a00]
[735c14efcde7:00878] *** End of error message ***

shijieliu commented 7 months ago

It would be helpful if you could make the UCP versions compatible.

What are the command-line arguments you are passing to the training script?

jndinesh commented 7 months ago

@shijieliu - Here are the details (these runs were on a single node).

Test run 1: We set the environment as configured here

root@5795011ad9d8:/workspace/dlrm# source config_DGXH100_1x8x6912.sh
root@5795011ad9d8:/workspace/dlrm# python train.py

docker run --shm-size=1g --ulimit memlock=-1 --cap-add=sys_nice --security-opt seccomp=unconfined --runtime=nvidia --rm -it -v /mnt/weka/mlperf/data/dlrm/dataset/criteo_binary/:/data -v /mnt/weka/mlperf/data/dlrm/dataset/criteo_binary/:/data_val -it dlrm-mlperf:1

Test run 2: with default values as defined in the code (code location):

docker run --shm-size=1g --ulimit memlock=-1 --cap-add=sys_nice --security-opt seccomp=unconfined --runtime=nvidia --rm -it -v /dlrm/dataset/criteo_binary/:/data -v /dlrm/dataset/criteo_binary/:/data_val -it dlrm-mlperf:

root@5795011ad9d8:/workspace/dlrm# python train.py

shijieliu commented 7 months ago

I ran with the default values and it works well. This is my log.txt.

I suspect the data preprocessing may be wrong, which would lead to illegal memory access due to overflow in the embedding input. Could you use the following code to check your dataset?

import os
import numpy as np
import struct
import random
from tqdm import tqdm

alpha = 1.1
output_filename = '/data/train_data.bin'

dataset_info = [
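    # Each tuple is (number of tables, multi-hot hotness, vocabulary size) for one categorical slot.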
    (1, 3, 40000000),
    (1, 2, 39060),
    (1, 1, 17295),
    (1, 2, 7424),
    (1, 6, 20265),
    (1, 1, 3),
    (1, 1, 7122),
    (1, 1, 1543),
    (1, 1, 63),
    (1, 7, 40000000),
    (1, 3, 3067956),
    (1, 8, 405282),
    (1, 1, 10),
    (1, 6, 2209),
    (1, 9, 11938),
    (1, 5, 155),
    (1, 1, 4),
    (1, 1, 976),
    (1, 1, 14),
    (1, 12, 40000000),
    (1, 100, 40000000),
    (1, 27, 40000000),
    (1, 10, 590152),
    (1, 3, 12973),
    (1, 1, 108),
    (1, 1, 36),
]
offset = np.cumsum([0] + [v[1] for v in dataset_info])
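# offset[i]:offset[i+1] is the column range of slot i in the concatenated category features.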
num_dense_features = 13
num_label = 1

total_samples = 4195197692
n = 1024

num_cate_features = sum([num_table * hotness for (num_table, hotness, _) in dataset_info])
max_vocabulary_size = sum([v for _, _, v in dataset_info])
item_num_per_sample = 1 + num_dense_features + num_cate_features
sample_format = r"1I" + str(num_dense_features) + "f" + str(num_cate_features) + "I"
sample_size_in_bytes = 1 * 4 + num_dense_features * 4 + num_cate_features * 4
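# i.e. each sample is one uint32 label, num_dense_features float32 values, and num_cate_features uint32 category IDs (4 bytes each).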

min_vocabulary_size = np.asarray([np.iinfo(np.int64).max for _ in range(26)])
max_vocabulary_size = np.asarray([np.iinfo(np.int64).min for _ in range(26)])

assert os.path.getsize(output_filename) == total_samples * sample_size_in_bytes
with open(output_filename, "rb") as file:
    for i in tqdm(range(100000)):
        samples_bytes = file.read(sample_size_in_bytes * n)
        samples = struct.unpack(sample_format * n, samples_bytes)
        samples = np.asarray(samples).reshape(n, -1)
        cate_features = samples[:, num_dense_features + num_label:].astype(np.int64)
        min_cate_features = np.asarray([np.min(cate_features[:, start: end].reshape(-1)) for start, end in zip(offset, offset[1:])])
        min_vocabulary_size = np.minimum(min_cate_features, min_vocabulary_size)
        max_cate_features = np.asarray([np.max(cate_features[:, start: end].reshape(-1)) for start, end in zip(offset, offset[1:])])
        max_vocabulary_size = np.maximum(max_cate_features, max_vocabulary_size)

print('min_vocabulary_size', min_vocabulary_size)
print('max_vocabulary_size', max_vocabulary_size)

The script prints the range of the input category IDs; on my side the output is:

min_vocabulary_size [0 1 2 1 0 0 2 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2]
max_vocabulary_size [39999999    39059    17176     7421    20264        2     7085     1534  62 39999996  3067955   405281        9     2208    11937      154        3      973       13 39999999 39999999 39999999   590151    12972      107       34]

You can share your output so I can help check.
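As an additional check, a small sketch that reuses dataset_info and the max_vocabulary_size array computed by the script above: any observed ID at or above its slot's vocabulary size is exactly what would overflow the embedding input.

# Compare observed maxima against the configured cardinalities per slot.
cardinalities = np.asarray([vocab for _, _, vocab in dataset_info])
bad_slots = np.nonzero(max_vocabulary_size >= cardinalities)[0]
print("slots with out-of-range category IDs:", bad_slots.tolist())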

rgandikota commented 6 months ago

@shijieliu The script for validating our data fails on a size assertion by the looks of it. Please find the output below:

python test_preprocessing.py
Traceback (most recent call last):
  File "test_preprocessing.py", line 55, in <module>
    assert os.path.getsize(output_filename) == total_samples * sample_size_in_bytes
AssertionError
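For reference, the mismatch can be quantified with a small sketch that reuses output_filename, sample_size_in_bytes, and total_samples from the validation script:

# How many fixed-size samples does the file on disk actually hold?
actual_bytes = os.path.getsize(output_filename)
print("samples on disk:", actual_bytes // sample_size_in_bytes,
      "expected:", total_samples,
      "trailing bytes:", actual_bytes % sample_size_in_bytes)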

Here is the process we used to preprocess our data using NVTabular to speed up the preprocessing: https://github.com/pytorch/torchrec/tree/main/torchrec/datasets/scripts/nvt

Is the script you shared specific to the output of the CPU-only preprocessing method mentioned in this README? https://github.com/mlcommons/training_results_v3.1/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

shijieliu commented 6 months ago

Is the script you shared specific to the output of the CPU-only preprocessing method mentioned in this README?

Yes. That README should match our training scripts. At least one difference I can tell between the NVTabular-based preprocessing and our README is that the NVTabular version lacks the conversion of the one-hot dataset to a multi-hot dataset, which is step 1.5 in our README.

jndinesh commented 6 months ago

Thanks much @shijieliu. Is there any way we can download the pre-processed data? That would be a great help: it would save a lot of time and avoid the confusion of repeating steps that may go wrong.

jndinesh commented 6 months ago

@shijieliu - Thanks for providing the snippet. It was helpful in verifying the dataset.

We have executed and tested the code you provided on the dataset. You were right in pointing out that the conversion from a one-hot dataset to a multi-hot dataset was missing in a previous step.

However, we are currently facing another issue on the A100 (multi-node) platform with HugeCTR, specifically regarding communication.

FYI, we were able to successfully run BERT.

Any input or information would be greatly appreciated and will assist us in moving forward.

A100-02:2358430:2358556 [1] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
A100-02:2358430:2358556 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 15, retcode 3
A100-02:2358430:2358556 [1] NCCL INFO socketProgressOpt: abort called
A100-02:2358430:2358556 [1] NCCL INFO misc/socket.cc:805 -> 3
A100-02:2358430:2358556 [1] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
A100-02:2358430:2358556 [1] proxy.cc:1519 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 14, retcode 3

Traceback (most recent call last):
  File "/workspace/dlrm/train.py", line 368, in <module>
    model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: internal error - please report this issue to the NCCL developers ncclGroupEnd() (all2all_warmup @ /workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_core.cpp:63)

/workspace/dlrm# python validate_dataset.py
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [28:09<00:00, 59.19it/s]
min_vocabulary_size [2 1 2 1 0 0 2 2 2 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2]
max_vocabulary_size [39999999 39059 15213 7421 20264 2 6776 1348 62 39999999 3067955 405281 9 2208 11937 154 3 973 13 39999999 39999999 39999999 590151 12972 98 34]

shijieliu commented 6 months ago

Hi @rgandikota glad to see the dataset seems right!

As for your question, it seems that the NCCL init failed, which happens before training starts. Could you try setting NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT and see if there is more specific information? Thanks!
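One way to make sure those variables reach each rank, a minimal sketch assuming they are set before any NCCL communicator is created (exporting them through the launcher, e.g. mpirun -x, works as well):

import os

# NCCL reads these when the communicator is created, so setting them at the
# top of train.py before the HugeCTR model is built is enough for the local
# rank. setdefault keeps any value already exported by the launcher.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT")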

jndinesh commented 6 months ago

Please find the log attached with the suggested configuration. (Anything more than 2 nodes results in an error.)

Example run on 4 nodes: slurm-938.log

Error snippet:

A100-08:3103687:3103857 [2] ibvwrap.c:115 NCCL WARN Call to ibv_reg_mr_iova2 failed
A100-08:3103687:3103857 [2] NCCL INFO ib_plugin.c:634 -> 2
A100-08:3103687:3103857 [2] NCCL INFO transport/net.cc:680 -> 2
A100-08:3103687:3103857 [2] NCCL INFO proxy.cc:1306 -> 2
A100-08:3103687:3103857 [2] NCCL INFO proxy.cc:1377 -> 2

Note: It works with 2 nodes [slurm-935_2nodes.log]

slurm-935_2nodes.log

shijieliu commented 6 months ago
A100-02:56101:56272 [3] ibvwrap.c:115 NCCL WARN Call to ibv_reg_mr_iova2 failed
A100-02:56101:56272 [3] NCCL INFO ib_plugin.c:634 -> 2
A100-02:56101:56272 [3] NCCL INFO transport/net.cc:680 -> 2
A100-02:56101:56272 [3] NCCL INFO proxy.cc:1306 -> 2
A100-02:56101:56272 [3] NCCL INFO proxy.cc:1377 -> 2

A100-02:56101:56272 [3] proxy.cc:1519 NCCL WARN [Proxy Service 11] Failed to execute operation Connect from rank 11, retcode 2

From the 4-node log, it seems there is some problem with the IB connection across 4 nodes, so the IB setup failed in ibv_reg_mr_iova2. I would suggest using nccl_test to double-check that NCCL works properly; if the problem still exists, the IB configuration needs to be checked.

Or you can set the env var NCCL_IB_DISABLE=1 to disable IB in NCCL, but it will hurt performance a lot.

jndinesh commented 6 months ago

@shijieliu - Thank you. Just to clarify, it works if we choose any two servers from the list; however, we encounter issues when selecting more than two servers.

jndinesh commented 6 months ago

Please find the mpirun results


mpirun -x NCCL_IB_GID_INDEX=3 -x LD_LIBRARY_PATH -np 8 -host A100-01:1,A100-02:1,A100-03:1,A100-04:1,A100-05:1,A100-06:1,A100-07:1,A100-08:1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 668913 on    A100-01 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 668913 on    A100-01 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 668913 on    A100-01 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 668913 on    A100-01 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 668913 on    A100-01 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 668913 on    A100-01 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 668913 on    A100-01 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 668913 on    A100-01 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank  8 Group  0 Pid 629050 on    A100-02 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  9 Group  0 Pid 629050 on    A100-02 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 10 Group  0 Pid 629050 on    A100-02 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 11 Group  0 Pid 629050 on    A100-02 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 12 Group  0 Pid 629050 on    A100-02 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 13 Group  0 Pid 629050 on    A100-02 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 14 Group  0 Pid 629050 on    A100-02 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 15 Group  0 Pid 629050 on    A100-02 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 16 Group  0 Pid 627688 on    A100-03 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 17 Group  0 Pid 627688 on    A100-03 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 18 Group  0 Pid 627688 on    A100-03 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 19 Group  0 Pid 627688 on    A100-03 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 20 Group  0 Pid 627688 on    A100-03 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 21 Group  0 Pid 627688 on    A100-03 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 22 Group  0 Pid 627688 on    A100-03 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 23 Group  0 Pid 627688 on    A100-03 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 24 Group  0 Pid 634319 on    A100-04 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 25 Group  0 Pid 634319 on    A100-04 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 26 Group  0 Pid 634319 on    A100-04 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 27 Group  0 Pid 634319 on    A100-04 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 28 Group  0 Pid 634319 on    A100-04 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 29 Group  0 Pid 634319 on    A100-04 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 30 Group  0 Pid 634319 on    A100-04 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 31 Group  0 Pid 634319 on    A100-04 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 32 Group  0 Pid 641750 on    A100-05 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 33 Group  0 Pid 641750 on    A100-05 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 34 Group  0 Pid 641750 on    A100-05 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 35 Group  0 Pid 641750 on    A100-05 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 36 Group  0 Pid 641750 on    A100-05 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 37 Group  0 Pid 641750 on    A100-05 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 38 Group  0 Pid 641750 on    A100-05 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 39 Group  0 Pid 641750 on    A100-05 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 40 Group  0 Pid 1009189 on    A100-06 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 41 Group  0 Pid 1009189 on    A100-06 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 42 Group  0 Pid 1009189 on    A100-06 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 43 Group  0 Pid 1009189 on    A100-06 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 44 Group  0 Pid 1009189 on    A100-06 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 45 Group  0 Pid 1009189 on    A100-06 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 46 Group  0 Pid 1009189 on    A100-06 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 47 Group  0 Pid 1009189 on    A100-06 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 48 Group  0 Pid 369375 on    A100-07 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 49 Group  0 Pid 369375 on    A100-07 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 50 Group  0 Pid 369375 on    A100-07 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 51 Group  0 Pid 369375 on    A100-07 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 52 Group  0 Pid 369375 on    A100-07 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 53 Group  0 Pid 369375 on    A100-07 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 54 Group  0 Pid 369375 on    A100-07 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 55 Group  0 Pid 369375 on    A100-07 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 56 Group  0 Pid 389932 on    A100-08 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 57 Group  0 Pid 389932 on    A100-08 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 58 Group  0 Pid 389932 on    A100-08 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 59 Group  0 Pid 389932 on    A100-08 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 60 Group  0 Pid 389932 on    A100-08 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 61 Group  0 Pid 389932 on    A100-08 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 62 Group  0 Pid 389932 on    A100-08 device  6 [0xc0] NVIDIA A100-SXM4-80GB
#  Rank 63 Group  0 Pid 389932 on    A100-08 device  7 [0xc3] NVIDIA A100-SXM4-80GB

#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1   4570.6    0.00    0.00      0   4465.4    0.00    0.00      0
          16             4     float     sum      -1   4570.8    0.00    0.00      0   2533.6    0.00    0.00      0
          32             8     float     sum      -1   4560.4    0.00    0.00      0   4465.0    0.00    0.00      0
          64            16     float     sum      -1   4570.5    0.00    0.00      0   4209.5    0.00    0.00      0
         128            32     float     sum      -1   4570.5    0.00    0.00      0   4571.8    0.00    0.00      0
         256            64     float     sum      -1    200.5    0.00    0.00      0   4798.2    0.00    0.00      0
         512           128     float     sum      -1   4570.5    0.00    0.00      0   4798.4    0.00    0.00      0
        1024           256     float     sum      -1   4124.7    0.00    0.00      0   4563.5    0.00    0.00      0
        2048           512     float     sum      -1   4796.4    0.00    0.00      0   4569.9    0.00    0.00      0
        4096          1024     float     sum      -1   4576.6    0.00    0.00      0   4741.6    0.00    0.00      0
        8192          2048     float     sum      -1    420.2    0.02    0.04      0   2659.5    0.00    0.01      0
       16384          4096     float     sum      -1   1702.5    0.01    0.02      0   6618.2    0.00    0.00      0
       32768          8192     float     sum      -1    12000    0.00    0.01      0    11217    0.00    0.01      0
       65536         16384     float     sum      -1   4908.4    0.01    0.03      0   8330.8    0.01    0.02      0
      131072         32768     float     sum      -1   9713.9    0.01    0.03      0   8160.6    0.02    0.03      0
      262144         65536     float     sum      -1    11682    0.02    0.04      0    12827    0.02    0.04      0
      524288        131072     float     sum      -1    10289    0.05    0.10      0    11678    0.04    0.09      0
     1048576        262144     float     sum      -1    12367    0.08    0.17      0   5438.9    0.19    0.38      0
     2097152        524288     float     sum      -1    11242    0.19    0.37      0   9516.6    0.22    0.43      0
     4194304       1048576     float     sum      -1   7046.3    0.60    1.17      0    16278    0.26    0.51      0
     8388608       2097152     float     sum      -1   8269.7    1.01    2.00      0    11725    0.72    1.41      0
    16777216       4194304     float     sum      -1    14935    1.12    2.21      0    18339    0.91    1.80      0
    33554432       8388608     float     sum      -1    26406    1.27    2.50      0    16494    2.03    4.01      0
    67108864      16777216     float     sum      -1    69719    0.96    1.90      0    63498    1.06    2.08      0
   134217728      33554432     float     sum      -1    27128    4.95    9.74      0    54305    2.47    4.87      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.719921
#
mpirun -x NCCL_IB_GID_INDEX=3 -x LD_LIBRARY_PATH -np 8 -host A100-01:1,A100-02:1,A100-03:1,A100-04:1,A100-05:1,A100-06:1,A100-07:1,A100-08:1 ./build/alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 669187 on    A100-01 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 669187 on    A100-01 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 669187 on    A100-01 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 669187 on    A100-01 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 669187 on    A100-01 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 669187 on    A100-01 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 669187 on    A100-01 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 669187 on    A100-01 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank  8 Group  0 Pid 629384 on    A100-02 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  9 Group  0 Pid 629384 on    A100-02 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 10 Group  0 Pid 629384 on    A100-02 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 11 Group  0 Pid 629384 on    A100-02 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 12 Group  0 Pid 629384 on    A100-02 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 13 Group  0 Pid 629384 on    A100-02 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 14 Group  0 Pid 629384 on    A100-02 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 15 Group  0 Pid 629384 on    A100-02 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 16 Group  0 Pid 628018 on    A100-03 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 17 Group  0 Pid 628018 on    A100-03 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 18 Group  0 Pid 628018 on    A100-03 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 19 Group  0 Pid 628018 on    A100-03 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 20 Group  0 Pid 628018 on    A100-03 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 21 Group  0 Pid 628018 on    A100-03 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 22 Group  0 Pid 628018 on    A100-03 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 23 Group  0 Pid 628018 on    A100-03 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 24 Group  0 Pid 634691 on    A100-04 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 25 Group  0 Pid 634691 on    A100-04 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 26 Group  0 Pid 634691 on    A100-04 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 27 Group  0 Pid 634691 on    A100-04 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 28 Group  0 Pid 634691 on    A100-04 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 29 Group  0 Pid 634691 on    A100-04 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 30 Group  0 Pid 634691 on    A100-04 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 31 Group  0 Pid 634691 on    A100-04 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 32 Group  0 Pid 642079 on    A100-05 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 33 Group  0 Pid 642079 on    A100-05 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 34 Group  0 Pid 642079 on    A100-05 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 35 Group  0 Pid 642079 on    A100-05 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 36 Group  0 Pid 642079 on    A100-05 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 37 Group  0 Pid 642079 on    A100-05 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 38 Group  0 Pid 642079 on    A100-05 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 39 Group  0 Pid 642079 on    A100-05 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 40 Group  0 Pid 1009533 on    A100-06 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 41 Group  0 Pid 1009533 on    A100-06 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 42 Group  0 Pid 1009533 on    A100-06 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 43 Group  0 Pid 1009533 on    A100-06 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 44 Group  0 Pid 1009533 on    A100-06 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 45 Group  0 Pid 1009533 on    A100-06 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 46 Group  0 Pid 1009533 on    A100-06 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 47 Group  0 Pid 1009533 on    A100-06 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 48 Group  0 Pid 369712 on    A100-07 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 49 Group  0 Pid 369712 on    A100-07 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 50 Group  0 Pid 369712 on    A100-07 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 51 Group  0 Pid 369712 on    A100-07 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 52 Group  0 Pid 369712 on    A100-07 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 53 Group  0 Pid 369712 on    A100-07 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 54 Group  0 Pid 369712 on    A100-07 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank 55 Group  0 Pid 369712 on    A100-07 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#  Rank 56 Group  0 Pid 390293 on    A100-08 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank 57 Group  0 Pid 390293 on    A100-08 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank 58 Group  0 Pid 390293 on    A100-08 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank 59 Group  0 Pid 390293 on    A100-08 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank 60 Group  0 Pid 390293 on    A100-08 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank 61 Group  0 Pid 390293 on    A100-08 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank 62 Group  0 Pid 390293 on    A100-08 device  6 [0xc0] NVIDIA A100-SXM4-80GB
#  Rank 63 Group  0 Pid 390293 on    A100-08 device  7 [0xc3] NVIDIA A100-SXM4-80GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1   3534.3    0.00    0.00      0   2704.1    0.00    0.00    N/A
           0             0     float    none      -1   3099.7    0.00    0.00      0   2125.0    0.00    0.00    N/A
           0             0     float    none      -1   3002.5    0.00    0.00      0   3008.2    0.00    0.00    N/A
           0             0     float    none      -1   4632.9    0.00    0.00      0   2210.1    0.00    0.00    N/A
           0             0     float    none      -1   2660.9    0.00    0.00      0   2651.2    0.00    0.00    N/A
         256             1     float    none      -1   3907.2    0.00    0.00      0   3129.9    0.00    0.00    N/A
         512             2     float    none      -1   2462.0    0.00    0.00      0   2431.6    0.00    0.00    N/A
        1024             4     float    none      -1   2811.1    0.00    0.00      0   3486.8    0.00    0.00    N/A
        2048             8     float    none      -1   2897.2    0.00    0.00      0   5039.0    0.00    0.00    N/A
        4096            16     float    none      -1   3429.5    0.00    0.00      0   3816.3    0.00    0.00    N/A
        8192            32     float    none      -1   2505.1    0.00    0.00      0   2878.3    0.00    0.00    N/A
       16384            64     float    none      -1   3662.6    0.00    0.00      0   3895.9    0.00    0.00    N/A
       32768           128     float    none      -1   2944.6    0.01    0.01      0   4222.1    0.01    0.01    N/A
       65536           256     float    none      -1   2668.4    0.02    0.02      0   3918.0    0.02    0.02    N/A
      131072           512     float    none      -1   4911.2    0.03    0.03      0   2682.6    0.05    0.05    N/A
      262144          1024     float    none      -1   3467.1    0.08    0.07      0   4439.0    0.06    0.06    N/A
      524288          2048     float    none      -1   3285.0    0.16    0.16      0   3533.2    0.15    0.15    N/A
     1048576          4096     float    none      -1   3316.6    0.32    0.31      0   3901.3    0.27    0.26    N/A
     2097152          8192     float    none      -1   6127.9    0.34    0.34      0   6246.9    0.34    0.33    N/A
     4194304         16384     float    none      -1   9784.5    0.43    0.42      0   6545.6    0.64    0.63    N/A
     8388608         32768     float    none      -1   8146.6    1.03    1.01      0   6597.8    1.27    1.25    N/A
    16777216         65536     float    none      -1    15541    1.08    1.06      0    10896    1.54    1.52    N/A
    33554432        131072     float    none      -1    56507    0.59    0.58      0    51233    0.65    0.64    N/A
    67108864        262144     float    none      -1    53106    1.26    1.24      0    49295    1.36    1.34    N/A
   134217728        524288     float    none      -1    76311    1.76    1.73      0    76799    1.75    1.72    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.299857
#
shijieliu commented 6 months ago

Thanks @jndinesh

The allreduce and all2all performance in NCCL is very bad compared with the expected performance over IB. This is a performance issue, but it may imply a functional issue as well, e.g. nccl_test not using IB at all. Could you try setting NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT with nccl_test and share the log? Thanks!

shijieliu commented 6 months ago

As suggested by NCCL, could you also check and remove any user limits on registering pinned memory? For example, if you are using Docker, try adding --shm-size=1g --ulimit memlock=-1 when starting the container.
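A quick way to confirm the limit actually in effect for the launched process, a minimal sketch using Python's standard resource module:

import resource

# RLIMIT_MEMLOCK is the per-process cap on pinned (locked) memory, which
# InfiniBand memory registration (ibv_reg_mr) relies on; RLIM_INFINITY
# corresponds to "memlock unlimited" in /etc/security/limits.conf.
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("memlock soft:", soft, "hard:", hard, "(unlimited =", resource.RLIM_INFINITY, ")")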

jndinesh commented 6 months ago

@shijieliu - Thank you for your quick response and suggestion. Your attention to detail is much appreciated. Please find the attached log results.txt, which contains the run with the environment variables you suggested passed to mpirun.

mpirun was executed on the host machine (bare-metal server), not in a container environment. However, I did observe RDMA traffic statistics while mpirun was running.

Please find the config on host machine: /etc/security/limits.conf

(image attachment: /etc/security/limits.conf settings)

Thanks, Dinesh

shijieliu commented 6 months ago

@jndinesh The reason for your bad nccl_test perf may be a wrong NUMA binding configuration in the nccl_test run. Could you try running nccl_test with mpirun -np 64 --bind-to numa --host A100-01:8,... (notice the :8) and removing -g 8 from the all_reduce_perf arguments? Let's see if this solves the perf issue.

And another thing I want to check is your card type. Could you share the output of lspci? Thanks!

jndinesh commented 6 months ago

Please find below the attached lspci.txt log and mpirun output.

Please note that instead of 64, it was configured to use 8 processes (mpirun -np 8).

Configuring with 64 instead of 8 will result in an error. For your reference, the log with that configuration is also attached.

mpirun -x LD_LIBRARY_PATH  -np 8 --bind-to numa --host A100-01:8  ./build/all_reduce_perf -b 8 -e 128M -f 2
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 932421 on    A100-01 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 932422 on    A100-01 device  1 [0x0a] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 932423 on    A100-01 device  2 [0x44] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 932424 on    A100-01 device  3 [0x4a] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 932425 on    A100-01 device  4 [0x84] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 932426 on    A100-01 device  5 [0x8a] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 932427 on    A100-01 device  6 [0xc1] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 932429 on    A100-01 device  7 [0xc4] NVIDIA A100-SXM4-80GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1   4450.3    0.00    0.00      0    137.8    0.00    0.00      0
          16             4     float     sum      -1    137.8    0.00    0.00      0    137.7    0.00    0.00      0
          32             8     float     sum      -1    137.9    0.00    0.00      0    138.0    0.00    0.00      0
          64            16     float     sum      -1    137.9    0.00    0.00      0    137.8    0.00    0.00      0
         128            32     float     sum      -1    138.3    0.00    0.00      0    138.3    0.00    0.00      0
         256            64     float     sum      -1    139.4    0.00    0.00      0    139.6    0.00    0.00      0
         512           128     float     sum      -1    139.5    0.00    0.01      0    139.8    0.00    0.01      0
        1024           256     float     sum      -1    140.9    0.01    0.01      0    140.6    0.01    0.01      0
        2048           512     float     sum      -1    141.0    0.01    0.03      0    141.4    0.01    0.03      0
        4096          1024     float     sum      -1    143.2    0.03    0.05      0    143.1    0.03    0.05      0
        8192          2048     float     sum      -1    146.1    0.06    0.10      0    144.5    0.06    0.10      0
       16384          4096     float     sum      -1    145.6    0.11    0.20      0    144.9    0.11    0.20      0
       32768          8192     float     sum      -1    150.8    0.22    0.38      0    149.7    0.22    0.38      0
       65536         16384     float     sum      -1    151.2    0.43    0.76      0    148.7    0.44    0.77      0
      131072         32768     float     sum      -1    149.8    0.87    1.53      0    147.8    0.89    1.55      0
      262144         65536     float     sum      -1    149.7    1.75    3.06      0    148.9    1.76    3.08      0
      524288        131072     float     sum      -1    154.5    3.39    5.94      0    154.8    3.39    5.93      0
     1048576        262144     float     sum      -1    158.6    6.61   11.57      0   3987.5    0.26    0.46      0
     2097152        524288     float     sum      -1    186.6   11.24   19.67      0    195.0   10.75   18.82      0
     4194304       1048576     float     sum      -1    207.8   20.19   35.33      0    207.6   20.21   35.36      0
     8388608       2097152     float     sum      -1   7664.4    1.09    1.92      0    376.6   22.27   38.98      0
    16777216       4194304     float     sum      -1   1514.5   11.08   19.39      0    591.8   28.35   49.61      0
    33554432       8388608     float     sum      -1   1302.6   25.76   45.08      0   2197.5   15.27   26.72      0
    67108864      16777216     float     sum      -1   3326.5   20.17   35.30      0   6675.0   10.05   17.59      0
   134217728      33554432     float     sum      -1    10766   12.47   21.82      0    16400    8.18   14.32      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.32216
#
mpirun  -x LD_LIBRARY_PATH  -np 64 --bind-to numa --host A100-01:8  ./build/all_reduce_perf -b 8 -e 128M -f 2
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 64
slots that were requested by the application:

  ./build/all_reduce_perf

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
shijieliu commented 6 months ago

After checking with @jndinesh and @rgandikota, the issue was solved by:

  1. setting --propagate=STACK when launching srun.
  2. reinstalling the environment.