facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

Significant Variability in DLRM Benchmark Performance Metrics #383

Open · SweeneyJun opened this issue 3 months ago

I'm aiming to assess the impact of modifying low-level communication library hyperparameters on distributed machine learning training throughput, and have selected your DLRM implementation as a benchmark workload.

Curiously, before making any changes to the underlying communication libraries (i.e., with all hyperparameters at their defaults), my performance logs already show substantial variance across test runs. The gap spans roughly two orders of magnitude, with iteration times sometimes around 20 ms and other times around 1500 ms, all with --num-batches=50.

To identify the source of these fluctuations, I have tried varying the DDP launch method (both mpirun and torchrun), changing script parameters (num-workers, mlperf), varying the number of participating nodes (from 1 to 4), restarting machines and processes, and changing communication ports. None of this has revealed the cause.
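To check whether the raw collective itself shows the same run-to-run variance, I would also time a bare NCCL all-reduce in isolation. Below is a minimal sketch (my own test script, not part of dlrm_s_pytorch.py; the tensor size and iteration counts are arbitrary placeholders) that would be launched with the same torchrun arguments as the training run:

```python
# allreduce_bench.py -- minimal all-reduce timing sketch (not part of DLRM).
# Launch with the same torchrun flags as the training job, e.g.
#   torchrun --nproc_per_node=4 --nnodes=2 --node_rank=... allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")          # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.ones(16 * 1024 * 1024, device="cuda")  # ~64 MB of float32 (placeholder size)

    for _ in range(5):                               # warm-up: NCCL communicator setup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    per_iter_ms = (time.perf_counter() - t0) / iters * 1e3

    if dist.get_rank() == 0:
        print(f"all_reduce of {x.numel() * 4 / 1e6:.0f} MB: {per_iter_ms:.2f} ms/iter")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this bare collective is stable while the full DLRM run is not, the variance would more likely sit elsewhere in the training step (e.g., embedding lookups or data generation) than in the all-reduce itself.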

My system specifications:

- Ubuntu 22.04.1 LTS (jammy), kernel 5.15.0-105-generic
- Miniconda, Python 3.8
- torch 2.3.0
- GPU and CUDA/NCCL details as follows:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           Off |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0             36W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           Off |   00000000:86:00.0 Off |                    0 |
| N/A   31C    P0             38W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-PCIE-16GB           Off |   00000000:AF:00.0 Off |                    0 |
| N/A   31C    P0             37W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-PCIE-16GB           Off |   00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0             38W /  250W |       0MiB /  16384MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
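The PyTorch-side CUDA/NCCL build details can also be confirmed from inside the conda environment with a short check like this (it simply prints whatever the environment reports):

```python
# Quick environment check inside the training conda env.
import torch

print("torch:", torch.__version__)                 # expected 2.3.0 here
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())          # (major, minor, patch) tuple
print("visible GPUs:", torch.cuda.device_count())
```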

Here is the command used to start the training script on the master node:

NCCL_SOCKET_IFNAME=custom LD_LIBRARY_PATH=/root/openmpi/lib:/root/miniconda3/lib:/usr/local/cuda-12.4/lib64:/root/miniconda3/envs/lla/lib PATH=/root/miniconda3/bin:/root/miniconda3/condabin:/root/openmpi/ /root/miniconda3/envs/lla/bin/torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=192.168.245.165 \
    --master_port=1234 \
    /root/dlrm/dlrm_s_pytorch.py \
    --dist-backend nccl \
    --arch-embedding-size 1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size 64 --arch-mlp-bot 512-512-64 \
    --arch-mlp-top 1024-1024-1024-1 \
    --max-ind-range 40000000 --data-generation random --loss-function bce --round-targets True --learning-rate 0.1 \
    --mini-batch-size 2048 --print-freq 1 --print-time --test-freq 0 --test-mini-batch-size 2048 \
    --use-gpu --num-batches 10  > 0520_4.txt 2>&1 &
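The ms/it figures below come from the script's own --print-time output. For reference, my understanding of how such per-iteration numbers should be measured (so that asynchronous CUDA execution is not attributed to the wrong iteration) is roughly the following sketch; the names are placeholders, not code from dlrm_s_pytorch.py:

```python
# Sketch of per-iteration timing with explicit CUDA synchronization.
# train_step() and batches are placeholders, not names from dlrm_s_pytorch.py.
import time
import torch

def timed_iterations(train_step, batches):
    times_ms = []
    for batch in batches:
        torch.cuda.synchronize()                  # flush GPU work from the previous iteration
        t0 = time.perf_counter()
        train_step(batch)                         # forward + backward + optimizer step
        torch.cuda.synchronize()                  # wait for this iteration's kernels to finish
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return times_ms
```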

For reproducibility, I set --num-batches to 10 and consistently observed large performance differences across four repeated runs, shown below:

Run 1 (average ~1320ms)

Finished training it 1/10 of epoch 0, 0/1=1776.15 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1243.90 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1314.72 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1238.35 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1257.99 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1315.04 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1293.88 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1244.26 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1277.02 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1242.58 ms/it, loss 0.693220

Run 2 (average ~1149ms)

Finished training it 1/10 of epoch 0, 0/1=1575.05 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1076.93 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1116.18 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1094.06 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1098.16 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1071.85 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1128.99 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1127.59 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1106.33 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1093.37 ms/it, loss 0.693220

Run 3 (average ~351ms)

Finished training it 1/10 of epoch 0, 0/1=804.80 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=297.06 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=295.61 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=312.81 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=284.89 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=316.30 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=266.81 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=316.45 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=310.37 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=313.58 ms/it, loss 0.693220

Run 4 (average ~156ms)

Finished training it 1/10 of epoch 0, 0/1=608.88 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=122.61 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=112.35 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=76.45 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=77.42 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=110.31 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=129.44 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=62.11 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=134.10 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=121.83 ms/it, loss 0.693220
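For reference, the per-run averages above can be recomputed from the log files with a short script along these lines (it just parses the ms/it values; the first iteration can optionally be dropped as warm-up):

```python
# Sketch: summarize "ms/it" values from a DLRM training log such as 0520_4.txt.
import re
import statistics
import sys

def summarize(path, skip_first=False):
    with open(path) as f:
        times = [float(m.group(1)) for m in re.finditer(r"=(\d+\.\d+) ms/it", f.read())]
    if skip_first and len(times) > 1:
        times = times[1:]                 # drop iteration 1 (NCCL/cuDNN warm-up)
    return statistics.mean(times), statistics.pstdev(times)

if __name__ == "__main__":
    mean, std = summarize(sys.argv[1])    # e.g. python summarize_log.py 0520_4.txt
    print(f"mean {mean:.2f} ms/it, std {std:.2f} ms/it")
```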

I would greatly appreciate any insight into what could be causing these performance inconsistencies. A stable baseline is essential before I proceed with tuning communication library hyperparameters.

Thank you for your time and support. ❤