I'm aiming to assess the impact of modifying low-level communication library hyperparameters on distributed machine learning training throughput, and have selected your DLRM implementation as a benchmark workload.
Curiously, before making any changes to the underlying communication libraries (i.e., with every hyperparameter at its default), my performance logs already show substantial variance across repeated runs. The spread is roughly two orders of magnitude, with iteration times sometimes around 20 ms and other times reaching 1500 ms, all with num-batches=50.
To identify the source of these fluctuations, I have tried varying the DDP launch method (both mpirun and torchrun), altering DDP-related parameters (num-workers, mlperf), changing the number of participating nodes (from 1 to 4), restarting machines and processes, and switching communication ports. None of this has isolated the cause.
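One check I still plan to run is a bare all_reduce microbenchmark, to separate communication-layer behavior from the DLRM workload itself. A minimal sketch of what I have in mind is below; the buffer size, iteration counts, and the torchrun/env-var rendezvous are my own assumptions, not anything from the DLRM repo:

```python
# allreduce_bench.py -- minimal NCCL all_reduce timing sketch.
# Launch with torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set automatically.
import os
import time

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # env:// rendezvous from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.randn(64 * 1024 * 1024 // 4, device="cuda")  # ~64 MB fp32 buffer

    # Warm up so one-time NCCL communicator setup doesn't pollute timings.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    times = []
    for _ in range(50):
        t0 = time.perf_counter()
        dist.all_reduce(x)
        torch.cuda.synchronize()  # the collective is async w.r.t. the host
        times.append((time.perf_counter() - t0) * 1000)

    if dist.get_rank() == 0:
        print(f"all_reduce avg {sum(times)/len(times):.2f} ms, "
              f"min {min(times):.2f} ms, max {max(times):.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the same order-of-magnitude spread shows up here, the issue lives below DDP; if this is stable, suspicion shifts to the data pipeline or the model code.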
My system specifications:
- OS: Ubuntu 22.04.1 LTS (codename jammy), kernel 5.15.0-105-generic
- Python: 3.8 (miniconda)
- PyTorch: 2.3.0
GPU and CUDA/NCCL details are as follows:
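For exact reproduction, the same details can be dumped with a short snippet (all standard torch APIs):

```python
# Dump the exact software/hardware stack for reproducibility.
import torch

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```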
For reproducibility, I set num-batches=10 and consistently observed large performance differences across four repeated runs, as follows:
Run 1 (average ~1320ms)
Finished training it 1/10 of epoch 0, 0/1=1776.15 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1243.90 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1314.72 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1238.35 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1257.99 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1315.04 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1293.88 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1244.26 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1277.02 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1242.58 ms/it, loss 0.693220
Run 2 (average ~1073ms)
Finished training it 1/10 of epoch 0, 0/1=1575.05 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=1076.93 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=1116.18 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=1094.06 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=1098.16 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=1071.85 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=1128.99 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=1127.59 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=1106.33 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=1093.37 ms/it, loss 0.693220
Run 3 (average ~351ms)
Finished training it 1/10 of epoch 0, 0/1=804.80 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=297.06 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=295.61 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=312.81 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=284.89 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=316.30 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=266.81 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=316.45 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=310.37 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=313.58 ms/it, loss 0.693220
Run 4 (average ~155.54ms)
Finished training it 1/10 of epoch 0, 0/1=608.88 ms/it, loss 1.159998
Finished training it 2/10 of epoch 0, 0/1=122.61 ms/it, loss 0.896872
Finished training it 3/10 of epoch 0, 0/1=112.35 ms/it, loss 0.844668
Finished training it 4/10 of epoch 0, 0/1=76.45 ms/it, loss 0.690382
Finished training it 5/10 of epoch 0, 0/1=77.42 ms/it, loss 0.694125
Finished training it 6/10 of epoch 0, 0/1=110.31 ms/it, loss 0.693931
Finished training it 7/10 of epoch 0, 0/1=129.44 ms/it, loss 0.692185
Finished training it 8/10 of epoch 0, 0/1=62.11 ms/it, loss 0.693441
Finished training it 9/10 of epoch 0, 0/1=134.10 ms/it, loss 0.696777
Finished training it 10/10 of epoch 0, 0/1=121.83 ms/it, loss 0.693220
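In case the script's own ms/it accounting is itself part of the problem, I also intend to cross-check it with explicit CUDA synchronization around each step, roughly like this (a sketch; `train_step` and `batches` are placeholders for the DLRM training loop, not names from the repo):

```python
# Cross-check per-iteration time with explicit GPU synchronization.
# CUDA kernels launch asynchronously, so host-side timestamps taken
# without a synchronize() can mis-attribute time between iterations.
import time
import torch

def timed_steps(train_step, batches):
    times = []
    for batch in batches:
        torch.cuda.synchronize()          # drain prior GPU work
        t0 = time.perf_counter()
        train_step(batch)                 # fwd + bwd + optimizer step
        torch.cuda.synchronize()          # wait for this step to finish
        times.append((time.perf_counter() - t0) * 1000)
    return times
```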
I would greatly appreciate any insights you may have on what could be causing these performance inconsistencies. Ensuring a stable baseline is crucial before I proceed with tweaking communication library hyperparameters.
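I am happy to collect more diagnostics. For instance, I can re-run with NCCL's debug logging enabled (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables) to see which transports and algorithms each run actually picks:

```python
# Enable NCCL debug logging *before* the process group is created, to see
# which transports (NVLink / PCIe / InfiniBand / sockets) each run selects.
# Assumes launch via torchrun so the env:// rendezvous variables are set.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # init + network selection

import torch.distributed as dist
dist.init_process_group(backend="nccl")
```

If one run logs an InfiniBand transport while another falls back to plain sockets, that alone could explain an order-of-magnitude gap between runs.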
Thank you for your time and support. ❤