I tested this with 1 GPU and 2 GPUs. The jobs ran, but the 1 GPU job was still faster. I'm okay with that and don't plan to optimize the performance.
1 GPU:
Accuracy of the network after epoch 10 is: tensor(78.1000, device='cuda:0')
total DistributedDataParallel epochs time = 514.6743969917297
2 GPUs:
Accuracy of the network after epoch 10 is: tensor(79.9700, device='cuda:0')
total DistributedDataParallel epochs time = 620.7149896621704
I did a fair amount of testing and confirmed:
Timing and accuracy details:
1 RTX 2080 Ti
Accuracy of the network after epoch 10 is: tensor(79.2600, device='cuda:0')
total DistributedDataParallel epochs time = 580.5027492046356
6 RTX 2080 Ti
Accuracy of the network after epoch 10 is: tensor(85.9500, device='cuda:0')
total DistributedDataParallel epochs time = 1067.5966954231262
1 A100
Accuracy of the network after epoch 10 is: tensor(81.0100, device='cuda:0')
total DistributedDataParallel epochs time = 227.14364576339722
2 A100
Accuracy of the network after epoch 10 is: tensor(81.9700, device='cuda:0')
total DistributedDataParallel epochs time = 230.76658964157104
3 A100
Accuracy of the network after epoch 10 is: tensor(83.4000, device='cuda:0')
total DistributedDataParallel epochs time = 235.09108924865723
3 A100 batch size 512
Accuracy of the network after epoch 10 is: tensor(73.7600, device='cuda:0')
total DistributedDataParallel epochs time = 229.12067866325378
3 A100 batch size 4096 (the poor accuracy!)
Accuracy of the network after epoch 10 is: tensor(33.4500, device='cuda:0')
total DistributedDataParallel epochs time = 361.6151912212372
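To make the scaling easier to read, here is a minimal sketch that turns the default-batch-size timings above into speedup and parallel efficiency relative to the 1-GPU run on the same hardware. The helper name `speedup_and_efficiency` is mine, not something from the example code, and the calculation assumes every run did the same total work (10 epochs over the same dataset). It ignores the accuracy differences between runs.

```python
# Wall-clock times in seconds, copied from the results above
# (default batch size runs only).
times = {
    "RTX 2080 Ti": {1: 580.5027492046356, 6: 1067.5966954231262},
    "A100": {1: 227.14364576339722, 2: 230.76658964157104, 3: 235.09108924865723},
}


def speedup_and_efficiency(baseline_s, multi_s, n_gpus):
    """Return (speedup, parallel efficiency) relative to the 1-GPU baseline."""
    speedup = baseline_s / multi_s
    return speedup, speedup / n_gpus


for gpu, runs in times.items():
    baseline = runs[1]
    for n_gpus, seconds in sorted(runs.items()):
        speedup, efficiency = speedup_and_efficiency(baseline, seconds, n_gpus)
        print(f"{n_gpus} x {gpu}: {seconds:7.1f} s, "
              f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

A speedup below 1.0 means the multi-GPU run took longer than the single-GPU run, which matches what the raw times show.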
GPU utilization details:
6 RTX 2080 Ti
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:1A:00.0 Off | N/A |
| 44% 66C P2 223W / 250W | 728MiB / 11264MiB | 71% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:1B:00.0 Off | N/A |
| 35% 59C P2 255W / 250W | 9880MiB / 11264MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... On | 00000000:3D:00.0 Off | N/A |
| 32% 54C P2 205W / 250W | 9880MiB / 11264MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... On | 00000000:3E:00.0 Off | N/A |
| 31% 53C P2 133W / 250W | 9880MiB / 11264MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... On | 00000000:88:00.0 Off | N/A |
| 31% 52C P2 157W / 250W | 9880MiB / 11264MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... On | 00000000:89:00.0 Off | N/A |
| 29% 50C P2 144W / 250W | 9880MiB / 11264MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce ... On | 00000000:B1:00.0 Off | N/A |
| 29% 49C P2 148W / 250W | 9880MiB / 11264MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce ... On | 00000000:B2:00.0 Off | N/A |
| 29% 31C P8 45W / 250W | 5MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 83733 C gmx 723MiB |
| 1 N/A N/A 56182 C ...envs/multigpu/bin/python3 9875MiB |
| 2 N/A N/A 56183 C ...envs/multigpu/bin/python3 9875MiB |
| 3 N/A N/A 56184 C ...envs/multigpu/bin/python3 9875MiB |
| 4 N/A N/A 56185 C ...envs/multigpu/bin/python3 9875MiB |
| 5 N/A N/A 56186 C ...envs/multigpu/bin/python3 9875MiB |
| 6 N/A N/A 56187 C ...envs/multigpu/bin/python3 9875MiB |
+-----------------------------------------------------------------------------+
3 A100
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:01:00.0 Off | 0 |
| N/A 48C P0 353W / 500W | 7488MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 45C P0 308W / 500W | 7500MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:81:00.0 Off | 0 |
| N/A 46C P0 318W / 500W | 7500MiB / 81920MiB | 99% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:C1:00.0 Off | 0 |
| N/A 27C P0 58W / 500W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2027845 C ...envs/multigpu/bin/python3 7485MiB |
| 1 N/A N/A 2027846 C ...envs/multigpu/bin/python3 7497MiB |
| 2 N/A N/A 2027847 C ...envs/multigpu/bin/python3 7497MiB |
+-----------------------------------------------------------------------------+
3 A100 batch size 4096 (the < 100% utilization on GPU 0 may not be representative because I only took one snapshot; the job is making use of the GPU memory)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:01:00.0 Off | 0 |
| N/A 46C P0 413W / 500W | 53928MiB / 81920MiB | 29% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:41:00.0 Off | 0 |
| N/A 36C P0 86W / 500W | 53940MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:81:00.0 Off | 0 |
| N/A 37C P0 85W / 500W | 53940MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:C1:00.0 Off | 0 |
| N/A 28C P0 58W / 500W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2084456 C ...envs/multigpu/bin/python3 53925MiB |
| 1 N/A N/A 2084457 C ...envs/multigpu/bin/python3 53937MiB |
| 2 N/A N/A 2084458 C ...envs/multigpu/bin/python3 53937MiB |
+-----------------------------------------------------------------------------+
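Since the last snapshot is a single reading, one way to get a more representative number would be to poll `nvidia-smi` in CSV mode for a while and average the results. This is only a sketch: the function names, sample count, and interval are my own choices, and it was not used to produce the numbers above.

```python
import subprocess
import time

# nvidia-smi's CSV query mode prints one "index, utilization, memory" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util, mem' lines into {gpu_index: (util_pct, mem_mib)}."""
    readings = {}
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        readings[int(index)] = (int(util), int(mem))
    return readings


def mean_utilization(samples):
    """Average the utilization per GPU over a list of parse_gpu_csv() results."""
    per_gpu = {}
    for reading in samples:
        for index, (util, _mem) in reading.items():
            per_gpu.setdefault(index, []).append(util)
    return {index: sum(values) / len(values) for index, values in per_gpu.items()}


def sample_gpus(n_samples=30, interval_s=2.0):
    """Poll nvidia-smi n_samples times, sleeping interval_s between polls."""
    samples = []
    for _ in range(n_samples):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        samples.append(parse_gpu_csv(result.stdout))
        time.sleep(interval_s)
    return samples
```

Running `mean_utilization(sample_gpus())` on the execute node while the job is training would give a per-GPU average instead of one point in time.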
Finally, a couple of jobs failed with errors related to downloading the dataset. This download happens in the PyTorch code:
WARNING: md5sum mismatch of tar archive
expected: e02b57a107a66a686bd57f122ee702da
got: 363599244f296d6f939ce090e2ec0e02 -
Traceback (most recent call last):
File "tarfile.py", line 543, in _read
OSError: Invalid data stream
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "entry_point.py", line 76, in <module>
File "tarfile.py", line 2036, in extractall
File "tarfile.py", line 2077, in extract
File "tarfile.py", line 2150, in _extract_member
File "tarfile.py", line 2199, in makefile
File "tarfile.py", line 247, in copyfileobj
File "tarfile.py", line 521, in read
File "tarfile.py", line 545, in _read
tarfile.ReadError: invalid compressed data
[243571] Failed to execute script entry_point
Traceback (most recent call last):
File "entry_point.py", line 69, in <module>
File "concurrent/futures/process.py", line 726, in map
File "concurrent/futures/_base.py", line 597, in map
File "concurrent/futures/_base.py", line 597, in <listcomp>
File "concurrent/futures/process.py", line 681, in submit
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
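One possible mitigation for the first failure would be to check the archive's MD5 before calling `extractall`, so a bad download fails with a clear message (or triggers a retry) instead of a `tarfile.ReadError` partway through extraction. The sketch below uses only the standard library. `md5_of_file` and `extract_if_valid` are hypothetical helpers, not functions from `entry_point.py` or PyTorch.

```python
import hashlib
import tarfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_if_valid(archive_path, expected_md5, dest_dir):
    """Extract the archive only if its checksum matches; return True on success."""
    actual = md5_of_file(archive_path)
    if actual != expected_md5:
        # The caller can decide whether to re-download or abort the job.
        print(f"md5 mismatch: expected {expected_md5}, got {actual}")
        return False
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    return True
```

A caller could wrap the download and `extract_if_valid` in a short retry loop, or the dataset could be transferred with the job rather than downloaded on the execute node.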
@b-kaufman will help review the PyTorch parts of the code so that @ChristinaLK or I can focus on the readme and HTCondor parts.
Ben, for a little context: you'll see that this example code was using Weights & Biases for logging, but we commented it out of the example. We don't necessarily want the CHTC example code to "officially" recommend W&B as a third-party logging or profiling solution because it hasn't been thoroughly vetted and may change in the future in ways that CHTC is unaware of. For individual researchers it may be a good option, but that's up to them to decide independently.