CHTC / templates-GPUs

Template job submissions using GPUs in CHTC

Multigpu #24

Closed: jhiemstrawisc closed this 1 year ago

agitter commented 2 years ago

@b-kaufman will help review the PyTorch parts of the code so that @ChristinaLK or I can focus on the readme and HTCondor parts.

Ben, for a little context: you'll see that this example code was using Weights & Biases for logging, but we commented it out of the example. We don't necessarily want the CHTC example code to "officially" recommend W&B as a third-party logging or profiling solution because it hasn't been vetted thoroughly and may change in the future in ways CHTC is unaware of. For individual researchers it may be a good option, but that's up to them to decide independently.
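
For researchers who do want to try W&B on their own, one way to keep it optional is to gate the logging behind a command line flag so the script runs with or without it. This is only a rough sketch, not part of the template; the `--use-wandb` flag, project name, and metric names are hypothetical, and `wandb` must be installed and configured (e.g. an API key in the job environment) independently.

```python
import argparse

import wandb  # third-party package; install and vet independently

parser = argparse.ArgumentParser()
# hypothetical flag; the CHTC template itself keeps W&B commented out
parser.add_argument("--use-wandb", action="store_true")
args = parser.parse_args()

if args.use_wandb:
    wandb.init(project="multigpu-example")  # hypothetical project name

for epoch in range(10):
    train_loss, accuracy = 0.0, 0.0  # placeholders for real training results
    if args.use_wandb:
        wandb.log({"epoch": epoch, "loss": train_loss, "accuracy": accuracy})
```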

agitter commented 1 year ago

I tested this with 1 GPU and 2 GPUs. The jobs ran, but the 1 GPU job was still faster. I'm okay with that and don't plan to optimize the performance.

1 GPU:

Accuracy of the network after epoch 10 is: tensor(78.1000, device='cuda:0')
total DistributedDataParallel epochs time =  514.6743969917297

2 GPUs:

Accuracy of the network after epoch 10 is: tensor(79.9700, device='cuda:0')
total DistributedDataParallel epochs time =  620.7149896621704
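
For context, the time reported above wraps the full set of training epochs in the PyTorch DistributedDataParallel (DDP) example, with one worker process per GPU. A rough sketch of the pattern being benchmarked is below; this is illustrative rather than the template's exact script, and the placeholder model, data, and hyperparameters are made up.

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    # one process per GPU; NCCL is the usual backend for GPU collectives
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).to(rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    start = time.time()
    for epoch in range(10):
        # the real example iterates over a DataLoader whose DistributedSampler
        # gives each rank its own shard of the dataset
        loss = ddp_model(torch.randn(32, 10, device=rank)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if rank == 0:
        print("total DistributedDataParallel epochs time = ", time.time() - start)

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

With DDP, gradients are averaged across ranks after every backward pass, so adding GPUs adds communication overhead; for a small model and dataset that overhead can outweigh the extra compute, which is consistent with the 2 GPU run being slower here.
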
agitter commented 1 year ago

I did a fair amount of testing and confirmed the example runs across different GPU counts, GPU types, and batch sizes. Details below.

Timing and accuracy details

1 RTX 2080 Ti

Accuracy of the network after epoch 10 is: tensor(79.2600, device='cuda:0')
total DistributedDataParallel epochs time =  580.5027492046356

6 RTX 2080 Ti

Accuracy of the network after epoch 10 is: tensor(85.9500, device='cuda:0')
total DistributedDataParallel epochs time =  1067.5966954231262

1 A100

Accuracy of the network after epoch 10 is: tensor(81.0100, device='cuda:0')
total DistributedDataParallel epochs time =  227.14364576339722

2 A100

Accuracy of the network after epoch 10 is: tensor(81.9700, device='cuda:0')
total DistributedDataParallel epochs time =  230.76658964157104

3 A100

Accuracy of the network after epoch 10 is: tensor(83.4000, device='cuda:0')
total DistributedDataParallel epochs time =  235.09108924865723

3 A100 batch size 512

Accuracy of the network after epoch 10 is: tensor(73.7600, device='cuda:0')
total DistributedDataParallel epochs time =  229.12067866325378

3 A100 batch size 4096 (the poor accuracy!)

Accuracy of the network after epoch 10 is: tensor(33.4500, device='cuda:0')
total DistributedDataParallel epochs time =  361.6151912212372
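
One note on the batch size runs: with DistributedDataParallel each rank processes its own batch, so if 4096 is the per-GPU batch the effective global batch is 3 × 4096 = 12288 samples per optimizer step. With so few updates per epoch and an unchanged learning rate, the accuracy drop is not surprising; a common (though not universal) heuristic is to scale the learning rate with the global batch size. A back-of-the-envelope sketch with made-up baseline values:

```python
# rough arithmetic for the 3 x A100, batch size 4096 run above
gpus = 3
per_gpu_batch = 4096          # assuming the reported batch size is per GPU
base_batch = 128              # hypothetical single-GPU baseline batch size
base_lr = 0.001               # hypothetical baseline learning rate

global_batch = gpus * per_gpu_batch              # 12288 samples per step
scaled_lr = base_lr * global_batch / base_batch  # linear scaling rule of thumb
print(global_batch, scaled_lr)                   # 12288 0.096
```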

GPU utilization details

6 RTX 2080 Ti

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 44%   66C    P2   223W / 250W |    728MiB / 11264MiB |     71%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 35%   59C    P2   255W / 250W |   9880MiB / 11264MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:3D:00.0 Off |                  N/A |
| 32%   54C    P2   205W / 250W |   9880MiB / 11264MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:3E:00.0 Off |                  N/A |
| 31%   53C    P2   133W / 250W |   9880MiB / 11264MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  On   | 00000000:88:00.0 Off |                  N/A |
| 31%   52C    P2   157W / 250W |   9880MiB / 11264MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  On   | 00000000:89:00.0 Off |                  N/A |
| 29%   50C    P2   144W / 250W |   9880MiB / 11264MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  On   | 00000000:B1:00.0 Off |                  N/A |
| 29%   49C    P2   148W / 250W |   9880MiB / 11264MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  On   | 00000000:B2:00.0 Off |                  N/A |
| 29%   31C    P8    45W / 250W |      5MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     83733      C   gmx                               723MiB |
|    1   N/A  N/A     56182      C   ...envs/multigpu/bin/python3     9875MiB |
|    2   N/A  N/A     56183      C   ...envs/multigpu/bin/python3     9875MiB |
|    3   N/A  N/A     56184      C   ...envs/multigpu/bin/python3     9875MiB |
|    4   N/A  N/A     56185      C   ...envs/multigpu/bin/python3     9875MiB |
|    5   N/A  N/A     56186      C   ...envs/multigpu/bin/python3     9875MiB |
|    6   N/A  N/A     56187      C   ...envs/multigpu/bin/python3     9875MiB |
+-----------------------------------------------------------------------------+

3 A100

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   48C    P0   353W / 500W |   7488MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   45C    P0   308W / 500W |   7500MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   46C    P0   318W / 500W |   7500MiB / 81920MiB |     99%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    58W / 500W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2027845      C   ...envs/multigpu/bin/python3     7485MiB |
|    1   N/A  N/A   2027846      C   ...envs/multigpu/bin/python3     7497MiB |
|    2   N/A  N/A   2027847      C   ...envs/multigpu/bin/python3     7497MiB |
+-----------------------------------------------------------------------------+

3 A100 batch size 4096 (the < 100% utilization may not be representative since I only took one snapshot; the job is clearly making use of the GPU memory). A sketch for polling utilization over time follows the table.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   46C    P0   413W / 500W |  53928MiB / 81920MiB |     29%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   36C    P0    86W / 500W |  53940MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   37C    P0    85W / 500W |  53940MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   28C    P0    58W / 500W |      3MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2084456      C   ...envs/multigpu/bin/python3    53925MiB |
|    1   N/A  N/A   2084457      C   ...envs/multigpu/bin/python3    53937MiB |
|    2   N/A  N/A   2084458      C   ...envs/multigpu/bin/python3    53937MiB |
+-----------------------------------------------------------------------------+
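
Since a single snapshot can over- or under-state utilization, one low-effort way to get a trace over the whole job is to poll nvidia-smi periodically from a small helper. A minimal sketch; the query fields are standard nvidia-smi options, while the interval, sample count, and output file name are arbitrary choices.

```python
import subprocess
import time

# Append a utilization/memory sample for every visible GPU each 30 seconds.
# Run this alongside training (background process or wrapper script).
with open("gpu_utilization.csv", "a") as log:
    for _ in range(120):
        subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=timestamp,index,utilization.gpu,memory.used",
                "--format=csv,noheader",
            ],
            stdout=log,
            check=False,
        )
        time.sleep(30)
```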

Finally, a couple of jobs failed with errors related to downloading the dataset. This download happens in the PyTorch code:

WARNING: md5sum mismatch of tar archive
expected: e02b57a107a66a686bd57f122ee702da
     got: 363599244f296d6f939ce090e2ec0e02  -
Traceback (most recent call last):
  File "tarfile.py", line 543, in _read
OSError: Invalid data stream
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "entry_point.py", line 76, in <module>
  File "tarfile.py", line 2036, in extractall
  File "tarfile.py", line 2077, in extract
  File "tarfile.py", line 2150, in _extract_member
  File "tarfile.py", line 2199, in makefile
  File "tarfile.py", line 247, in copyfileobj
  File "tarfile.py", line 521, in read
  File "tarfile.py", line 545, in _read
tarfile.ReadError: invalid compressed data
[243571] Failed to execute script entry_point
Traceback (most recent call last):
  File "entry_point.py", line 69, in <module>
  File "concurrent/futures/process.py", line 726, in map
  File "concurrent/futures/_base.py", line 597, in map
  File "concurrent/futures/_base.py", line 597, in <listcomp>
  File "concurrent/futures/process.py", line 681, in submit
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
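
A simple way to make runs more robust against this kind of corrupted download is to verify the archive checksum and retry before extracting (or before spawning the worker pool). A minimal sketch, assuming the dataset URL and expected MD5 are known up front; the URL and file name below are placeholders, and only the expected checksum comes from the warning above.

```python
import hashlib
import time
import urllib.request

URL = "https://example.org/dataset.tar.gz"         # placeholder URL
DEST = "dataset.tar.gz"                            # placeholder file name
EXPECTED_MD5 = "e02b57a107a66a686bd57f122ee702da"  # expected checksum from the warning above


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 of a file without reading it all into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


for attempt in range(3):
    urllib.request.urlretrieve(URL, DEST)
    if md5sum(DEST) == EXPECTED_MD5:
        break
    time.sleep(10)  # brief pause before retrying a corrupted download
else:
    raise RuntimeError("dataset download failed checksum verification 3 times")
```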