EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Distributed training with model parallelism hangs with the recent PR #985

Closed absol13 closed 1 year ago

absol13 commented 1 year ago

Describe the bug Hello, I found that distributed training hangs when "model-parallel-size" is set greater than 1. This appeared after PR #958 was merged; it did not occur with older sources or with "model-parallel-size": 1 at all.

To Reproduce I share my config file below for reproduction.
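For clarity, the parallelism settings at issue look like the following fragment of a gpt-neox YAML config (an illustrative fragment only, not my full config):

  # parallelism settings (illustrative fragment, not the full reproduction config)
  "pipe-parallel-size": 1,
  "model-parallel-size": 2,   # the hang appears whenever this is greater than 1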

Expected behavior Training should proceed further.

Screenshots Specifically, training does not proceed at this point:

gpt-neox-train-dist-master: time (ms) | model and optimizer: 49512.12 | train/valid/test data iterators: 1488.60
gpt-neox-train-dist-master: training ...
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:553:forward] Activation Checkpointing Information
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:554:forward] ----Partition Activations True, CPU CHECKPOINTING False
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:557:forward] ----contiguous Memory Checkpointing False with 32 total layers
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:560:forward] ----Synchronization True
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:561:forward] ----Profiling time in checkpointing False

I also attach the output of the nvidia-smi command.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    91W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   37C    P0    99W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   39C    P0    97W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   36C    P0    93W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   40C    P0    93W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   39C    P0   105W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   37C    P0    90W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

During normal training, GPU power usage stays near its capacity; here it remains far below capacity, which suggests the training processes are hanging.
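If it helps, one way I can confirm where each rank is stuck (a hypothetical diagnostic on my side, not part of gpt-neox) is to register a faulthandler signal handler near the start of training and then dump Python stacks from a shell:

  import faulthandler
  import signal

  # Hypothetical diagnostic, not part of gpt-neox: let a suspected-hung rank
  # dump the Python stack of every thread when it receives SIGUSR1.
  faulthandler.register(signal.SIGUSR1, all_threads=True)

  # Then, from a shell on the node:  kill -USR1 <training process PID>
  # Each rank prints its current stacks to stderr, showing where it is blocked.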

StellaAthena commented 1 year ago

@honglu2875

honglu2875 commented 1 year ago

@absol13 Would #979 fix the issue? It was my oversight that I didn't realize data could be None. It is possible that it hangs because some process errored out due to this bug.
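To illustrate the failure mode I have in mind (a hypothetical sketch, not the actual gpt-neox code or the #979 patch): with model parallelism, some ranks can receive None instead of a batch, and an unguarded access on those ranks raises; that process dies while its peers keep waiting in a collective, which looks like a hang.

  # Hypothetical sketch, not the actual gpt-neox code or the #979 patch.
  def get_batch(data_iterator):
      # With model parallelism, only some ranks get a real batch;
      # the others may receive None from their iterator.
      data = next(data_iterator) if data_iterator is not None else None

      if data is None:
          # Without this guard, data["text"] below raises on the None ranks;
          # that rank exits while its peers block in a collective -> apparent hang.
          return None

      return data["text"]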

absol13 commented 1 year ago

@honglu2875 I don't know the exact reason. Anyway, I will try distributed training after #979 is merged and let you know whether this bug is solved. Thanks.

StellaAthena commented 1 year ago

I have reproduced this issue and confirmed that #979 does not fix it.

honglu2875 commented 1 year ago

I tried to set up neox again from an empty conda env and used my own config (pythia-1b), but made sure "model-parallel-size" was set greater than 1.

On the main branch, it errored out on the first training step instead of hanging. With #979 applied, it trains normally (I watched it for about 10 training steps).

I will take a look at the other items in the OP's config later, but I hope this helps narrow down the problem.

StellaAthena commented 1 year ago

After conferring with @honglu2875, we discovered that I had failed to apply the fix to both nodes. @absol13 It should now work for you on main.

absol13 commented 1 year ago

Now it works correctly. Thanks for your fast support.