EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Distributed training with model parallelism hangs with the recent PR #985

Closed absol13 closed 1 year ago

absol13 commented 1 year ago

Describe the bug Hello, I found that distributed training hangs when "model-parallel-size" is set greater than 1. This appeared after PR #958 was merged; it did not occur with older sources or with "model-parallel-size": 1 at all.

To Reproduce I share my config file below for reproduction.
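For clarity, the parallelism settings at issue look like the following fragment of a gpt-neox YAML config (an illustrative fragment only, not my full config):

  # parallelism settings (illustrative fragment, not the full reproduction config)
  "pipe-parallel-size": 1,
  "model-parallel-size": 2,   # the hang appears whenever this is greater than 1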

Expected behavior Training should proceed further.

Screenshots Specifically, training does not proceed at this point:

gpt-neox-train-dist-master: time (ms) | model and optimizer: 49512.12 | train/valid/test data iterators: 1488.60
gpt-neox-train-dist-master: training ...
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:553:forward] Activation Checkpointing Information
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:554:forward] ----Partition Activations True, CPU CHECKPOINTING False
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:557:forward] ----contiguous Memory Checkpointing False with 32 total layers
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:560:forward] ----Synchronization True
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:561:forward] ----Profiling time in checkpointing False

I also attach the output of the nvidia-smi command.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    91W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   37C    P0    99W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   39C    P0    97W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   36C    P0    93W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   40C    P0    93W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   39C    P0   105W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   37C    P0    90W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

During normal training, GPU power usage stays near its capacity; here it remains far below capacity, which suggests the training processes are hanging.
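If it helps, one way I can confirm where each rank is stuck (a hypothetical diagnostic on my side, not part of gpt-neox) is to register a faulthandler signal handler near the start of training and then dump Python stacks from a shell:

  import faulthandler
  import signal

  # Hypothetical diagnostic, not part of gpt-neox: let a suspected-hung rank
  # dump the Python stack of every thread when it receives SIGUSR1.
  faulthandler.register(signal.SIGUSR1, all_threads=True)

  # Then, from a shell on the node:  kill -USR1 <training process PID>
  # Each rank prints its current stacks to stderr, showing where it is blocked.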

StellaAthena commented 1 year ago

@honglu2875

honglu2875 commented 1 year ago

@absol13 Would #979 fix the issue? It was my oversight that I didn't realize data could be None. It is possible that it hangs because some process errored out due to this bug.
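To illustrate the failure mode I have in mind (a hypothetical sketch, not the actual gpt-neox code or the #979 patch): with model parallelism, some ranks can receive None instead of a batch, and an unguarded access on those ranks raises; that process dies while its peers keep waiting in a collective, which looks like a hang.

  # Hypothetical sketch, not the actual gpt-neox code or the #979 patch.
  def get_batch(data_iterator):
      # With model parallelism, only some ranks get a real batch;
      # the others may receive None from their iterator.
      data = next(data_iterator) if data_iterator is not None else None

      if data is None:
          # Without this guard, data["text"] below raises on the None ranks;
          # that rank exits while its peers block in a collective -> apparent hang.
          return None

      return data["text"]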

absol13 commented 1 year ago

@honglu2875 I don't know the exact reason. Anyway, I will try distributed training after #979 is merged and let you know whether this bug is solved. Thanks.

StellaAthena commented 1 year ago

I have reproduced this issue and confirmed that #979 does not fix it.

honglu2875 commented 1 year ago

I tried to set up neox again from an empty conda env and used my own config (pythia-1b), but made sure "model-parallel-size" was set greater than 1.

On the main branch, it errored out on the first training step instead of hanging. With #979 applied, it trains normally (I watched it for about 10 training steps).

I will take a look at the other items in the OP's config later, but I hope this helps narrow down the problem.

StellaAthena commented 1 year ago

After conferring with @honglu2875, we discovered that I had failed to apply the fix to both nodes. @absol13 It should now work for you on main.

absol13 commented 1 year ago

Now it works correctly. Thanks for your fast support.