There are several strange things in your config file. Can you tell us a bit more about what you're trying to do? Specifically:
- You probably shouldn't ever be using an MBS (micro batch size) of 100.
- You're using MP = 2 and PP = 2, but your model is small enough that it should be trainable without either. You can fit the entire model on a single GPU and then do pure data parallelism (see the sketch after this list).
- Is it correct that you have two nodes each with two A100s, and not four A100s on a single node?
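To make the second point concrete, here is a sketch of the arithmetic (the MP/PP values are the ones from the original config, and the 4-GPU count is from the environment description below):

```yaml
  # with the original settings on 4 GPUs:
  #   data_parallel_size = world_size / (model-parallel-size * pipe-parallel-size)
  #                      = 4 / (2 * 2)
  #                      = 1
  # i.e. all 4 GPUs go to splitting up a model that already fits on a single
  # A100 80GB, and none are left over for data parallelism.
  "model-parallel-size": 2,
  "pipe-parallel-size": 2,
```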
It seems plausible that we might have a data loader bug that only appears above an MBS of 100, because we've probably never tested it that large. However, our data loader is largely the same as Megatron-DeepSpeed's… have you tried that codebase to see if it works with an MBS of 100?
This issue should have been fixed in https://github.com/EleutherAI/gpt-neox/pull/835
@cateto -- Following up on @StellaAthena's questions, please try setting `model-parallel-size` and `pipe-parallel-size` to 1, which will improve performance but force you to reduce the batch size. Also, please share your commit hash here so that we can investigate further if the above PR didn't fix this issue properly.
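Concretely, that change would look something like the following in the config (just a sketch of the keys to change; the micro batch value is only illustrative, not a recommendation):

```yaml
  # pure data parallelism: every GPU holds a full replica of the model and
  # data_parallel_size becomes 4 / (1 * 1) = 4
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,
  # with the whole model (plus its activations) on each GPU, the per-GPU
  # micro batch size will likely need to come down from 100
  "train_micro_batch_size_per_gpu": 50,   # illustrative value, tune to fit memory
```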
@cateto hey, any updates on this?
Solved it! Thank you. It works with 2 nodes (2 * A100 80GB each) and the following config:
```yaml
# GPT-2 pretraining setup
{
  # parallelism settings (you will want to change these based on your cluster setup,
  # ideally scheduling pipeline stages across the node boundaries)
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,
  "num_nodes": 2,

  # model settings
  "num-layers": 12,
  "hidden-size": 768,
  "num-attention-heads": 12,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "layernorm",
  "pos-emb": "rotary",
  "no-weight-tying": true,
  "gpt_j_residual": false,
  "output_layer_parallelism": "column",

  # these should provide some speedup but take a while to build, set to true if desired
  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  # init methods
  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  # optimizer settings
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0006,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8,
    }
  },
  "min_lr": 0.00006,

  # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
  },

  # batch / data settings
  "train_micro_batch_size_per_gpu": 55,
  "gradient_accumulation_steps": 2,
  "data-impl": "mmap",
  #"split": "949,50,1",

  # activation checkpointing
  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  # regularization
  "gradient_clipping": 1.0,
  "weight-decay": 0.1,
  "hidden-dropout": 0.0,
  "attention-dropout": 0.0,

  # precision settings
  "fp16": {
    "fp16": true,
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  # misc. training settings
  "train-iters": 100000,
  "lr-decay-iters": 100000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 2500,
  "eval-interval": 1000,
  "eval-iters": 10,

  # logging
  "log-interval": 100,
  "steps_per_print": 10,
  "keep-last-n-checkpoints": 450,
  "wall_clock_breakdown": true,
}
```
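For what it's worth, a quick sanity check of the effective batch implied by these settings (an inference from the config rather than something stated in the thread: with model and pipeline parallelism both 1, all 4 GPUs form the data-parallel group):

```yaml
  # effective global batch per optimizer step:
  #   train_micro_batch_size_per_gpu * gradient_accumulation_steps * data_parallel_size
  #   = 55 * 2 * 4
  #   = 440 sequences of 2048 tokens
  "train_micro_batch_size_per_gpu": 55,
  "gradient_accumulation_steps": 2,
```

The file can then be launched with gpt-neox's usual `python ./deepy.py train.py <your_config>.yml` entry point (the path here is only a placeholder).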
**Describe the bug**
When I train this model with my config and `train_micro_batch_size_per_gpu = 100`, a runtime error is raised. If I set `train_micro_batch_size_per_gpu` below 100, it works, but I want to use the full GPU memory! Please let me know.

**Environment (please complete the following information):**
- GPUs: A100 80GB * 4