Your question
I run pretrain_gpt with the same architecture, data, training hyperparameters, and hardware, with and without megatron_core when building the model.
I see clearly worse wall-clock time and memory usage with megatron_core:
| setting | wall-clock time per step (ms) | mem per GPU (GB) |
| --- | --- | --- |
| legacy | 630 | 45 |
| use_mcore | 690 | 63 |
Environment:

| hardware | torch version | cuda version |
| --- | --- | --- |
| A100-80G-PCIe x 4 | 2.1.2 | 12.2 |
For the data I use the c4_en dataset from Hugging Face and tokenize it with the GPT-2 tokenizer. I use the first 3.6e7 documents (the first 10%) for the experiments.
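For context, the tokenization step can be done with Megatron-LM's `tools/preprocess_data.py`. A minimal sketch, assuming the GPT2BPETokenizer interface of that tool (file paths here are placeholders, not my actual paths):

```shell
# Sketch: tokenize c4_en JSONL with the GPT-2 tokenizer via Megatron-LM's
# preprocessing tool. Paths and worker count are illustrative placeholders.
python tools/preprocess_data.py \
    --input c4_en.jsonl \
    --output-prefix c4_en_gpt2 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
```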
To Reproduce
megatron-lm commit hash: 9de386d08770d7296263a590171ace4ae45348ad
I customized a script based on pretrain_gpt_distributed.sh and renamed it pretrain_gpt_cli.sh.
To reproduce the experiment, please run the following bash command:
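The original script contents are not included above; the following is only a sketch of what such a launch script typically looks like at this commit. Model sizes, batch sizes, and data paths are placeholders, and the `--use-mcore-models` flag is my assumption for how the two settings were toggled:

```shell
#!/bin/bash
# Hypothetical pretrain_gpt_cli.sh sketch (not the author's actual script).
# All <...> values are placeholders; drop --use-mcore-models for the "legacy" run.
GPUS_PER_NODE=4

torchrun --nproc_per_node $GPUS_PER_NODE pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers <layers> --hidden-size <hidden> --num-attention-heads <heads> \
    --seq-length 2048 --max-position-embeddings 2048 \
    --micro-batch-size <mbs> --global-batch-size <gbs> \
    --data-path <tokenized c4_en prefix> \
    --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt \
    --use-mcore-models
```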
Is there any reason behind this?