epfLLM / Megatron-LLM

distributed trainer for LLMs

iteration-time increases linearly when micro_batch_size=1 #60

Closed. LlinWing closed this issue 1 year ago.

LlinWing commented 1 year ago

I reported this issue before, in issue #22.

After extensive investigation, I finally found that this issue only occurs when micro_batch_size is set to 1, so I decided to open a new issue to emphasize this point.

I believe you can reproduce this issue: I pulled your latest code and ran it with micro_batch_size=1, and the problem still persisted. It returned to normal after setting micro_batch_size to 2. (Both experiments used tp=2 and pp=4 on 8 × A100 40GB.)
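
For context, the two settings differ in how many microbatches the pipeline schedule processes per optimizer step. The sketch below spells out that accounting for the reported setup; it is a generic restatement of Megatron-style batch bookkeeping, not this repo's code, and the global batch size of 64 is an illustrative assumption (the report does not state it).

```python
# Minimal sketch of the batch bookkeeping implied by the settings above.
# Assumptions: standard Megatron-style accounting; global_batch_size=64 is
# illustrative only, since the report does not state it.

def microbatches_per_iteration(global_batch_size, micro_batch_size, world_size, tp, pp):
    """Number of microbatches each pipeline pushes through per optimizer step."""
    data_parallel_size = world_size // (tp * pp)
    samples_per_dp_rank = global_batch_size // data_parallel_size
    return samples_per_dp_rank // micro_batch_size

# Reported setup: 8 x A100, tp=2, pp=4 -> data_parallel_size = 1.
for mbs in (1, 2):
    n = microbatches_per_iteration(global_batch_size=64, micro_batch_size=mbs,
                                   world_size=8, tp=2, pp=4)
    print(f"micro_batch_size={mbs}: {n} microbatches per iteration")
# micro_batch_size=1 -> 64 microbatches, micro_batch_size=2 -> 32, so the pipeline
# schedule (and any per-microbatch bookkeeping) runs twice as often per iteration.
```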

martinjaggi commented 1 year ago

thanks, that's very interesting, and might explain why we never ran into the problem ourselves during the real training runs. will investigate more.

in the meantime, all models seem to train fine with the configs we recommend in the docs.