AleHD closed this 1 year ago
Tested with llama2, tp=4, pp=1 on two 8x 80GB A100 nodes (dp=4)
Good question. It's actually about 8% faster under normal circumstances (tp=4, pp=1, dp=4, 2 nodes w/ 8x 80GB A100; micro=5, global=100). The previous build stabilizes around 12.8 sec/iter; removing torchscript brings that down to 11.8 sec/iter :)
Turns out that, for some reason, applying torch.jit to the GLU activation was the culprit. Removing it seems to fully fix the problem in my tests.
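For context, a minimal sketch of what this change looks like. This is not the actual patch; the function name `glu_activation` and the SwiGLU-style formulation are assumptions for illustration. The fix amounts to dropping the `@torch.jit.script` decorator and leaving the activation as a plain eager-mode function:

```python
import torch
import torch.nn.functional as F

# Before (hypothetical, the version suspected of causing the slowdown):
# @torch.jit.script
# def glu_activation(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
#     return F.silu(gate) * up

# After: identical math, but no TorchScript compilation involved.
def glu_activation(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # SwiGLU-style gating: silu(gate) elementwise-multiplied with the up projection
    return F.silu(gate) * up

gate = torch.randn(2, 8)
up = torch.randn(2, 8)
out = glu_activation(gate, up)
print(out.shape)
```

The numerical output is unchanged; only the TorchScript wrapper (and its apparent interaction with the distributed setup) is removed.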