Remove delay_scale_loss and release_grads for llama-2 13B's benchmark.

PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

Apache License 2.0


PR types

Others

PR changes

Others

Description

| Model | Training strategy | Branch | Training throughput | max memory reserved (from logs) |
|---|---|---|---|---|
| Llama-2 13B | pp4sharding8-vpp5-mbs1-acc4 | develop | 1991.236 | 48.738 |
| Llama-2 13B | pp4sharding8-vpp5-mbs1-acc4 | release_grads removed | 2037.899 (+2.34%) | 53.602 |
| Llama-2 13B | pp4sharding8-vpp5-mbs1-acc4 | delay_scale_loss removed | 2051.128 (+0.65%) | 53.602 |
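The percentages in the table can be rechecked with a few lines of Python. This is only a sketch over the three throughput values reported above. The arithmetic suggests that each percentage is relative to the row directly before it, so the combined gain over `develop` is computed separately. The dictionary keys and the helper name `pct_gain` are illustrative and do not come from the PR.

```python
# Throughput values copied from the table (units as reported in the training logs).
throughput = {
    "develop": 1991.236,
    "release_grads removed": 2037.899,
    "delay_scale_loss removed": 2051.128,
}


def pct_gain(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Gain of each row over the row before it.
step1 = pct_gain(throughput["release_grads removed"], throughput["develop"])
step2 = pct_gain(
    throughput["delay_scale_loss removed"], throughput["release_grads removed"]
)
# Combined gain of both removals over the develop baseline.
total = pct_gain(throughput["delay_scale_loss removed"], throughput["develop"])

print(f"release_grads removed:    +{step1:.2f}%")
print(f"delay_scale_loss removed: +{step2:.2f}%")
print(f"combined vs develop:      +{total:.2f}%")
```

Under that reading, the two removals together give roughly a 3% throughput gain over `develop`, at the cost of the higher reserved memory shown in the last column.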

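Notes on the Llama-2 13B performance improvement:

The release_grads strategy lowers peak GPU memory usage, but it frees the memory held by the gradients at the end of every training step and then re-allocates and re-initializes it in the next step, which adds overhead. The Llama-2 13B model does not fill the available GPU memory, so this option can be removed.
The delay_scale_loss strategy is meant to improve convergence. On the one hand, the competing product used for comparison does not use this strategy. On the other hand, it introduces a device synchronization, which hurts the overlap of the sharding allgather. https://github.com/PaddlePaddle/PaddleNLP/blob/439f8f33950c2acc38fe5c2bfc79d3a4a848ab34/paddlenlp/trainer/trainer.py#L1112-L1120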
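As a rough illustration of the kind of change described here, the sketch below strips two options from a space-separated option string. It makes several assumptions: that the benchmark config stores these options as space-separated strings, and that the option names are `enable_release_grads` and `enable_delay_scale_loss`. Neither is confirmed by this description. The helper `drop_options` and the placeholder `some_other_option` are hypothetical.

```python
from typing import Iterable


def drop_options(config: str, to_drop: Iterable[str]) -> str:
    """Return `config` with the named options removed, preserving order.

    `config` is assumed to be a space-separated list of option names.
    """
    unwanted = set(to_drop)
    return " ".join(opt for opt in config.split() if opt not in unwanted)


# Hypothetical option string; the option names are assumptions, not taken from the PR.
before = "enable_delay_scale_loss enable_release_grads some_other_option"

after = drop_options(before, ["enable_delay_scale_loss", "enable_release_grads"])
print(after)
```

Any option that is not listed for removal is kept in its original position, so unrelated settings in the same string are left untouched.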

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 54.18%. Comparing base (cd2a70e) to head (d98e9e7).

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop    #8623   +/-   ##
========================================
  Coverage    54.18%   54.18%
========================================
  Files          625      625
  Lines        98947    98947
========================================
  Hits         53618    53618
  Misses       45329    45329
```
