Allow to pre alloc memory for pretraining for better memory use.

PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

https://paddlenlp.readthedocs.io

Apache License 2.0


Allow to pre alloc memory for pretraining for better memory use. #8600

Open Xreki opened 3 months ago

Xreki commented 3 months ago

PR types

Others

PR changes

Others

Description

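For the Llama-2 70B model with the training strategy tp4pp8-vpp5-mbs1-acc32 (sp enabled), training runs stably for 50 steps when the release_grads option is not enabled.

With release_grads enabled, training tends to hit OOM after a number of steps. The cause is that release_grads frees the memory occupied by gradients after every step and reallocates it in the next step. This increases the number of GPU memory operations and therefore easily leads to memory fragmentation. Adding a memory pre-allocation feature (pre_alloc_memory), which reserves one large block of GPU memory for training in advance, avoids this problem.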
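To illustrate the idea, here is a minimal sketch of pre-allocation and not the code from this PR. It assumes that allocating one large tensor and then dropping it leaves the block in the framework's caching allocator, so later gradient allocations can be served from that block; whether that holds depends on the allocator strategy in use. The helper names `gib_to_bytes`, `_paddle_alloc`, and the `alloc_fn` parameter are hypothetical; only `pre_alloc_memory` is a name taken from the description above, where it refers to the option.

```python
from typing import Callable

GIB = 1024 ** 3  # bytes per GiB


def gib_to_bytes(size_gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    if size_gib < 0:
        raise ValueError("size_gib must be non-negative")
    return int(size_gib * GIB)


def _paddle_alloc(num_bytes: int):
    # Assumption: a uint8 tensor with num_bytes elements occupies
    # num_bytes bytes on the current device. Paddle is imported lazily
    # so the rest of the sketch runs without it installed.
    import paddle

    return paddle.empty([num_bytes], dtype="uint8")


def pre_alloc_memory(size_gib: float,
                     alloc_fn: Callable[[int], object] = _paddle_alloc) -> int:
    """Allocate one large block, then drop the reference to it.

    Returns the number of bytes requested. With a caching allocator the
    freed block is assumed to stay in the pool for reuse (hypothetical
    behaviour, not verified against this PR).
    """
    num_bytes = gib_to_bytes(size_gib)
    if num_bytes == 0:
        return 0  # pre-allocation disabled
    block = alloc_fn(num_bytes)
    del block  # release the tensor; the allocator may keep the memory
    return num_bytes
```

A call such as `pre_alloc_memory(60)` would be made once, before the first training step, with the size chosen to leave headroom on the device.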

paddle-bot[bot] commented 3 months ago

Thanks for your contribution!

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.81%. Comparing base (5619cc3) to head (548db29). Report is 132 commits behind head on develop.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop    #8600   +/-   ##
========================================
  Coverage    55.81%   55.81%
========================================
  Files          620      620
  Lines        96599    96599
========================================
  Hits         53917    53917
  Misses       42682    42682
```


github-actions[bot] commented 1 month ago

This Pull Request is stale because it has been open for 60 days with no activity.