PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
11.99k stars 2.93k forks

Allow pre-allocating memory for pretraining to improve memory use. #8600

Open Xreki opened 3 months ago

Xreki commented 3 months ago

PR types

Others

PR changes

Others

Description

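Llama-2 70B model, training strategy tp4pp8-vpp5-mbs1-acc32 (with sp enabled). Without the release_grads option, training runs stably for 50 steps: (image)

With release_grads enabled, training tends to hit OOM after a few steps. The cause is that release_grads frees the memory occupied by gradients after every step and reallocates it in the next step. This increases the number of GPU memory operations, which easily leads to memory fragmentation. Adding a memory pre-allocation feature (pre_alloc_memory), which reserves one large block of GPU memory for training up front, avoids this problem.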
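The fragmentation mechanism described above can be illustrated with a small toy model. The sketch below is not Paddle's allocator and does not use any code from this PR; `ToyChunkAllocator` and `run_steps` are hypothetical names, and the sizes are arbitrary units. It assumes an allocator that obtains device memory in chunks, caches freed chunks, and can merge free blocks only within a single chunk. Under that assumption, reserving one large chunk up front and releasing it back to the cache keeps later allocations contiguous.

```python
class ToyChunkAllocator:
    """Toy model of a caching allocator (an assumption, not Paddle's real one).

    Device memory is obtained in chunks. Freed blocks stay cached and can be
    merged with neighbours in the same chunk, but never across chunks.
    """

    def __init__(self, device_capacity):
        self.device_free = device_capacity
        self.chunks = []  # each chunk is a list of [offset, size, in_use]

    def alloc(self, size):
        # First fit: reuse a cached free block if one is large enough.
        for ci, blocks in enumerate(self.chunks):
            for bi, (off, bsize, used) in enumerate(blocks):
                if not used and bsize >= size:
                    blocks[bi] = [off, size, True]
                    if bsize > size:
                        # Keep the remainder as a free block in the same chunk.
                        blocks.insert(bi + 1, [off + size, bsize - size, False])
                    return (ci, off)
        # Otherwise grow by requesting a new chunk from the device.
        if size > self.device_free:
            raise MemoryError(f"cannot allocate {size} units")
        self.device_free -= size
        self.chunks.append([[0, size, True]])
        return (len(self.chunks) - 1, 0)

    def free(self, handle):
        ci, off = handle
        for block in self.chunks[ci]:
            if block[0] == off and block[2]:
                block[2] = False
        # Merge adjacent free blocks inside this chunk only.
        merged = []
        for block in self.chunks[ci]:
            if merged and not merged[-1][2] and not block[2]:
                merged[-1][1] += block[1]
            else:
                merged.append(block)
        self.chunks[ci] = merged


def run_steps(pre_alloc):
    """Return True if two toy training steps finish without MemoryError."""
    allocator = ToyChunkAllocator(device_capacity=100)
    if pre_alloc:
        # Reserve one large chunk, then hand it back to the cache.
        allocator.free(allocator.alloc(100))
    try:
        # Step 1: gradient buffers plus a temporary buffer, all released.
        grads = [allocator.alloc(10) for _ in range(4)]
        temp = allocator.alloc(30)
        allocator.free(temp)
        for g in grads:
            allocator.free(g)
        # Step 2: gradients are reallocated, then a larger buffer is needed.
        grads = [allocator.alloc(10) for _ in range(4)]
        allocator.alloc(50)
    except MemoryError:
        return False
    return True


if __name__ == "__main__":
    print("without pre-allocation:", run_steps(pre_alloc=False))
    print("with pre-allocation:", run_steps(pre_alloc=True))
```

In this toy run, the second step fails without pre-allocation even though 60 units are free in total, because they are split between a 30-unit cached chunk and 30 units of unclaimed device memory. With pre-allocation, all buffers are carved from one chunk, so freed space merges back into a single block.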

paddle-bot[bot] commented 3 months ago

Thanks for your contribution!

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 55.81%. Comparing base (5619cc3) to head (548db29). Report is 132 commits behind head on develop.

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop    #8600   +/-   ##
========================================
  Coverage    55.81%   55.81%
========================================
  Files          620      620
  Lines        96599    96599
========================================
  Hits         53917    53917
  Misses       42682    42682
```


github-actions[bot] commented 1 month ago

This Pull Request is stale because it has been open for 60 days with no activity.