PaddlePaddle / Paddle

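PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)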
http://www.paddlepaddle.org/
Apache License 2.0
21.66k stars, 5.44k forks

sharding stage1 V1 support Broadcast overlap Forward #63945

Closed: iosmers closed this 1 week ago

iosmers commented 3 weeks ago

PR Category

Performance Optimization

PR Types

Performance

Description

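1. This PR overlaps the parameter broadcast in sharding stage1 V1 with the forward computation to improve performance.
2. Correctness validation: the results before and after the optimization are bitwise aligned.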

[Figure: correctness comparison, before vs. after the optimization]
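To show why overlapping the broadcast with the forward pass can help, here is a minimal timeline model. It is not Paddle's implementation. It assumes (hypothetically) that parameters are broadcast layer by layer on a separate communication stream, issued back to back, and that each layer's forward can start only once its own parameters have arrived. The per-layer costs are made-up units.

```python
from typing import Sequence


def no_overlap_time(bcast: Sequence[float], fwd: Sequence[float]) -> float:
    """Every broadcast finishes before the forward pass starts."""
    return sum(bcast) + sum(fwd)


def overlap_time(bcast: Sequence[float], fwd: Sequence[float]) -> float:
    """Broadcasts run on a separate comm stream (assumed per-layer granularity).

    Layer i's forward waits only for its own broadcast and for layer i-1's forward.
    """
    comm_end = 0.0  # time at which broadcast i completes
    comp_end = 0.0  # time at which forward i completes
    for b, f in zip(bcast, fwd):
        comm_end += b  # broadcasts are issued back to back on the comm stream
        comp_end = max(comp_end, comm_end) + f
    return comp_end


# Hypothetical per-layer costs (arbitrary units).
bcast = [2.0, 2.0, 2.0]
fwd = [3.0, 3.0, 3.0]
print(no_overlap_time(bcast, fwd))  # 15.0
print(overlap_time(bcast, fwd))  # 11.0
```

In this model only the first layer's broadcast remains exposed; the saving is bounded by how much broadcast time can be hidden behind compute, so the end-to-end gain depends on the communication-to-compute ratio.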

3. Performance improvement on LLaMA-7B with sharding degree 8:

no_overlap    overlap      speedup
6877          7053.3278    2.6%
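The speedup column can be reproduced from the two measurements, assuming they are throughput figures where higher is better (the PR does not state the unit).

```python
def speedup_pct(baseline: float, optimized: float) -> float:
    """Relative throughput gain in percent (assumes higher throughput is better)."""
    return (optimized / baseline - 1.0) * 100.0


gain = speedup_pct(6877, 7053.3278)
print(f"{gain:.1f}%")  # 2.6%
```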

4. Timeline analysis:
[Figure: timeline profile]

card-13678

paddle-bot[bot] commented 3 weeks ago

Your PR has been submitted. Thanks for your contribution! Please wait for the result of CI firstly. See Paddle CI Manual for details.

paddle-ci-bot[bot] commented 1 week ago

Sorry to inform you that eb36ebe's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.