PaddlePaddle / Paddle

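PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)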
http://www.paddlepaddle.org/
Apache License 2.0
21.8k stars 5.47k forks

[Targeting 2024 Q4] ExponentialMovingAverage does not work with fleet DistributedStrategy #51779

Open Tom-Zheng opened 1 year ago

Tom-Zheng commented 1 year ago

Describe the Bug

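Currently, EMA only produces correct results when without_graph_optimization = True is set. See: https://github.com/PaddlePaddle/Paddle/blob/ea22fdb0aecfc51258c04a73ab801783b7e163d9/python/paddle/fluid/tests/unittests/test_ema_fleet.py#L39 If that line is removed, the test fails.

This workaround has a large side effect: it disables DistributedStrategy entirely, so EMA cannot be combined with other optimizations. It should be treated as a bug.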
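For readers unfamiliar with what is being computed, here is a minimal pure-Python sketch of an exponential moving average over parameters with bias correction. It is only an illustration of the technique: `SimpleEMA` and its methods are hypothetical names, not Paddle's `ExponentialMovingAverage` API, and the zero initialization with bias correction is an assumption about the intended behavior.

```python
class SimpleEMA:
    """Illustrative EMA tracker (hypothetical helper, not Paddle's API)."""

    def __init__(self, decay=0.999):
        self.decay = decay
        self.step = 0       # number of updates applied so far
        self.shadow = {}    # running (uncorrected) averages, keyed by name

    def update(self, params):
        """Fold the current parameter values into the running averages."""
        self.step += 1
        for name, value in params.items():
            prev = self.shadow.get(name, 0.0)  # assumed zero initialization
            self.shadow[name] = self.decay * prev + (1.0 - self.decay) * value

    def averaged(self):
        """Return bias-corrected averages: shadow / (1 - decay**step)."""
        if self.step == 0:
            raise ValueError("update() must be called at least once")
        correction = 1.0 - self.decay ** self.step
        return {name: v / correction for name, v in self.shadow.items()}


ema = SimpleEMA(decay=0.5)
ema.update({"w": 1.0})
ema.update({"w": 3.0})
result = ema.averaged()
```

The averaged values are what would be swapped in for evaluation, which is why a wrong EMA result shows up directly as an accuracy difference.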

Additional Supplementary Information

No response

ForFishes commented 1 year ago

Hello, EMA is currently only supported when running a plain program. Under other strategies, EMA requires additional adaptation.

Tom-Zheng commented 1 year ago

We currently need this for our PPYOLOE+ optimization work. Without EMA, AP drops by 0.7% (53.5% -> 52.8%). Please consider supporting it.

LiYuRio commented 1 year ago

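All this line does is run the program with the original executor instead of ParallelExecutor.

Is this in static graph mode, and do you need ParallelExecutor for graph optimization? The framework has now replaced ParallelExecutor with the new executor, and performance is roughly on par. Could you describe your use case in more detail?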

Tom-Zheng commented 1 year ago

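Based on the earlier discussion, this issue is related to Paddle's execution mechanism: enabling EMA prevents the IR graph passes from being executed. The relevant owner needs to fix this. cc: @LiYuRio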

jeng1220 commented 1 year ago

Let's discuss this again after Q3.

onecatcn commented 5 months ago

Since the PPYOLOE project is on hold, we will check this issue in 24H2.

Tom-Zheng commented 1 month ago

Move to Q4