PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle "飞桨": high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Precision differences between dynamic and static graph modes #62679

Open TimeYWL opened 5 months ago

TimeYWL commented 5 months ago

Please ask your question

I ran one of Paddle's unit tests on a Hygon DCU: https://github.com/PaddlePaddle/Paddle/blob/3f77e6a4543e167373a6e21c48638fc213d2a20b/test/legacy_test/test_adamw_op.py#L718. Under the rtol=1e-6 tolerance used in that case, the test fails the precision check at step=3. However, when I built the same test model in static graph mode, it passed the precision check. I compared the intermediate results produced by the two modes: most values match exactly, but a small fraction differ in their significant digits. Is there a precision difference between dynamic and static graph modes? If so, where does it come from? The static graph code is attached below:

import unittest
from functools import partial

import numpy as np

import paddle
import paddle.fluid as fluid

# simple_lr_setting and get_numpy_output are the reference helpers defined in
# test/legacy_test/test_adamw_op.py.
from test_adamw_op import get_numpy_output, simple_lr_setting


class TestAdamWOpLayerwiseLR(unittest.TestCase):
    def test_adamw_op(self):
        paddle.enable_static()
        place = fluid.CUDAPlace(0)

        learning_rate = 0.001
        beta1 = 0.9
        beta2 = 0.999
        weight_decay = 0.01
        epsilon = 1e-8

        train_prog = fluid.Program()
        startup = fluid.Program()
        with fluid.program_guard(train_prog, startup):
            with fluid.unique_name.guard():
                x = paddle.static.data(
                    name='x', shape=[None, 13], dtype='float32'
                )

                weight_attr1 = paddle.framework.ParamAttr(name="linear_1.w_0")
                bias_attr1 = paddle.framework.ParamAttr(
                    name="linear_1.b_0",
                    initializer=paddle.nn.initializer.Constant(value=1.0),
                )
                weight_attr2 = paddle.framework.ParamAttr(name="linear_2.w_0")
                bias_attr2 = paddle.framework.ParamAttr(
                    name="linear_2.b_0",
                    initializer=paddle.nn.initializer.Constant(value=1.0),
                )
                linear1 = paddle.nn.Linear(
                    13, 8, weight_attr=weight_attr1, bias_attr=bias_attr1
                )
                linear2 = paddle.nn.Linear(
                    8, 5, weight_attr=weight_attr2, bias_attr=bias_attr2
                )

                out = linear1(x)
                out = linear2(out)

                # NumPy mirrors of the AdamW moment accumulators, used by the
                # get_numpy_output reference computation below.
                fc1_w_mon1 = np.zeros(linear1.weight.shape).astype("float32")
                fc1_w_mon2 = np.zeros(linear1.weight.shape).astype("float32")
                fc1_b_mon1 = np.zeros(linear1.bias.shape).astype("float32")
                fc1_b_mon2 = np.zeros(linear1.bias.shape).astype("float32")
                fc2_w_mon1 = np.zeros(linear2.weight.shape).astype("float32")
                fc2_w_mon2 = np.zeros(linear2.weight.shape).astype("float32")
                fc2_b_mon1 = np.zeros(linear2.bias.shape).astype("float32")
                fc2_b_mon2 = np.zeros(linear2.bias.shape).astype("float32")

                avg_cost = paddle.mean(out)

                simple_lr_fun = partial(
                    simple_lr_setting, decay_rate=0.8, n_layers=2
                )

                opt = paddle.optimizer.AdamW(
                    learning_rate=learning_rate,
                    beta1=beta1,
                    beta2=beta2,
                    weight_decay=weight_decay,
                    epsilon=epsilon,
                    lr_ratio=simple_lr_fun,
                )
                opt.minimize(avg_cost)
        fetch_list1 = [
            "linear_1.w_0",
            "linear_1.b_0",
            "linear_2.w_0",
            "linear_2.b_0",
        ]
        fetch_list2 = [
            "linear_1.w_0",
            "linear_1.w_0@GRAD",
            "linear_1.b_0",
            "linear_1.b_0@GRAD",
            "linear_2.w_0",
            "linear_2.w_0@GRAD",
            "linear_2.b_0",
            "linear_2.b_0@GRAD",
        ]

        exe = fluid.Executor(place)
        exe.run(startup)
        test_prog = train_prog.clone(for_test=True)

        for i in range(5):
            inputs = np.random.uniform(-1, 1, (2, 13)).astype("float32")
            # Fetch the current parameter values; test_prog is a for_test
            # clone, so this run does not update them.
            param = exe.run(
                test_prog,
                feed={"x": inputs},
                fetch_list=fetch_list1,
            )
            # Run one AdamW step and fetch the updated parameters together
            # with this step's gradients.
            params_and_grads = exe.run(
                train_prog,
                feed={"x": inputs},
                fetch_list=fetch_list2,
            )
            # exe.run returns numpy arrays by default (return_numpy=True),
            # so no .numpy() conversion is needed on the fetched results.
            fc1_w = param[0]
            fc1_w_grad = params_and_grads[1]
            fc1_b = param[1]
            fc1_b_grad = params_and_grads[3]
            fc2_w = param[2]
            fc2_w_grad = params_and_grads[5]
            fc2_b = param[3]
            fc2_b_grad = params_and_grads[7]
            fc1_w, fc1_w_mon1, fc1_w_mon2 = get_numpy_output(
                fc1_w,
                fc1_w_grad,
                fc1_w_mon1,
                fc1_w_mon2,
                simple_lr_fun(linear1.weight),
                i + 1,
            )
            fc1_b, fc1_b_mon1, fc1_b_mon2 = get_numpy_output(
                fc1_b,
                fc1_b_grad,
                fc1_b_mon1,
                fc1_b_mon2,
                simple_lr_fun(linear1.bias),
                i + 1,
            )
            fc2_w, fc2_w_mon1, fc2_w_mon2 = get_numpy_output(
                fc2_w,
                fc2_w_grad,
                fc2_w_mon1,
                fc2_w_mon2,
                simple_lr_fun(linear2.weight),
                i + 1,
            )
            fc2_b, fc2_b_mon1, fc2_b_mon2 = get_numpy_output(
                fc2_b,
                fc2_b_grad,
                fc2_b_mon1,
                fc2_b_mon2,
                simple_lr_fun(linear2.bias),
                i + 1,
            )
            print(params_and_grads[0])
            np.testing.assert_allclose(params_and_grads[0], fc1_w, rtol=1e-6)
            np.testing.assert_allclose(params_and_grads[2], fc1_b, rtol=1e-6)
            np.testing.assert_allclose(params_and_grads[4], fc2_w, rtol=1e-6)
            np.testing.assert_allclose(params_and_grads[6], fc2_b, rtol=1e-6)
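
For context, the NumPy baseline that the assertions above compare against comes from get_numpy_output in test_adamw_op.py. Below is a minimal sketch of the update rule it is expected to compute, assuming Paddle's AdamW semantics (decoupled weight decay applied before the Adam step, with the learning rate scaled by lr_ratio); the function name adamw_reference_step is hypothetical, and the authoritative version is the reference code in test_adamw_op.py:

import numpy as np

def adamw_reference_step(
    param, grad, moment1, moment2, lr, lr_ratio, beta1, beta2, epsilon, coeff, t
):
    # Hypothetical sketch of one AdamW step in float32 NumPy; see
    # get_numpy_output in test_adamw_op.py for the exact reference.
    lr = lr * lr_ratio
    # Decoupled weight decay, applied before the Adam update.
    param = param * (1.0 - lr * coeff)
    # Biased first and second moment estimates.
    moment1 = beta1 * moment1 + (1 - beta1) * grad
    moment2 = beta2 * moment2 + (1 - beta2) * np.square(grad)
    # Bias-corrected step, matching Paddle's adam kernel formulation.
    beta1_pow, beta2_pow = beta1**t, beta2**t
    lr_t = lr * np.sqrt(1 - beta2_pow) / (1 - beta1_pow)
    param = param - lr_t * (
        moment1 / (np.sqrt(moment2) + epsilon * np.sqrt(1 - beta2_pow))
    )
    return param, moment1, moment2

In float32, every one of these elementwise operations rounds, so a mathematically equivalent kernel that merely reorders them can change the last significant digits of param.
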
qili93 commented 5 months ago

Hello. This precision difference is likely caused by a difference in operators. Static graph and dynamic graph are just two execution modes of the PaddlePaddle framework and do not by themselves introduce precision differences, but the two modes may dispatch to different operator kernels, so the discrepancy is most likely produced by one of those kernels.
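
The symptom described above (most values identical, a few differing only in the last significant digits) is characteristic of floating-point non-associativity: two kernels that accumulate the same sum in a different order round differently in float32. A small self-contained illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype("float32")

# The same mathematical sum, accumulated in two different orders.
s_pairwise = np.sum(x)           # NumPy uses pairwise summation here
s_sequential = np.float32(0.0)
for v in x:                      # naive left-to-right accumulation
    s_sequential += v

# The two results typically agree in most digits but not all of them,
# which is exactly the kind of discrepancy a different kernel can cause.
print(s_pairwise, s_sequential, abs(s_pairwise - s_sequential))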

You can try turning on GLOG_v to dump the operator kernels that actually execute under static and dynamic graph mode, and then compare the two runs to see whether some kernel runs in dynamic mode but not in static mode.
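
For example, GLOG_v can be raised before Paddle is imported so that the verbose log records kernel selection; running the same model once in each mode then lets you diff the two logs. A minimal sketch (the tiny model below is a stand-in for the test's two-layer network):

import os

# glog reads GLOG_v at startup, so set it before importing paddle.
os.environ["GLOG_v"] = "4"

import paddle

x = paddle.randn([2, 13])
linear = paddle.nn.Linear(13, 8)

# Dynamic graph run: kernel choices show up in the verbose log.
y = linear(x)

# Re-run the same model after paddle.enable_static() (as in the attached
# test) and compare the kernels recorded in the two logs.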

Thanks!