PaddlePaddle / Paddle

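PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)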
http://www.paddlepaddle.org/
Apache License 2.0
22.2k stars 5.58k forks

[Conv2D forward results differ between CPU and GPU] #67354

Open zhudequan9 opened 2 months ago

zhudequan9 commented 2 months ago

Describe the Bug

BUG:

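A model containing only two Conv2D layers is instantiated once on the CPU and once on the GPU, loaded with the same weights, and fed the same input data. Comparing the outputs, only the first three significant digits agree.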

Test environment

The test environment is the official Docker image registry.baidubce.com/paddlepaddle/paddle:2.6.1-gpu-cuda12.0-cudnn8.9-trt8.6, taken from the list at https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/docker_list.html

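Bugs in the official Docker images:

  1. The cuDNN version that Paddle actually loads does not match the image description. For example, my image is tagged cuDNN 8.9, but the runtime reports 8.8 and prints a WARNING.
  2. Two images do not contain Paddle at all: registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda11.8-cudnn8.6-trt8.5-gcc82 and registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.3-cudnn9.0-trt8.6-gcc12.2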
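
The version mismatch in item 1 can be checked from Python as well as from the startup warning. The following is a minimal sketch, not part of the original report: the helper name `cudnn_versions` is hypothetical, and it assumes `paddle.version.cudnn()` reports the cuDNN version Paddle was built against while `paddle.device.get_cudnn_version()` reports the one found at runtime. It returns `None` when Paddle is not installed, which is also a quick way to spot the images described in item 2.

```python
def cudnn_versions():
    """Return (built_against, found_at_runtime), or None if Paddle is missing.

    Hypothetical helper for illustration only.
    """
    try:
        import paddle
    except ImportError:
        # Matches the situation in item 2: the image ships without Paddle.
        return None
    built_against = paddle.version.cudnn()             # build-time cuDNN version
    found_at_runtime = paddle.device.get_cudnn_version()  # None on CPU-only setups
    return built_against, found_at_runtime


if __name__ == "__main__":
    print(cudnn_versions())
```

If the two values disagree, the image description and the installed libraries are out of sync, as the warning in the output log suggests.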

Test code (the model weights file is attached at the end):

    import numpy as np
    import paddle


    class PaddleModel(paddle.nn.Layer):
        """Two stacked 3x3 convolutions without padding: 5x5 input -> 1x1 output."""

        def __init__(self):
            super().__init__()
            self.conv1 = paddle.nn.Conv2D(in_channels=1, out_channels=32, kernel_size=3, stride=1, padding=0)
            self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=1, kernel_size=3, stride=1, padding=0)

        def forward(self, x):
            x = self.conv1(x)
            x = self.conv2(x)
            return x


    if __name__ == '__main__':
        # Same input and same weights for both devices.
        np.random.seed(42)
        fake_data = np.random.rand(1, 1, 5, 5).astype(np.float32)
        model_state_dict = paddle.load("paddle_model_weights.pdparams")

        # Forward pass on the GPU.
        paddle.set_device("gpu")
        gpu_model = PaddleModel()
        gpu_model.set_state_dict(model_state_dict)
        gpu_input = paddle.to_tensor(fake_data, dtype="float32")
        gpu_output = gpu_model(gpu_input)

        # Forward pass on the CPU.
        paddle.set_device("cpu")
        cpu_model = PaddleModel()
        cpu_model.set_state_dict(model_state_dict)
        cpu_input = paddle.to_tensor(fake_data, dtype="float32")
        cpu_output = cpu_model(cpu_input)

        print(f"{gpu_output}\n{cpu_output}")

Output:

    
    grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
    W0812 13:25:28.579708  4051 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 12.0
    W0812 13:25:28.580489  4051 gpu_resources.cc:164] device: 0, cuDNN Version: 8.8.
    W0812 13:25:28.581843  4051 gpu_resources.cc:299] WARNING: device:  . The installed Paddle is compiled with CUDNN 8.9, but CUDNN version in your machine is 8.8, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
    Tensor(shape=[1, 1, 1, 1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [[[[0.80317044]]]])
    Tensor(shape=[1, 1, 1, 1], dtype=float32, place=Place(cpu), stop_gradient=False,
       [[[[0.80359638]]]])

Process finished with exit code 0
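
To put a number on "only three significant digits agree", the two printed values can be compared directly. This is a small NumPy sketch that uses only the two outputs shown above. The variable names are illustrative. The TF32 rounding unit is included for scale only: TF32 keeps 10 explicit mantissa bits, but nothing in the log establishes that TF32 was actually used in this run.

```python
import numpy as np

# Values copied from the printed GPU and CPU tensors.
gpu = np.float32(0.80317044)
cpu = np.float32(0.80359638)

abs_diff = abs(float(gpu) - float(cpu))
rel_diff = abs_diff / abs(float(cpu))

# Number of leading significant digits on which the two values agree.
matching_digits = int(np.floor(-np.log10(rel_diff)))

eps32 = float(np.finfo(np.float32).eps)  # float32 machine epsilon
tf32_unit = 2.0 ** -11                   # rounding unit of a 10-bit mantissa (scale only)

print(f"abs diff                  = {abs_diff:.3e}")
print(f"rel diff                  = {rel_diff:.3e}")
print(f"matching digits           = {matching_digits}")
print(f"rel diff / float32 eps    = {rel_diff / eps32:.0f}")
print(f"rel diff / 2^-11          = {rel_diff / tf32_unit:.2f}")
print("close at rtol=1e-5?       =", bool(np.isclose(gpu, cpu, rtol=1e-5, atol=0)))
```

The relative difference is several thousand times the float32 epsilon, which is larger than what a different summation order in plain float32 would usually produce for a model this small.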


### Model weights file
[paddle_model_weights.zip](https://github.com/user-attachments/files/16581840/paddle_model_weights.zip)

### Additional Supplementary Information

_No response_

FeixLiu commented 2 months ago

Thanks for the report. I will ask our API owner to take a look. @jerrywgz, could you please look into this issue?

jerrywgz commented 2 months ago

Convolution on the GPU goes through cuDNN algorithms, whose order of computation may differ from the CPU's. You could try the following flags:

export FLAGS_embedding_deterministic=1
export FLAGS_cudnn_deterministic=1
export NVIDIA_TF32_OVERRIDE=0
export NCCL_ALGO=Tree

On an A100 with the Paddle 3.0 beta, the results do align in my test. The configuration is shown in the log below. [image]
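
For scripts launched from an IDE, where shell `export` lines are easy to miss, the same four variables can be set from Python. This is a sketch under one assumption: that Paddle and the CUDA libraries read these variables when they initialize, so they have to be in place before `import paddle`. The names `DETERMINISM_ENV` and `apply_determinism_env` are hypothetical.

```python
import os

# The four variables suggested above, with the same values.
DETERMINISM_ENV = {
    "FLAGS_embedding_deterministic": "1",
    "FLAGS_cudnn_deterministic": "1",
    "NVIDIA_TF32_OVERRIDE": "0",
    "NCCL_ALGO": "Tree",
}


def apply_determinism_env(env=None):
    """Write the variables into `env` (defaults to os.environ) and return it."""
    target = os.environ if env is None else env
    target.update(DETERMINISM_ENV)
    return target


if __name__ == "__main__":
    apply_determinism_env()
    # Assumption: import Paddle only after the variables are set.
    # import paddle
```

Passing a plain dict instead of `os.environ` makes it possible to build an environment for a subprocess without touching the current process.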

zhudequan9 commented 2 months ago

Thanks for your reply! I added the export commands and tested with registry.baidubce.com/paddlepaddle/paddle:3.0.0b1-gpu-cuda11.8-cudnn8.6-trt8.5, but the results still do not align completely.

jerrywgz commented 2 months ago

I will compare again. It now looks like your GPU result matches mine in precision, while the CPU result differs slightly.
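
One way to decide which device deviates is to compare both outputs against a float64 reference computed outside Paddle. The sketch below is a plain NumPy implementation of a stride-1, no-padding convolution (cross-correlation) with weights laid out as [out_channels, in_channels, kH, kW]. It uses random weights so that it runs on its own. Feeding it the real weights would mean converting the loaded state dict to NumPy arrays, and the key names for that (for example `conv1.weight`) are an assumption here.

```python
import numpy as np


def conv2d_valid(x, w, b):
    """Stride-1, no-padding 2D convolution computed in float64.

    x: [N, C_in, H, W], w: [C_out, C_in, kH, kW], b: [C_out]
    """
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n, _, h, wd = x.shape
    c_out, _, kh, kw = w.shape
    out = np.empty((n, c_out, h - kh + 1, wd - kw + 1), dtype=np.float64)
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            patch = x[:, :, i:i + kh, j:j + kw]  # [N, C_in, kH, kW]
            # Contract channel and kernel axes, leaving [N, C_out].
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3])) + b
    return out


# Stand-in data with the same shapes as the model in this issue.
rng = np.random.default_rng(0)
x = rng.random((1, 1, 5, 5)).astype(np.float32)
w1 = rng.standard_normal((32, 1, 3, 3)).astype(np.float32)
b1 = rng.standard_normal(32).astype(np.float32)
w2 = rng.standard_normal((1, 32, 3, 3)).astype(np.float32)
b2 = rng.standard_normal(1).astype(np.float32)

reference = conv2d_valid(conv2d_valid(x, w1, b1), w2, b2)
print(reference.shape, reference.ravel()[0])
```

Whichever of the CPU and GPU outputs lies closer to the float64 reference is the more accurate one for that input.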

jerrywgz commented 2 months ago

Could you share the exact CPU model?
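
Inside a Docker container the CPU model is usually readable from `/proc/cpuinfo`. The snippet below is a standard-library sketch for collecting it. The function name `cpu_model` is hypothetical, and it falls back to `platform` when the file or the `model name` field is unavailable.

```python
import platform


def cpu_model():
    """Best-effort CPU model string: /proc/cpuinfo first, then platform."""
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as f:
            for line in f:
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass  # Not Linux, or the file is not readable.
    return platform.processor() or platform.machine()


if __name__ == "__main__":
    print(cpu_model())
```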