PaddlePaddle / Paddle

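PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)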
http://www.paddlepaddle.org/
Apache License 2.0

After pr_50915 was merged, TRT (TensorRT) inference accuracy of the AFQMC_base and AFQMC_PTQ_1 models dropped on the develop branch across multiple environments #53525

Closed EmmonsCurse closed 1 year ago

EmmonsCurse commented 1 year ago

Describe the Bug

Error information

PR that introduced the error: https://github.com/PaddlePaddle/Paddle/pull/50915

Test case location: https://github.com/PaddlePaddle/PaddleTest/tree/develop/inference/python_api_test/test_nlp_model

Image (recommended): registry.baidubce.com/paddlepaddle/paddle_manylinux_devel:cuda10.2-cudnn7.6-trt7.0-gcc8.2

Error type: TRT inference accuracy of the AFQMC_base and AFQMC_PTQ_1 models drops on the develop branch across multiple environments

  1. test_AFQMC_PTQ_trt_int8:
    • Error message when the accuracy diff check fails: E AssertionError: total: 80 diff count:80 max:1.9401946067810059 delta:0.4
    • Result of a normal run: total: 80 diff count:0 max:0.09548187255859375 delta:0.4
  2. test_AFQMC_base_trt_fp32:
    • Error message when the accuracy diff check fails: E AssertionError: total: 80 diff count:80 max:0.1817956417798996 delta:1e-05
    • Result of a normal run: total: 80 diff count:0 max:5.513429641723633e-07 delta:1e-05
  3. test_AFQMC_base_trt_fp16:
    • Error message when the accuracy diff check fails: AssertionError: total: 80 diff count:80 max:0.18234875798225403 delta:0.01
    • Result of a normal run: total: 80 diff count:0 max:0.0031063444912433624 delta:0.01
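Each message above carries four fields: the number of compared values (total), how many of them exceeded the tolerance (diff count), the largest absolute difference (max), and the tolerance itself (delta). The sketch below shows one way a check could produce a message of that shape. The name `check_diff` and the comparison rule (absolute difference greater than delta) are assumptions for illustration; the actual PaddleTest implementation is not shown in this issue.

```python
def check_diff(expected, actual, delta):
    """Compare two equal-length sequences of floats element by element.

    Hypothetical helper: the field names mirror the log lines in this issue,
    but the rule "abs(expected - actual) > delta counts as a diff" is an
    assumption, not the confirmed PaddleTest logic.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")

    diffs = [abs(e - a) for e, a in zip(expected, actual)]
    total = len(diffs)
    diff_count = sum(1 for d in diffs if d > delta)
    max_diff = max(diffs) if diffs else 0.0

    message = f"total: {total} diff count:{diff_count} max:{max_diff} delta:{delta}"
    # Fail with the summary as the assertion message, as in the logs above.
    assert diff_count == 0, message
    return message
```

Under this reading, "diff count:80" with "total: 80" means that every compared output value was outside the tolerance, not just a few outliers.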

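Steps to Reproduce

Environment setup

Steps:

1. Create a container from the image above and set up the environment
1.1 Start the container and enter it
1.2 Configure dependencies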

export LD_LIBRARY_PATH=/opt/_internal/cpython-3.7.0/lib/:${LD_LIBRARY_PATH}
export PATH=/opt/_internal/cpython-3.7.0/bin/:${PATH}
export PYTHON_FLAGS="-DPYTHON_EXECUTABLE:FILEPATH=/opt/_internal/cpython-3.7.0/bin/python3.7 -DPYTHON_INCLUDE_DIR:PATH=/opt/_internal/cpython-3.7.0/include/python3.7 -DPYTHON_LIBRARIES:FILEPATH=/opt/_internal/cpython-3.7.0/lib/libpython3.so"
ln -s /usr/lib64/libnvidia-ml.so.* /usr/lib64/libnvidia-ml.so.1;
ldconfig;

1.3 Download the test cases, then install their dependencies and the paddle whl that matches the recommended image:

git clone https://github.com/PaddlePaddle/PaddleTest.git --depth=1

cd ./PaddleTest/inference/python_api_test

python -m pip install -r ./requirements.txt -i https://mirror.baidu.com/pypi/simple

wget -q --no-proxy --no-check-certificate https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda102-Trtoff-Py37-Compile/9c4065316d89f3f57a9b58437104a29d1837b84a/paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl

python -m pip install paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl

Build for the current commit (Failed): https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda102-Trtoff-Py37-Compile/9c4065316d89f3f57a9b58437104a29d1837b84a/paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl
Build for the previous commit (Passed): https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda102-Trtoff-Py37-Compile/64adfe7a16929c92536be5c0e0699a7bc8db053d/paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl
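The two wheel URLs differ only in the commit hash, which makes it easy to switch between the passing and failing builds when comparing results. The sketch below assembles the URL from a commit hash. The helper `wheel_url` is hypothetical, and the assumption that this URL layout holds for commits other than the two listed here is not confirmed by the issue.

```python
# Base path and wheel file name copied from the two URLs in this issue.
BASE = (
    "https://paddle-qa.bj.bcebos.com/paddle-pipeline/"
    "Develop-GpuAll-LinuxCentos-Gcc82-Cuda102-Trtoff-Py37-Compile"
)
WHEEL = "paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl"

# Commit hashes taken from the issue text.
FAILED_COMMIT = "9c4065316d89f3f57a9b58437104a29d1837b84a"
PASSED_COMMIT = "64adfe7a16929c92536be5c0e0699a7bc8db053d"


def wheel_url(commit):
    """Return the wheel URL for a commit.

    Hypothetical helper: assumes the <base>/<commit>/<wheel> layout seen in
    the two URLs above.
    """
    return f"{BASE}/{commit}/{WHEEL}"


if __name__ == "__main__":
    for label, commit in (("passed", PASSED_COMMIT), ("failed", FAILED_COMMIT)):
        print(label, wheel_url(commit))
```

Installing the passing wheel, running the cases, and then repeating with the failing wheel in the same container should isolate the change to the commit range between the two builds.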

Build command: cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_GPU=ON -DWITH_CUDNN_DSO=OFF -DWITH_TENSORRT=ON -DWITH_ROCM=OFF -DWITH_CINN=OFF -DWITH_DISTRIBUTE=ON -DWITH_MKL=ON -DWITH_AVX=ON -DCUDA_ARCH_NAME=Manual -DNEW_RELEASE_PYPI=ON -DNEW_RELEASE_ALL=OFF -DNEW_RELEASE_JIT=OFF -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=OFF -DWITH_COVERAGE=OFF -DWITH_INCREMENTAL_COVERAGE=OFF -DCMAKE_MODULE_PATH=/opt/rocm/hip/cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DINFERENCE_DEMO_INSTALL_DIR=/root/.cache/inference_demo -DPY_VERSION=3.7 -DCMAKE_INSTALL_PREFIX=/paddle/build -DWITH_PSCORE=ON -DWITH_PSLIB=OFF -DWITH_GLOO=ON -DLITE_GIT_TAG=release/v2.10 -DWITH_XPU=OFF -DWITH_IPU=OFF -DXPU_SDK_ROOT= -DWITH_LITE=OFF -DWITH_XPU_BKCL=OFF -DWITH_ARM=OFF -DWITH_STRIP=ON -DON_INFER=ON -DWITH_HETERPS=OFF -DWITH_GPU_GRAPH=OFF -DWITH_FLUID_ONLY=OFF -DCUDA_ARCH_BIN=52 60 61 70 75 -DWITH_RECORD_BUILDTIME=OFF -DWITH_UNITY_BUILD=OFF -DWITH_ONNXRUNTIME=OFF -DWITH_CUDNN_FRONTEND=OFF

1.4 Run the cases

cd ./test_nlp_model
python -m pytest -m server -k trt --disable-warnings -sv ./test_AFQMC_PTQ_trt_int8.py
python -m pytest -m server -k trt --disable-warnings -sv ./test_AFQMC_base_trt_fp32.py
python -m pytest -m server -k trt --disable-warnings -sv ./test_AFQMC_base_trt_fp16.py
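The three commands above can also be driven from one script, which is convenient when repeating the run against different wheels. This is a minimal sketch: the helper names `pytest_command` and `run_all` are hypothetical, and it launches pytest with the current interpreter (`sys.executable`) instead of whatever `python` resolves to on `PATH`.

```python
import subprocess
import sys

# Test files listed in this issue, relative to the test_nlp_model directory.
TEST_FILES = [
    "./test_AFQMC_PTQ_trt_int8.py",
    "./test_AFQMC_base_trt_fp32.py",
    "./test_AFQMC_base_trt_fp16.py",
]


def pytest_command(test_file):
    """Build the argv list equivalent to the pytest invocation shown above."""
    return [
        sys.executable, "-m", "pytest",
        "-m", "server",          # select tests marked "server"
        "-k", "trt",             # keep only tests whose name matches "trt"
        "--disable-warnings",
        "-sv",                   # no output capture, verbose
        test_file,
    ]


def run_all(cwd="."):
    """Run every test file in cwd and return {test_file: exit code}."""
    results = {}
    for test_file in TEST_FILES:
        completed = subprocess.run(pytest_command(test_file), cwd=cwd)
        results[test_file] = completed.returncode
    return results


if __name__ == "__main__":
    for test_file, code in run_all().items():
        status = "passed" if code == 0 else f"failed (exit code {code})"
        print(test_file, status)
```

Run it from inside `test_nlp_model`, or pass that directory as `cwd`.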

Additional Supplementary Information

@jinyouzhi @luotao1 @zhangting2020 Could you please take a look at this? Thanks.

luotao1 commented 1 year ago

@jinyouzhi, could you please take a look?

jinyouzhi commented 1 year ago

> @jinyouzhi, could you please take a look?

OK, I'll try to reproduce it.