PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training for deep learning and machine learning, with cross-platform deployment)
http://www.paddlepaddle.org/
Apache License 2.0

[bug]pr_60629 encountered another error of "RuntimeError: (PreconditionNotMet)" while running inference on the rec_r34_vd_tps_bilstm_attn model. #60957

Closed EmmonsCurse closed 9 months ago

EmmonsCurse commented 9 months ago

Describe the Bug

1. Error Description

After the merge of https://github.com/PaddlePaddle/Paddle/pull/60629, which fixed a bug in program_converter, another error occurred while running inference on the rec_r34_vd_tps_bilstm_attn model, as shown below:

1.1 GPU

test_rec_r34_vd_tps_bilstm_attn_gpu.py:46: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_gpu_helper.py:75: in get_infer_results
    AnalysisPredictor = Predictor(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <src.test_case.Predictor object at 0x7fdcbdbbacd0>
model_path = '/vdb1/112_workspace/continuous_integration/inference/inference_api_test/python_api_test/Data/python-ocr-infer/rec_r34_vd_tps_bilstm_attn'
predictor_mode = 'Analysis', config_type = 'gpu', batch_size = 1, min_subgraph_size = 1, trt_dynamic_shape_info = None

    def __init__(self,
                 model_path,
                 predictor_mode="Analysis",
                 config_type="cpu",
                 batch_size=1,
                 min_subgraph_size=1,
                 trt_dynamic_shape_info=None):
        """
        init configuration of predictor
        Args:
            model_path(string): the path of test model
            predictor_mode(strings): create native or analysis predictor
            config_type(strings): describe analysis prediction configuration
        """
        configs = DeployConfig(
            model_path=model_path,
            batch_size=batch_size,
            min_subgraph_size=min_subgraph_size,
            trt_dynamic_shape_info=trt_dynamic_shape_info)
        analysis_predictor_config = configs.analysis_config(config_type)

        logger.debug("analysis_predictor_config : {}".format(
            analysis_predictor_config))
        configs.summary_config(analysis_predictor_config)  # summary configs

        if predictor_mode == "Analysis":
            logger.info("current config is Analysis config")
>           predictor0 = base.core.create_paddle_predictor(
                analysis_predictor_config)
E           RuntimeError: (PreconditionNotMet) Tensor's dimension is out of bound.Tensor's dimension must be equal or less than the size of its memory.But received Tensor's dimension is 2116, memory's size is 0.
E             [Hint: Expected numel() * SizeOf(dtype()) <= memory_size(), but received numel() * SizeOf(dtype()):2116 > memory_size():0.] (at ../paddle/phi/core/dense_tensor_impl.cc:55)
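The check that fails here is the dense-tensor invariant from `paddle/phi/core/dense_tensor_impl.cc`: the number of elements implied by the tensor's declared shape, times the element size, must not exceed the allocated memory. A minimal sketch of that invariant in plain Python (the shape `(23, 23)` with `float32` is an assumption chosen only because it reproduces the reported numbers: 23 * 23 * 4 = 2116):

```python
# Sketch of the invariant enforced in dense_tensor_impl.cc:
#   numel() * SizeOf(dtype()) <= memory_size()
# Element sizes for a few common dtypes (illustrative subset).
ITEM_SIZE = {"float32": 4, "float64": 8, "int32": 4, "int64": 8}

def fits_in_memory(shape, dtype, memory_size_bytes):
    """Return True iff the declared shape fits in the allocated buffer."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel * ITEM_SIZE[dtype] <= memory_size_bytes

# The failing case from the trace: 2116 bytes required, 0 bytes allocated,
# i.e. the tensor has a declared shape but its memory was never initialized.
print(fits_in_memory((23, 23), "float32", 0))     # False
print(fits_in_memory((23, 23), "float32", 2116))  # True
```

A `memory_size` of 0 suggests the tensor's buffer was never allocated before the matmul touched it, rather than a genuine shape mismatch.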

1.2 CPU

test_rec_r34_vd_tps_bilstm_attn_cpu.py:45: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_cpu_helper.py:75: in get_infer_results
    res, ave_time = AnalysisPredictor.analysis_predict(data_path, repeats=2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <src.test_case.Predictor object at 0x7fe670cfb0a0>
json_dir = '/vdb1/112_workspace/continuous_integration/inference/inference_api_test/python_api_test/Data/python-ocr-infer/word_rec_data_3_32_100/data.json'
repeats = 2

    def analysis_predict(self, json_dir, repeats=1):
        """
        use zero copy and analysis config to predict
        Args:
            json_dir(string) : "*.json"
            repeats(int)
        Returns:
            outputs(list|[numpy.array, numpy.array]): list of numpy array
            ave_time(float): infer speed
        """
        # parse json from data file
        input_info = JsonInfo().parse_json(json_dir)
        # assign data to Tensor
        input_names = self.predictor.get_input_names()
        for i, input_data_name in enumerate(input_names):
            record = Record().load_data_from_json(input_info[i])
            record = next(record)
            logger.info("====> input_names[{0}] = {1} <====".format(
                i, input_names[i]))
            input_tensor = self.predictor.get_input_tensor(input_data_name)
            logger.debug("record.data shape is {}".format(record.data.shape))
            input_tensor.copy_from_cpu(record.data)
            if hasattr(record, 'lod'):
                input_tensor.set_lod([record.lod])

        cost_time = []
        for i in range(repeats):
            t1 = time.time()

>           self.predictor.zero_copy_run()
E           RuntimeError: In user code:
E           
E               File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
E               attrs=kwargs.get("attrs", None))
E           
E               File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
E               return self.main_program.current_block().append_op(*args, **kwargs)
E           
E               File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 6416, in matmul
E               attrs=attrs)
E           
E               File "/workspace/PaddleOCR_deplpy/PaddleOCR/ppocr/modeling/stns/tps.py", line 242, in __call__
E               batch_T = layers.matmul(inv_delta_C_tensor, batch_C_prime_with_zeros)
E           
E               File "/workspace/PaddleOCR_deplpy/PaddleOCR/ppocr/modeling/stns/tps.py", line 256, in __call__
E               batch_P_prime = self.grid_generator(batch_C_prime, I_r_size)
E           
E               File "/workspace/PaddleOCR_deplpy/PaddleOCR/ppocr/modeling/architectures/rec_model.py", line 110, in __call__
E               inputs = self.tps(image)
E           
E               File "/workspace/PaddleOCR_deplpy/PaddleOCR/tools/program.py", line 198, in build_export
E               image, outputs = model(mode='export')
E           
E               File "tools/export_model.py", line 67, in main
E               config, eval_program, startup_prog)
E           
E               File "tools/export_model.py", line 93, in <module>
E               main()
E           
E           
E               PreconditionNotMetError: Tensor's dimension is out of bound.Tensor's dimension must be equal or less than the size of its memory.But received Tensor's dimension is 2116, memory's size is 0.
E                 [Hint: Expected numel() * SizeOf(dtype()) <= memory_size(), but received numel() * SizeOf(dtype()):2116 > memory_size():0.] (at ../paddle/phi/core/dense_tensor_impl.cc:55)
E                 [operator < matmul > error]

../../src/test_case.py:282: RuntimeError

2. Operating environment

PaddlePaddle version: develop
OS version: CentOS
Python version: 3.8
GPU: T4
CPU info: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz

3. Reproduction steps

git clone https://github.com/PaddlePaddle/continuous_integration.git --depth=1

cd ./continuous_integration/inference/inference_api_test/python_api_test/

project_path=`pwd`
export project_path
cd ${project_path}

# download Data

mkdir -p ./Data
cd ./Data

# download models

wget --no-proxy -q https://sys-p0.bj.bcebos.com/inference/python-ocr-infer.tgz --no-check-certificate
tar -xvf python-ocr-infer.tgz

cd -

# requirements
python -m pip install --upgrade pip -i https://mirror.baidu.com/pypi/simple
python -m pip install -r requirements.txt -i https://mirror.baidu.com/pypi/simple

# download paddlepaddle_whl

## error_pr(failed)
#wget -q https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda112-Trtoff-Py38-Compile/d310158ecfaa49726af1c903e59ad535aa496808/paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl

## the previous commit of error_pr(passed)
#wget -q https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda112-Trtoff-Py38-Compile/2f5efb0287c1b66fb5446eb7fb8e5490dc1fd102/paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl

# install paddlepaddle_whl
python -m pip install paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl

# run case
cd ./tests/gpu
python -m pytest -sv test_rec_r34_vd_tps_bilstm_attn_gpu.py

cd ../cpu
python -m pytest -sv test_rec_r34_vd_tps_bilstm_attn_cpu.py
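The two CI wheel URLs commented out above differ only in the commit hash, so switching between the failing and passing builds is just a matter of substituting the hash. A small helper sketch, assuming the CI bucket layout stays stable (the base URL is taken from the links above):

```shell
# Build the CI wheel URL for a given commit hash (pattern taken from the
# two URLs quoted in this issue; assumes the bucket layout is unchanged).
BASE="https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda112-Trtoff-Py38-Compile"

wheel_url() {
  echo "${BASE}/$1/paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl"
}

# failing build (PR #60629 merged) vs passing build (previous commit)
wheel_url d310158ecfaa49726af1c903e59ad535aa496808
wheel_url 2f5efb0287c1b66fb5446eb7fb8e5490dc1fd102
```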

Additional Supplementary Information

@zyt1024 Can you help me solve this?

zyt1024 commented 9 months ago

OK, I will attempt to reproduce it and verify whether the error is caused by pull request #60629.

zyt1024 commented 9 months ago

@EmmonsCurse Hello, the changes in that PR do not touch the matmul functionality. Could you confirm again that this issue is caused by the PR "[fix bug]fix bug for program_converter"?

EmmonsCurse commented 9 months ago

@EmmonsCurse Hello, the changes in that PR do not touch the matmul functionality. Could you confirm again that this issue is caused by the PR "[fix bug]fix bug for program_converter"?

@zyt1024 Hello. First, please look at the complete error output: [operator < matmul > error] is only part of the CPU error message, and there is additional error information beyond it. Second, I have attached the reproduction steps, which clearly show that the problem appeared after your PR was merged, while the wheel built from the previous commit runs correctly. If in doubt, please follow the reproduction steps given above to verify. Thank you.

## error_pr(failed)
#wget -q https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda112-Trtoff-Py38-Compile/d310158ecfaa49726af1c903e59ad535aa496808/paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl

## the previous commit of error_pr(passed)
#wget -q https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-GpuAll-LinuxCentos-Gcc82-Cuda112-Trtoff-Py38-Compile/2f5efb0287c1b66fb5446eb7fb8e5490dc1fd102/paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl

zyt1024 commented 9 months ago

@EmmonsCurse Thank you for verifying. Could you help check whether any other bugs appear after this PR https://github.com/PaddlePaddle/Paddle/pull/61051 is merged?

EmmonsCurse commented 9 months ago

@EmmonsCurse Thank you for verifying. Could you help check whether any other bugs appear after PR #61051 is merged?

@zyt1024 OK 👌