PaddlePaddle / Paddle

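PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)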
http://www.paddlepaddle.org/
Apache License 2.0

[Paper Reproduction] Inference fails on GPU #46385

Status: Closed (fuqianya closed this issue 2 years ago)

fuqianya commented 2 years ago

Please ask your question

Runtime environment:

Inference with the model works correctly on CPU, but on GPU it fails with the following error:

(External) CUDA error(700), an illegal memory access was encountered.

The full error output is:

➜ python deploy/inference_python/infer.py --use-gpu True
Traceback (most recent call last):
  File "deploy/inference_python/infer.py", line 213, in <module>
    infer_main(args)
  File "deploy/inference_python/infer.py", line 185, in infer_main
    output = inference_engine.run(data)
  File "deploy/inference_python/infer.py", line 119, in run
    self.predictor.run()
OSError: In user code:

    File "tools/export_model.py", line 123, in <module>
      export(args, cfg)
    File "tools/export_model.py", line 117, in export
      paddle.jit.save(model, os.path.join(args.out_dir, "inference"))
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/jit.py", line 643, in wrapper
      func(layer, path, input_spec, **configs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/decorator.py", line 232, in fun
      return caller(func, *(extras + args), **kw)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/wrapped_decorator.py", line 26, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/base.py", line 52, in __impl__
      return func(*args, **kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/jit.py", line 921, in save
      inner_input_spec, with_hook=with_hook)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 578, in concrete_program_specify_input_spec
      is_train=self._is_train_mode())
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 483, in get_concrete_program
      concrete_program, partial_program_layer = self._program_cache[cache_key]
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 953, in __getitem__
      self._caches[item_id] = self._build_once(item)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 942, in _build_once
      **cache_key.kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/decorator.py", line 232, in fun
      return caller(func, *(extras + args), **kw)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/wrapped_decorator.py", line 26, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/base.py", line 52, in __impl__
      return func(*args, **kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 886, in from_func_spec
      outputs = static_func(*inputs)
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter_retrieval.py", line 50, in forward
      if self.training and compute_loss:
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/convert_operators.py", line 320, in convert_ifelse
      return_name_ids)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/convert_operators.py", line 376, in _run_py_ifelse
      py_outs = true_fn() if pred else false_fn()
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter_retrieval.py", line 58, in forward
      return self.compute_score(batch, compute_loss)
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter_retrieval.py", line 68, in compute_score
      sequence_output = self.uniter(input_ids, position_ids,
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 950, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/layers.py", line 935, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter.py", line 313, in forward
      if input_ids is None:
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/convert_operators.py", line 320, in convert_ifelse
      return_name_ids)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/convert_operators.py", line 376, in _run_py_ifelse
      py_outs = true_fn() if pred else false_fn()
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter.py", line 317, in forward
      elif img_feat is None:
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/convert_operators.py", line 320, in convert_ifelse
      return_name_ids)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/dygraph/dygraph_to_static/convert_operators.py", line 376, in _run_py_ifelse
      py_outs = true_fn() if pred else false_fn()
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter.py", line 322, in forward
      embedding_output = self._compute_img_txt_embeddings(
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/models/uniter.py", line 298, in _compute_img_txt_embeddings
      embedding_output = paddle_gather(paddle.concat([txt_emb, img_emb], axis=1),
    File "/home/fuqian/Documents/Research/Multi-Modal-Pretraining/2020-UNITER-ECCV/UNITER-Paddle/utils/io_utils.py", line 72, in paddle_gather
      index_flatten = index.flatten()
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/tensor/manipulation.py", line 1497, in flatten
      "stop_axis": stop_axis
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/layer_helper.py", line 45, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/framework.py", line 3828, in append_op
      attrs=kwargs.get("attrs", None))
    File "/home/fuqian/Downloads/Software/anaconda3/envs/2020-UNITER-ECCV/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2736, in __init__
      for frame in traceback.extract_stack():

    ExternalError: CUDA error(700), an illegal memory access was encountered.
      [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:252)
      [operator < flatten_contiguous_range > error]
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/fluid/platform/device/gpu/gpu_info.cc:289)

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::AnalysisPredictor::~AnalysisPredictor()
1   paddle::AnalysisPredictor::~AnalysisPredictor()
2   paddle::memory::allocation::StreamSafeCUDAAllocator::ReleaseImpl(phi::Place const&)
3   paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()
4   paddle::memory::allocation::CUDAAllocator::FreeImpl(phi::Allocation*)
5   paddle::platform::RecordedGpuMallocHelper::Free(void*, unsigned long)

----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1663814441 (unix time) try "date -d @1663814441" if you are using GNU date ***]
  [SignalInfo: *** SIGABRT (@0x3e80000745b) received by PID 29787 (TID 0x7f417c7a30c0) from PID 29787 ***]
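
CUDA kernel launches are asynchronous, so the operator named in the message (`flatten_contiguous_range`, reached from `index.flatten()` in `paddle_gather`) is not necessarily the one that performed the invalid access. One common cause of this class of error is a gather or embedding index that falls outside the valid range; the issue does not confirm that this is the cause here, and the body of `paddle_gather` is not shown. As a minimal sketch under that assumption, the index values could be checked on the CPU before running on the GPU. The helper names `flatten_values` and `find_out_of_range` are hypothetical and not part of Paddle or the UNITER-Paddle repository.

```python
def flatten_values(nested):
    """Yield every scalar from an arbitrarily nested list or tuple."""
    for item in nested:
        if isinstance(item, (list, tuple)):
            yield from flatten_values(item)
        else:
            yield item


def find_out_of_range(index, dim_size):
    """Return the index values that fall outside the range [0, dim_size).

    `index` is a (possibly nested) Python list of integers and `dim_size`
    is the length of the axis being gathered from. Both are assumptions
    about how the gather is used; adapt them to the real call site.
    """
    return [i for i in flatten_values(index) if not 0 <= i < dim_size]


# Illustrative values only, not taken from the issue.
bad = find_out_of_range([[0, 1], [2, 5]], dim_size=4)
print(bad)
```

With a real tensor, the nested list could be obtained on the CPU with something like `index.numpy().tolist()`, and `dim_size` would be the length of the gathered axis of the concatenated text and image embeddings. An empty result would suggest looking elsewhere for the cause.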

Code repository: https://github.com/Mixture-of-Rookie/UNITER-Paddle

To reproduce:

# 1. Clone the code
git clone https://github.com/Mixture-of-Rookie/UNITER-Paddle.git
cd UNITER-Paddle
export PYTHONPATH=$PWD:$PYTHONPATH

# 2. Set up the environment
pip install -r requirements.txt
# Install the develop version of paddlepaddle

# 3. Convert the model from dynamic graph to static graph (success)
python tools/export_model.py --cfg_file configs/retrieval_train_lite.yaml

# 4. Run inference (error)
python deploy/inference_python/infer.py --use-gpu True

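
When step 4 fails with an illegal memory access, a usual first diagnostic is to repeat it with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so that the error is more likely to be reported at the operator that caused it. The sketch below only wraps the command from step 4; the helper names `blocking_env`, `build_command`, and `rerun_inference` are hypothetical, and nothing here is confirmed as the fix for this issue.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_command(script="deploy/inference_python/infer.py"):
    """Build the step 4 command line; the script path is taken from the issue."""
    return [sys.executable, script, "--use-gpu", "True"]


def rerun_inference():
    """Run step 4 with blocking launches and return the process exit code."""
    result = subprocess.run(build_command(), env=blocking_env(), check=False)
    return result.returncode


# To use it, call rerun_inference() from the repository root after step 3.
```

The traceback printed by such a run may name a different operator than the original one; that operator is then the better starting point for debugging.
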
paddle-bot[bot] commented 2 years ago

Hi! We've received your issue; please be patient while you wait for a response. We will arrange for technical staff to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and error messages. You may also consult the official API documentation, the FAQ, past GitHub issues, and the AI community for answers. Have a nice day!