PaddlePaddle / Paddle

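PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)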
http://www.paddlepaddle.org/
Apache License 2.0

The Linux precompiled library for Jetson Orin Nano installs successfully, but run_check() keeps raising an error #63244

Closed MarsChange closed 6 months ago

MarsChange commented 6 months ago

Describe the Bug

Here is the code

import paddle
paddle.utils.run_check()

Here is the bug.

Running verify PaddlePaddle program ... 
Traceback (most recent call last):
  File "/home/jetson/smartcar/ernie.py", line 79, in <module>
    paddle.utils.run_check()
  File "/home/jetson/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 273, in run_check
    _run_static_single(use_cuda, use_xpu, use_custom, custom_device_name)
  File "/home/jetson/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 136, in _run_static_single
    param_grads = paddle.static.append_backward(
  File "/home/jetson/.local/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/jetson/.local/lib/python3.8/site-packages/paddle/base/wrapped_decorator.py", line 26, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/jetson/.local/lib/python3.8/site-packages/paddle/base/framework.py", line 617, in __impl__
    return func(*args, **kwargs)
  File "/home/jetson/.local/lib/python3.8/site-packages/paddle/base/backward.py", line 2215, in append_backward
    _append_backward_vars_(
  File "/home/jetson/.local/lib/python3.8/site-packages/paddle/base/backward.py", line 1788, in _append_backward_vars_
    op_desc.infer_shape(block.desc)
RuntimeError: (NotFound) The kernel `matmul_grad` is not registered.
  [Hint: Expected iter != kernels_.end(), but received iter == kernels_.end().] (at /home/paddle/data/xly/workspace/24116/Paddle/paddle/phi/core/kernel_factory.cc:349)
  [operator < matmul_v2_grad > error]

Additional Supplementary Information

No response

lappun commented 6 months ago

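I have the same problem.

I tested the precompiled library: https://paddle-inference-lib.bj.bcebos.com/2.6.0/python/Jetson/jetpack5.0.2_gcc9.4/orin/paddlepaddle_gpu-2.6.0-cp38-cp38-linux_aarch64.whl

I also tried building on the Jetson Orin itself; see https://github.com/lappun/PaddleOCR_testing/blob/master/setup.ipynb for details.

The result is the same.

Here is the run_check() output (it ran for seven and a half minutes before raising the error):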

Running verify PaddlePaddle program ... 
I0406 03:29:15.610915 429179 program_interpreter.cc:212] New Executor is Running.
W0406 03:29:15.611184 429179 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 11.4, Runtime API Version: 11.4
W0406 03:29:15.616206 429179 gpu_resources.cc:164] device: 0, cuDNN Version: 8.4.
I0406 03:36:26.023401 429179 interpreter_util.cc:624] Standalone Executor is Used.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/paddle/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 273, in run_check
    _run_static_single(use_cuda, use_xpu, use_custom, custom_device_name)
  File "/home/paddle/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 151, in _run_static_single
    exe.run(
  File "/home/paddle/.local/lib/python3.8/site-packages/paddle/base/executor.py", line 1746, in run
    res = self._run_impl(
  File "/home/paddle/.local/lib/python3.8/site-packages/paddle/base/executor.py", line 1952, in _run_impl
    ret = new_exe.run(
  File "/home/paddle/.local/lib/python3.8/site-packages/paddle/base/executor.py", line 831, in run
    tensors = self._new_exe.run(
OSError: In user code:

    File "<string>", line 1, in <module>

    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 273, in run_check
      _run_static_single(use_cuda, use_xpu, use_custom, custom_device_name)
    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 135, in _run_static_single
      input, out, weight = _simple_network()
    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/utils/install_check.py", line 37, in _simple_network
      linear_out = paddle.nn.functional.linear(x=input, weight=weight, bias=bias)
    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/nn/functional/common.py", line 1985, in linear
      helper.append_op(
    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/base/layer_helper.py", line 44, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/base/framework.py", line 4467, in append_op
      op = Operator(
    File "/home/paddle/.local/lib/python3.8/site-packages/paddle/base/framework.py", line 3016, in __init__
      for frame in traceback.extract_stack():

    ExternalError: CUBLAS error(13). 
      [Hint: 'CUBLAS_STATUS_EXECUTION_FAILED'.  The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons.  To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. ] (at /paddle/build/Paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:41)
      [operator < matmul_v2_grad > error]

Here is the result of the ctc_loss test (it succeeded immediately):

W0406 03:46:40.800690 429267 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 11.4, Runtime API Version: 11.4
W0406 03:46:40.806195 429267 gpu_resources.cc:164] device: 0, cuDNN Version: 8.4.
Tensor(shape=[2], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [3.91798496, 2.90765190])
Tensor(shape=[], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       1.13760614)

Here is the result of the matmul test (it took seven and a half minutes to succeed):

W0406 03:48:13.081670 429350 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.7, Driver API Version: 11.4, Runtime API Version: 11.4
W0406 03:48:13.086966 429350 gpu_resources.cc:164] device: 0, cuDNN Version: 8.4.
[]
[10]
[10, 5]
[10, 5, 5]
[10, 3, 5, 5]
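The timings above (ctc_loss returns immediately, while run_check() and the matmul test each take about seven and a half minutes) do not show whether the delay is paid once or on every call. A minimal sketch for separating the two, assuming nothing about the cause; `time_first_and_repeats` is a hypothetical helper name, not part of Paddle:

```python
import time


def time_first_and_repeats(fn, repeats=3):
    """Time the first call of a zero-argument callable separately from later calls.

    Returns (first_call_seconds, [repeat_seconds, ...]).
    """
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    rest = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        rest.append(time.perf_counter() - start)
    return first, rest


# Hypothetical usage on the Jetson (requires paddle, so it is left commented out):
# import paddle
# x, y = paddle.rand([10, 5]), paddle.rand([5])
# first, rest = time_first_and_repeats(lambda: paddle.matmul(x, y).numpy())
# print(first, rest)
```

Calling `.numpy()` inside the lambda copies the result back to the host, so the measurement should include the GPU work rather than only the kernel launch. A large first value with small repeats would point to a one-time start-up cost; uniformly large values would point elsewhere.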

EmmonsCurse commented 6 months ago

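@MarsChange The Jetson inference library prunes kernels unrelated to inference at build time to reduce the size of the library; see #53369. In addition, matmul_grad computes the gradient of the matrix multiplication operation. During training, parameters are updated using the gradients of the loss function with respect to the model parameters, so that the model is gradually optimized. During inference, when an already trained model is used for prediction, gradients are not needed. Inference therefore normally never invokes an op such as matmul_grad and predicts directly from the forward-pass result, so this error does not affect use of the inference library.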
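Following the explanation above, a failure of run_check() that only names a missing `*_grad` kernel can be told apart from other failures by inspecting the error message. The sketch below is an illustration built on the message format shown in the traceback in this issue; both function names are hypothetical, and the check callable is passed in rather than imported so the helper itself does not depend on Paddle:

```python
import re

# Message format taken from the traceback in this issue:
#   (NotFound) The kernel `matmul_grad` is not registered.
_MISSING_KERNEL = re.compile(r"The kernel `(\w+)` is not registered")


def is_missing_grad_kernel(message):
    """Return True if the message reports an unregistered kernel ending in '_grad'."""
    match = _MISSING_KERNEL.search(message)
    return bool(match) and match.group(1).endswith("_grad")


def run_check_tolerating_pruned_grad(run_check):
    """Call run_check(); treat a missing gradient kernel as expected, re-raise anything else."""
    try:
        run_check()
    except RuntimeError as err:
        if is_missing_grad_kernel(str(err)):
            return "grad kernel missing (expected on an inference-only build)"
        raise
    return "ok"
```

On the device this would be called as `run_check_tolerating_pruned_grad(paddle.utils.run_check)`. Other errors, such as the CUBLAS failure reported earlier in the thread, are deliberately not swallowed.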

EmmonsCurse commented 6 months ago

@lappun 1. For the paddle.utils.run_check() error, see the explanation given to @MarsChange above: the Jetson inference library prunes kernels unrelated to inference at build time (see #53369), and matmul_grad is only needed to compute gradients during training, so the error does not affect use of the inference library.

2. I also verified the test_matmul issue. Note that the API "paddle.device.cuda.stream_guard" is deprecated since 2.5.0 and will be removed in future versions; please use "paddle.device.stream_guard" instead.

# test_matmul.py
import time
import paddle
paddle.device.set_device('gpu')

s = paddle.device.Stream()

start_time = time.time()
print("start_time:", start_time)

# vector * vector
x = paddle.rand([10])
y = paddle.rand([10])
with paddle.device.stream_guard(s):
    z = paddle.matmul(x, y)
print(z.shape)

# matrix * vector
x = paddle.rand([10, 5])
y = paddle.rand([5])
with paddle.device.stream_guard(s):
    z = paddle.matmul(x, y)
print(z.shape)

# batched matrix * broadcasted vector
x = paddle.rand([10, 5, 2])
y = paddle.rand([2])
with paddle.device.stream_guard(s):
    z = paddle.matmul(x, y)
print(z.shape)

# batched matrix * batched matrix
x = paddle.rand([10, 5, 2])
y = paddle.rand([10, 2, 5])
with paddle.device.stream_guard(s):
    z = paddle.matmul(x, y)
print(z.shape)

# batched matrix * broadcasted matrix
x = paddle.rand([10, 1, 5, 2])
y = paddle.rand([1, 3, 2, 5])
with paddle.device.stream_guard(s):
    z = paddle.matmul(x, y)
print(z.shape)

end_time = time.time()
print("end_time:", end_time)

total_time = end_time - start_time
print("Total run_time: {:.2f} seconds".format(total_time))

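With the modified unit test, the script runs normally on my machine (Volta architecture), so it may be worth checking whether the machine architecture is what makes the difference: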

root@paddle-jp502-2:/home/paddle/data/yubaoku/issue_debug# python test_matmul.py 
start_time: 1712493604.1947494
W0407 20:40:04.195492 1831309 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.2, Driver API Version: 11.4, Runtime API Version: 11.4
W0407 20:40:04.208076 1831309 gpu_resources.cc:164] device: 0, cuDNN Version: 8.4.
[]
[10]
[10, 5]
[10, 5, 5]
[10, 3, 5, 5]
end_time: 1712493605.480865
Total run_time: 1.29 seconds
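The two machines in this thread report different compute capabilities in their `gpu_resources.cc` log lines (8.7 on the Orin, 7.2 on the Volta machine). When comparing runs across devices, that value can be extracted from a captured log. A small sketch, assuming the log line format shown in this issue; `compute_capability` is a hypothetical helper name:

```python
import re

# Matches the fragment "GPU Compute Capability: 8.7" from the gpu_resources.cc log line.
_COMPUTE_CAPABILITY = re.compile(r"GPU Compute Capability: (\d+)\.(\d+)")


def compute_capability(log_text):
    """Return (major, minor) from a Paddle start-up log, or None if it is not present."""
    match = _COMPUTE_CAPABILITY.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


orin_line = (
    "W0406 03:48:13.081670 429350 gpu_resources.cc:119] Please NOTE: device: 0, "
    "GPU Compute Capability: 8.7, Driver API Version: 11.4, Runtime API Version: 11.4"
)
volta_line = (
    "W0407 20:40:04.195492 1831309 gpu_resources.cc:119] Please NOTE: device: 0, "
    "GPU Compute Capability: 7.2, Driver API Version: 11.4, Runtime API Version: 11.4"
)
print(compute_capability(orin_line), compute_capability(volta_line))
```

This only makes the architecture difference explicit; it does not by itself establish that the difference explains the long run time on the Orin.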