PaddlePaddle / Paddle

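PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)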
http://www.paddlepaddle.org/
Apache License 2.0
22.05k stars 5.54k forks

paddle.utils.run_check() reports a segmentation fault (core dumped), please help #38266

Closed ToscanaGoGithub closed 1 year ago

ToscanaGoGithub commented 2 years ago

System version:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

Linux kernel version: linux-image-5.11.0-27-generic

CUDA version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

GPU driver version:
nvidia-dkms-470 470.86-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
nvidia-driver-470 470.86-0ubuntu0.20.04.1 amd64 NVIDIA driver metapackage

Error message when running paddle.utils.run_check():
Running verify PaddlePaddle program ...
W1219 20:07:08.779233 31712 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 11.2


C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.


Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1639915628 (unix time) try "date -d @1639915628" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 31712 (TID 0x7f2d1f1c4740) from PID 0 ]

Segmentation fault (core dumped)

paddle-bot-old[bot] commented 2 years ago

Hi! We have received your issue; please be patient while we arrange for technical staff to respond as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You can also consult the official API documentation, the FAQ, past GitHub Issues, and the AI community for answers. Have a nice day!

ToscanaGoGithub commented 2 years ago

Python version: Python 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0] :: Anaconda, Inc. on linux

Paddle version: paddlepaddle-gpu==2.2.1.post111

XieYunshen commented 2 years ago

What is your cuDNN version?

RainFrost1 commented 2 years ago

Before running, set the environment variable with export FLAGS_call_stack_level=2, then run again and check the detailed error output.
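The suggestion above can also be scripted so that a crash in the check does not take down the calling process. The following is a minimal sketch, not an official Paddle tool: the helper names `run_in_child` and `describe_exit` are hypothetical, and the only Paddle-specific pieces taken from this thread are the `FLAGS_call_stack_level` variable and the `paddle.utils.run_check()` call. It runs the check in a child interpreter with the flag set and reports whether the child was killed by a signal such as SIGSEGV.

```python
import os
import signal
import subprocess
import sys


def run_in_child(code, extra_env=None, timeout=300):
    """Run a Python snippet in a child interpreter with extra environment variables."""
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


def describe_exit(returncode):
    """On POSIX, a negative return code means the child was killed by that signal."""
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # The snippet and the flag come from the thread; everything else is illustrative.
    check = "import paddle; paddle.utils.run_check()"
    rc, out, err = run_in_child(check, {"FLAGS_call_stack_level": "2"})
    print(describe_exit(rc))
    print(out)
    print(err, file=sys.stderr)
```

Because the environment variable is set before the child interpreter starts, it is in place before `paddle` is imported, which matches exporting it in the shell first.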

ToscanaGoGithub commented 2 years ago

zhangwei@2080s:~$ export FLAGS_call_stack_level=2
zhangwei@2080s:~$ python
Python 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W1220 16:59:03.243978 390281 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 11.2

C++ Traceback (most recent call last):

No stack trace in paddle, may be caused by external reasons.

Error Message Summary:

FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: Aborted at 1639990743 (unix time) try "date -d @1639990743" if you are using GNU date ]
[SignalInfo: SIGSEGV (@0x0) received by PID 390281 (TID 0x7f8ee93c9740) from PID 0 ]

Segmentation fault (core dumped)
zhangwei@2080s:~$

After setting the variable, I still get the same problem when running the check.

ToscanaGoGithub commented 2 years ago

>>> import torch
>>> print(torch.backends.cudnn.version())
8005

Querying through PyTorch reports 8005, but the following command prints nothing:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
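A possible reason the grep prints nothing: from cuDNN 8 onward the version macros are defined in cudnn_version.h rather than cudnn.h, and the command also prints nothing if the header is simply not installed at that path. Note too that PyTorch binaries usually bundle their own cuDNN, so the 8005 reported by PyTorch does not necessarily describe a system-wide install. The sketch below, with hypothetical helper names `parse_cudnn_version` and `find_cudnn_version` and assumed default include directories, looks in both headers and parses the three version macros.

```python
import re
from pathlib import Path

_MACROS = ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL")


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from header text, or None if a macro is missing."""
    parts = []
    for macro in _MACROS:
        match = re.search(
            rf"^\s*#\s*define\s+{macro}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def find_cudnn_version(include_dirs=("/usr/local/cuda/include", "/usr/include")):
    """Search assumed include directories; cudnn_version.h is tried before cudnn.h."""
    for directory in include_dirs:
        for name in ("cudnn_version.h", "cudnn.h"):
            path = Path(directory) / name
            if path.is_file():
                version = parse_cudnn_version(path.read_text(errors="ignore"))
                if version is not None:
                    return version, str(path)
    return None


# Illustrative header content, not copied from a real installation.
sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
"""
print(parse_cudnn_version(sample))  # (8, 0, 5)
print(find_cudnn_version())         # None if no header is found in the assumed paths
```

The value 8005 follows the scheme major * 1000 + minor * 100 + patch, so it corresponds to cuDNN 8.0.5.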

RainFrost1 commented 2 years ago

Could you try a different version of Paddle?

ToscanaGoGithub commented 2 years ago

I have tried several versions; they all report the same error.

RainFrost1 commented 2 years ago

Does the CPU version also report the error?

ToscanaGoGithub commented 2 years ago

> Does the CPU version also report the error?

>>> import paddle
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
PaddlePaddle works well on 1 CPU.
W1221 10:40:38.772401 430183 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
PaddlePaddle works well on 2 CPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

I tried it; the CPU version works without problems.

RainFrost1 commented 2 years ago

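This looks like a problem with the CUDA/cuDNN environment configuration, which makes it hard to pinpoint. You could try the Docker environment from the official Paddle website.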
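One quick sanity check for an environment problem of this kind is to compare the versions that run_check prints in its device_context warning. The sketch below is only an illustration: the helper names `parse_device_warning` and `driver_supports_runtime` are hypothetical, and the regular expression assumes the exact wording of the warning line shown earlier in this thread.

```python
import re

# Matches the warning format seen in this thread; other Paddle versions may differ.
WARNING_RE = re.compile(
    r"GPU Compute Capability: (?P<cc>[\d.]+), "
    r"Driver API Version: (?P<driver>[\d.]+), "
    r"Runtime API Version: (?P<runtime>[\d.]+)"
)


def parse_device_warning(line):
    """Extract the three versions as integer tuples, or return None if not found."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return {
        key: tuple(int(part) for part in value.split("."))
        for key, value in match.groupdict().items()
    }


def driver_supports_runtime(info):
    """The driver API version should not be older than the runtime API version."""
    return info["driver"] >= info["runtime"]


line = (
    "W1219 20:07:08.779233 31712 device_context.cc:447] Please NOTE: device: 0, "
    "GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 11.2"
)
info = parse_device_warning(line)
print(info)
print(driver_supports_runtime(info))  # True
```

For the log in this thread the driver API version (11.4) is not older than the runtime version (11.2), so this check alone does not explain the crash; the remaining suspects are other parts of the CUDA/cuDNN setup.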

ToscanaGoGithub commented 2 years ago

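> This looks like a problem with the CUDA/cuDNN environment configuration, which makes it hard to pinpoint. You could try the Docker environment from the official Paddle website.

OK, thanks for the reply.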

YFAN1020 commented 6 months ago

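>> This looks like a problem with the CUDA/cuDNN environment configuration, which makes it hard to pinpoint. You could try the Docker environment from the official Paddle website.

> OK, thanks for the reply.

Has this been resolved?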