paddlepaddle-gpu 2.0.0rc1 reports FatalError: `Segmentation fault` is detected by the operating system.

xiulianzw commented 3 years ago

I'm using the latest PaddleOCR from git. Running python tools/infer/predict_system.py fails with the following error:

C++ Traceback (most recent call last):
0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString()

Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
  [TimeInfo: Aborted at 1609724467 (unix time) try "date -d @1609724467" if you are using GNU date ]
  [SignalInfo: SIGSEGV (@0x0) received by PID 127353 (TID 0x7f4aa7f1d700) from PID 0 ]
Segmentation fault (core dumped)
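The error summary carries a Unix timestamp, the signal name, the fault address and the PID, which can help match a crash to other logs. The following is a minimal sketch, using only the Python standard library, that pulls those fields out of the message text. The function name `parse_crash_summary` and the dictionary keys are illustrative assumptions, not part of PaddlePaddle.

```python
import re
from datetime import datetime, timezone

# Sample text taken from the error summary in this report.
CRASH_TEXT = (
    "FatalError: Segmentation fault is detected by the operating system. "
    '[TimeInfo: Aborted at 1609724467 (unix time) try "date -d @1609724467" '
    "if you are using GNU date ] "
    "[SignalInfo: SIGSEGV (@0x0) received by PID 127353 "
    "(TID 0x7f4aa7f1d700) from PID 0 ]"
)


def parse_crash_summary(text):
    """Extract crash time (UTC), signal, fault address and PID, or None if absent."""
    time_match = re.search(r"Aborted at (\d+) \(unix time\)", text)
    signal_match = re.search(
        r"(SIG[A-Z]+) \(@(0x[0-9a-fA-F]+)\) received by PID (\d+)", text
    )
    if not time_match or not signal_match:
        return None
    # Interpret the timestamp as UTC so the result does not depend on the local zone.
    crashed_at = datetime.fromtimestamp(int(time_match.group(1)), tz=timezone.utc)
    return {
        "crashed_at_utc": crashed_at.isoformat(),
        "signal": signal_match.group(1),
        "fault_address": signal_match.group(2),
        "pid": int(signal_match.group(3)),
    }


if __name__ == "__main__":
    print(parse_crash_summary(CRASH_TEXT))
```

A fault address of 0x0 together with SIGSEGV indicates a null-pointer access, but it does not say which library caused it, so the version checks discussed later in the thread are still needed.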

The output of paddle.utils.run_check() is as follows:

import paddle
paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0104 09:50:08.441300 127586 device_context.cc:320] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.0
W0104 09:50:08.444324 127586 device_context.cc:330] device: 0, cuDNN Version: 8.0.
PaddlePaddle works well on 1 GPU.
W0104 09:50:10.058878 127586 parallel_executor.cc:491] Cannot enable P2P access from 0 to 2
W0104 09:50:10.058951 127586 parallel_executor.cc:491] Cannot enable P2P access from 0 to 3
W0104 09:50:10.799384 127586 parallel_executor.cc:491] Cannot enable P2P access from 1 to 2
W0104 09:50:10.799430 127586 parallel_executor.cc:491] Cannot enable P2P access from 1 to 3
W0104 09:50:10.799440 127586 parallel_executor.cc:491] Cannot enable P2P access from 2 to 0
W0104 09:50:10.799450 127586 parallel_executor.cc:491] Cannot enable P2P access from 2 to 1
W0104 09:50:11.883519 127586 parallel_executor.cc:491] Cannot enable P2P access from 3 to 0
W0104 09:50:11.883584 127586 parallel_executor.cc:491] Cannot enable P2P access from 3 to 1
W0104 09:50:15.108191 127586 fuse_all_reduce_op_pass.cc:75] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
PaddlePaddle works well on 4 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
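Since the rest of this thread turns on which CUDA and cuDNN versions are in use, it may help to pull them out of the device_context.cc warning lines automatically. The sketch below uses only the standard library. The function name `extract_gpu_versions` and the dictionary keys are assumptions made for illustration, and the patterns are based solely on the two log lines in this report.

```python
import re

# The two warning lines from the run_check() output in this report.
LOG_LINES = [
    "W0104 09:50:08.441300 127586 device_context.cc:320] Please NOTE: device: 0, "
    "GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.0",
    "W0104 09:50:08.444324 127586 device_context.cc:330] device: 0, cuDNN Version: 8.0.",
]

# Label text as it appears in the log, mapped to illustrative key names.
PATTERNS = {
    "compute_capability": r"GPU Compute Capability: (\d+\.\d+)",
    "driver_api": r"Driver API Version: (\d+\.\d+)",
    "runtime_api": r"Runtime API Version: (\d+\.\d+)",
    "cudnn": r"cuDNN Version: (\d+\.\d+)",
}


def extract_gpu_versions(lines):
    """Collect the version strings reported in device_context warning lines."""
    versions = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = re.search(pattern, line)
            if match:
                versions[key] = match.group(1)
    return versions


if __name__ == "__main__":
    for key, value in extract_gpu_versions(LOG_LINES).items():
        print(f"{key}: {value}")
```

For this report the extracted values are runtime API 10.0 and cuDNN 8.0, which is the kind of pairing worth comparing against the installed paddlepaddle-gpu build.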

Environment: Python 3.8.5. I also tested with 3.7 and got the same error.

Package Version

alabaster 0.7.12 anaconda-client 1.7.2 anaconda-navigator 1.9.12 anaconda-project 0.8.3 appdirs 1.4.4 asn1crypto 1.4.0 astor 0.8.1 astroid 2.4.2 astropy 4.0.1.post1 atomicwrites 1.4.0 attrs 20.1.0 Babel 2.8.0 backcall 0.2.0 backports.functools-lru-cache 1.6.1 backports.shutil-get-terminal-size 1.0.0 backports.tempfile 1.0 backports.weakref 1.0.post1 bce-python-sdk 0.8.53 beautifulsoup4 4.9.1 bitarray 1.5.3 bkcharts 0.2 bokeh 2.2.1 boto 2.49.0 Bottleneck 1.3.2 brotlipy 0.7.0 certifi 2020.6.20 cffi 1.14.2 cfgv 3.2.0 chardet 3.0.4 cliapp 1.0.9 click 7.1.2 cloudpickle 1.6.0 clyent 1.2.2 colorama 0.4.3 conda 4.8.4 conda-build 3.20.2 conda-package-handling 1.7.0 conda-verify 3.4.2 contextlib2 0.6.0.post1 cryptography 3.1 cycler 0.10.0 Cython 0.29.21 cytoolz 0.10.1 dask 2.25.0 datashape 0.5.4 decorator 4.4.2 distlib 0.3.1 distributed 2.25.0 docutils 0.16 entrypoints 0.3 et-xmlfile 1.0.1 fastcache 1.1.0 filelock 3.0.12 flake8 3.8.4 Flask 1.1.2 Flask-Babel 2.0.0 Flask-Cors 3.0.9 fsspec 0.8.0 future 0.18.2 gast 0.3.3 gevent 20.6.2 glob2 0.7 gmpy2 2.0.8 greenlet 0.4.16 h5py 2.10.0 HeapDict 1.0.1 html5lib 1.1 hypothesis 5.29.0 identify 1.5.10 idna 2.10 imageio 2.9.0 imagesize 1.2.0 imgaug 0.4.0 importlib-metadata 1.7.0 ipykernel 5.3.4 ipython 7.18.1 ipython-genutils 0.2.0 isort 5.4.2 itsdangerous 1.1.0 jdcal 1.4.1 jedi 0.17.2 Jinja2 2.11.2 joblib 0.16.0 jsonschema 3.2.0 jupyter-client 6.1.6 jupyter-console 6.2.0 jupyter-core 4.6.3 kiwisolver 1.2.0 lazy-object-proxy 1.4.3 libarchive-c 2.9 llvmlite 0.34.0 lmdb 1.0.0 locket 0.2.0 lxml 4.5.2 MarkupSafe 1.1.1 matplotlib 3.3.1 mccabe 0.6.1 mistune 0.8.4 mkl-fft 1.1.0 mkl-random 1.1.1 mkl-service 2.3.0 mock 4.0.2 more-itertools 8.5.0 mpmath 1.1.0 msgpack 1.0.0 multipledispatch 0.6.0 navigator-updater 0.2.1 nbformat 5.0.7 networkx 2.5 nltk 3.5 nodeenv 1.5.0 nose 1.3.7 numba 0.51.2 numexpr 2.7.1 numpy 1.19.1 numpydoc 1.1.0 odo 0.5.1 olefile 0.46 opencv-python 4.2.0.32 openpyxl 3.0.5 packaging 20.4 paddlepaddle-gpu 2.0.0rc1.post100 
pandas 1.1.1 pandocfilters 1.4.2 parso 0.7.0 partd 1.1.0 path 15.0.0 pathlib2 2.3.5 patsy 0.5.1 pep8 1.7.1 pexpect 4.8.0 pickleshare 0.7.5 Pillow 7.2.0 pip 20.2.2 pkginfo 1.5.0.1 pluggy 0.13.1 ply 3.11 pre-commit 2.9.3 prompt-toolkit 3.0.7 protobuf 3.14.0 psutil 5.7.2 ptyprocess 0.6.0 py 1.9.0 pyclipper 1.2.1 pycodestyle 2.6.0 pycosat 0.6.3 pycparser 2.20 pycrypto 2.6.1 pycryptodome 3.9.9 pycurl 7.43.0.5 pyflakes 2.2.0 Pygments 2.6.1 pylint 2.6.0 pyodbc 4.0.0-unsupported pyOpenSSL 19.1.0 pyparsing 2.4.7 pyrsistent 0.16.0 PySocks 1.7.1 pytest 5.0.0 pytest-arraydiff 0.2 pytest-astropy 0.8.0 pytest-astropy-header 0.1.2 pytest-doctestplus 0.8.0 pytest-openfiles 0.5.0 pytest-remotedata 0.3.2 python-dateutil 2.8.1 python-Levenshtein 0.12.0 pytz 2020.1 PyWavelets 1.1.1 PyYAML 5.3.1 pyzmq 18.1.1 QtAwesome 0.7.2 qtconsole 4.7.6 QtPy 1.9.0 regex 2020.7.14 requests 2.24.0 rope 0.17.0 ruamel-yaml 0.15.87 scikit-image 0.16.2 scikit-learn 0.23.2 scipy 1.5.2 seaborn 0.10.1 Send2Trash 1.5.0 setuptools 49.6.0.post20200814 Shapely 1.7.1 simplegeneric 0.8.1 singledispatch 3.4.0.3 sip 4.19.13 six 1.15.0 snowballstemmer 2.0.0 sortedcollections 1.2.1 sortedcontainers 2.2.2 soupsieve 2.0.1 Sphinx 3.2.1 sphinxcontrib-applehelp 1.0.2 sphinxcontrib-devhelp 1.0.2 sphinxcontrib-htmlhelp 1.0.3 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.3 sphinxcontrib-serializinghtml 1.1.4 sphinxcontrib-websupport 1.2.4 SQLAlchemy 1.3.19 statsmodels 0.11.1 sympy 1.5.1 tables 3.6.1 tblib 1.7.0 terminado 0.8.3 testpath 0.4.4 threadpoolctl 2.1.0 toml 0.10.1 toolz 0.10.0 tornado 6.0.4 tqdm 4.48.2 traitlets 4.3.3 typing-extensions 3.7.4.3 unicodecsv 0.14.1 urllib3 1.25.10 virtualenv 20.2.2 visualdl 2.1.0 wcwidth 0.2.5 webencodings 0.5.1 Werkzeug 1.0.1 wheel 0.35.1 wrapt 1.11.2 xlrd 1.2.0 XlsxWriter 1.3.3 xlwt 1.3.0 xmltodict 0.12.0 zict 2.0.0 zipp 3.1.0 zope.event 4.4 zope.interface 5.1.0

With the previous version, 1.8.5, the same test runs without problems.

duohaoxue commented 3 years ago

Same here. Have you solved it?

xiulianzw commented 3 years ago

The CPU version works, or you can use the previous version; 1.8.5 also works. It's probably a bug in the latest version. @duohaoxue

yxd117 commented 3 years ago

InvalidArgumentError: The input tensor's dimension should be equal to the axis's size. But received input tensor's dimension is 4, axis's size is 3
  [Hint: Expected x_rank == axis_size, but received x_rank:4 != axis_size:3.] (at /paddle/paddle/fluid/operators/transpose_op.cc:47)
  [Hint: If you need C++ stacktraces for debugging, please set FLAGS_call_stack_level=2.]

I have the same problem. After adding the option --use_gpu=False, I get the error above.

WenmuZhou commented 3 years ago

@xiulianzw What are your CUDA and cuDNN versions? Are you running the dynamic graph version?

xiulianzw commented 3 years ago

Did you install the CPU version? @yxd117

xiulianzw commented 3 years ago

The GPU version. I tested the CPU version and it works fine. @WenmuZhou

yxd117 commented 3 years ago

Quoting xiulianzw: "Did you install the CPU version? @yxd117"

The GPU version, '2.0.0-rc1'. It should be the same problem.

yangy996 commented 3 years ago

I get this error too. Have you solved it?

xiulianzw commented 3 years ago

Print your paddlepaddle-gpu installation info and check whether the cuDNN version is 7.6.5. @YY007H

yxd117 commented 3 years ago

Quoting xiulianzw: "Print your paddlepaddle-gpu installation info and check whether the cuDNN version is 7.6.5. @YY007H"

Thanks a lot. I downgraded my cuDNN from 8.0.5 to 7.6.5 and the error is gone.

yangy996 commented 3 years ago

I upgraded CUDA to 11.0 and cuDNN to 8.0, and then it worked...

yxd117 commented 3 years ago

For me, CUDA 10.2 + cuDNN 8.0.5 does not work, while CUDA 10.2 + cuDNN 7.6.5 works fine. I don't know whether I installed cuDNN 8 incorrectly.

wa3926 commented 3 years ago

@YY007H CUDA 11.0 + cuDNN 8.0 does not work for me. Which operating system are you using?

yangy996 commented 3 years ago

Quoting wa3926: "@YY007H CUDA 11.0 + cuDNN 8.0 does not work for me. Which operating system are you using?"

I'm using Ubuntu 20.04 with Python 3.8.

wa3926 commented 3 years ago

@YY007H I'm on a CentOS 7.9 server, and the GPU driver version is 11.2. I don't know whether it's a driver problem. I've been stuck on this bug for several days. An old server with CUDA 10.2 and cuDNN 7.6 has no problem. I don't know whether there is any other way to fix this error.

C++ Traceback (most recent call last):
0 paddle::framework::SignalHandle(char const*, int)
1 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
  [TimeInfo: Aborted at 1611540174 (unix time) try "date -d @1611540174" if you are using GNU date ]
  [SignalInfo: SIGSEGV (@0x0) received by PID 3564 (TID 0x7f8c1e82f740) from PID 0 ]

If I can't get this working, I'll probably have to reinstall the system.

yangy996 commented 3 years ago

Replying to wa3926's comment above (CentOS 7.9 server, driver version 11.2):

If I remember correctly, CentOS 7.9 can't run a version that high. You need to downgrade.

jey07 commented 3 years ago

I am also getting a similar error with the versions below:

W0217 12:22:39.872664  1972 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.2, Runtime API Version: 10.2
W0217 12:22:40.391552  1972 device_context.cc:372] device: 0, cuDNN Version: 8.1.

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1613564641 (unix time) try "date -d @1613564641" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 1972 (TID 0x7f344ad72740) from PID 0 ***]
Segmentation fault (core dumped)

xiulianzw commented 3 years ago

It's a cuDNN version problem. Try switching to 7.5? I also got rid of the error later by changing the cuDNN version.

jey07 commented 3 years ago

I am running on a Google Cloud VM instance, so I am not sure whether I can change the cuDNN version.

xiulianzw commented 3 years ago

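Are you using the 2.0.0 stable release now? I was using 2.0rc1 before; maybe it has already been fixed in the stable release. On Google Cloud, I don't know whether you can install CUDA. If you can, install CUDA and cuDNN yourself and then configure them in ~/.bashrc. If you only need to deploy the project, consider using a Docker image directly.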
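After installing cuDNN by hand and editing ~/.bashrc, it can be useful to confirm which cuDNN library files are actually reachable. Here is a rough sketch, assuming the common Linux convention that cuDNN ships as files named libcudnn.so.X.Y.Z and that the library directories are listed in LD_LIBRARY_PATH. Both function names are hypothetical helpers, and the dynamic loader can also find libraries in other places (for example system default directories), so an empty result is not proof that cuDNN is missing.

```python
import os
import re

# Matches names such as libcudnn.so.7 or libcudnn.so.7.6.5 (assumed naming convention).
CUDNN_NAME = re.compile(r"^libcudnn\.so\.(\d+(?:\.\d+)*)$")


def library_path_directories(environ=None):
    """Return the non-empty entries of LD_LIBRARY_PATH from the given environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get("LD_LIBRARY_PATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


def find_cudnn_libraries(directories):
    """Return (version, path) pairs for libcudnn.so.* files found in the directories."""
    found = []
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # Skip stale entries that no longer exist.
        for name in sorted(os.listdir(directory)):
            match = CUDNN_NAME.match(name)
            if match:
                found.append((match.group(1), os.path.join(directory, name)))
    return found


if __name__ == "__main__":
    hits = find_cudnn_libraries(library_path_directories())
    if not hits:
        print("No libcudnn.so.* files found on LD_LIBRARY_PATH")
    for version, path in hits:
        print(version, path)
```

If more than one version shows up, the order of the directories in LD_LIBRARY_PATH generally decides which one is loaded first.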

jey07 commented 3 years ago

I am not deploying; I am trying to train the model. With the CPU, everything works fine. Any idea how long training on images takes with a CPU?

speaknowpotato commented 3 years ago

CUDA 10.2 + libcudnn 7.6.5.32 works.

thunder95 commented 3 years ago

I ran into the same problem with CUDA 10.2 + libcudnn 8. How did everyone solve it?

jey07 commented 3 years ago

So, paddlepaddle-ocr supports CUDA 10.2 with cuDNN 7.6.5.

hebo1982 commented 3 years ago

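From my testing, the cause is that the installed paddlepaddle build does not match the CUDA and cuDNN combination. https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/windows-pip.html#cuda10.2

Installing the matching paddlepaddle build fixes it.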

huanli2012 commented 3 years ago

Quoting yxd117: "CUDA 10.2 + cuDNN 8.0.5 does not work, while CUDA 10.2 + cuDNN 7.6.5 works fine."

With CUDA 10.2 + cuDNN 7.6.5, I still get the error.

D-DanielYang commented 3 years ago

Quoting huanli2012: "With CUDA 10.2 + cuDNN 7.6.5, I still get the error."

It is most likely a problem with how the CUDA and cuDNN versions are paired. Try a few different combinations.

thongvhoang commented 3 years ago

FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1628103340 (unix time) try "date -d @1628103340" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x58564a3239) received by PID 721 (TID 0x7f7e29429780) from PID 1447703097 ***]

I have the same problem. How can I fix it? I use this command:

!python3 tools/infer_det.py -c configs/det/det_r50_vd_east.yml -o Global.infer_img=$public_dataset_dir \
    Global.pretrained_model="/content/drive/My Drive/Colab_Notebook/text_scence_detection/PaddleOCR/output/det_r50_vd_east_v2.0_train/best_accuracy"

Evezerest commented 3 years ago

This is probably a cuDNN and CUDA version problem.

yanzheng636 commented 3 years ago

Quoting yangy996: "I upgraded CUDA to 11.0 and cuDNN to 8.0, and then it worked..."

Hello, this is exactly my environment, but I still get this error.

HuAndrew commented 2 years ago

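With CUDA 11 and cuDNN 8.0, installing paddlepaddle-gpu==2.0.0 produces the problem above. Installing paddlepaddle-gpu 2.2.1 instead, following the tutorial on the official website, works without problems.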

# https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html
python -m pip install paddlepaddle-gpu==2.2.1.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

paddle-bot-old[bot] commented 2 years ago

Since you haven't replied for more than 3 months, we have closed this issue/pr. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

EchoYGemini commented 2 years ago

In my Docker environment with CUDA 11.3 and cuDNN 8.2, installing paddlepaddle==2.3.0 also had this problem. Reinstalling with the command below fixed it.

python -m pip install paddlepaddle-gpu==2.2.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

Turnsole1 commented 2 years ago

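Be sure to follow the "GPU version of PaddlePaddle" section at this link, https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/windows-pip.html#cuda10.2, and download the paddlepaddle build that matches your CUDA version. Otherwise, the pip command installs the CUDA 10.2 build by default!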
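The advice above amounts to choosing a wheel by CUDA version rather than accepting the default. A small sketch of that choice follows, built only from the commands quoted in this thread. The table `POST_SUFFIX_BY_CUDA` and the function `pip_install_args` are hypothetical, and the assumption that `.post110` targets CUDA 11.0 and `.post112` targets CUDA 11.2 is inferred from the suffixes rather than confirmed here. The official install page remains the authority, and not every PaddlePaddle release necessarily ships every suffix.

```python
# Wheel index used by the install commands quoted in this thread.
STABLE_INDEX = "https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html"

# Assumed mapping, inferred from the thread; check the official install page.
POST_SUFFIX_BY_CUDA = {
    "10.2": "",          # the thread says a plain pip install gives the CUDA 10.2 build
    "11.0": ".post110",  # suffix used in the command for a CUDA 11 setup
    "11.2": ".post112",  # suffix used in the command for a Docker setup
}


def pip_install_args(cuda_version, paddle_version="2.2.1"):
    """Build the pip argument list for a paddlepaddle-gpu wheel matching a CUDA version."""
    if cuda_version not in POST_SUFFIX_BY_CUDA:
        raise ValueError(f"No known wheel suffix for CUDA {cuda_version}")
    suffix = POST_SUFFIX_BY_CUDA[cuda_version]
    args = ["python", "-m", "pip", "install",
            f"paddlepaddle-gpu=={paddle_version}{suffix}"]
    if suffix:
        # Suffixed wheels are fetched from the extra index, as in the quoted commands.
        args += ["-f", STABLE_INDEX]
    return args


if __name__ == "__main__":
    print(" ".join(pip_install_args("11.0")))
```

Raising an error for an unknown CUDA version is deliberate: silently falling back to the default wheel is exactly the mismatch this thread describes.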

PaddlePaddle / PaddleOCR