PaddlePaddle / Paddle

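PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)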
http://www.paddlepaddle.org/
Apache License 2.0
22.08k stars 5.55k forks

Paddle fails to run #41709

Closed Han-YLun closed 2 years ago

Han-YLun commented 2 years ago

Error message:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1649759535 (unix time) try "date -d @1649759535" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 3552 (TID 0x7f5ee8ef4700) from PID 0 ***]

Environment: running in Docker with the image registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda11.2-cudnn8, paddleocr==2.0.1, paddlepaddle-gpu==2.0.1

paddle-bot-old[bot] commented 2 years ago

Hi! We've received your issue and will arrange for technical staff to answer it as soon as possible; please be patient. Please check again that you have provided a clear problem description, reproduction code, environment and version details, and the error message. You can also look for answers in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

Han-YLun commented 2 years ago
nvidia-smi
Tue Apr 12 18:39:25 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   49C    P8    12W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   50C    P8    22W / 250W |      0MiB / 11177MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Han-YLun commented 2 years ago
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0412 10:46:10.999380  3804 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.4, Runtime API Version: 10.2
W0412 10:46:11.128330  3804 device_context.cc:372] device: 0, cuDNN Version: 8.1.
PaddlePaddle works well on 1 GPU.
2022-04-12 10:46:18,538 - WARNING - PaddlePaddle meets some problem with 2 GPUs. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests
 to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
2022-04-12 10:46:18,538 - WARNING -
 Original Error is: (External)  Nccl error, unhandled system error  (at /paddle/paddle/fluid/platform/nccl_helper.h:118)

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
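The version numbers in the device_context warning above can be pulled out and compared with the CUDA version in the Docker image tag. The sketch below is illustrative only: the regular expressions are written against the log line and image tag exactly as they appear in this thread, and the helper names `parse_cuda_versions` and `image_cuda_version` are hypothetical, not part of Paddle.

```python
import re

# Log line copied from the run_check output in this thread.
LOG_LINE = (
    "W0412 10:46:10.999380  3804 device_context.cc:362] Please NOTE: device: 0, "
    "GPU Compute Capability: 6.1, Driver API Version: 11.4, Runtime API Version: 10.2"
)

# Image tag from the original report.
IMAGE_TAG = "2.2.2-gpu-cuda11.2-cudnn8"


def parse_cuda_versions(line):
    """Return (driver_api, runtime_api) as strings, or None if the line does not match."""
    match = re.search(
        r"Driver API Version: (\d+\.\d+), Runtime API Version: (\d+\.\d+)", line
    )
    return match.groups() if match else None


def image_cuda_version(tag):
    """Extract the CUDA version from a tag shaped like '...-cudaX.Y-...'."""
    match = re.search(r"cuda(\d+\.\d+)", tag)
    return match.group(1) if match else None


driver_api, runtime_api = parse_cuda_versions(LOG_LINE)
image_cuda = image_cuda_version(IMAGE_TAG)
print(f"driver={driver_api} runtime={runtime_api} image={image_cuda}")
if runtime_api != image_cuda:
    # A mismatch is only a hint that the installed wheel may target another CUDA version.
    print("Runtime API version differs from the CUDA version in the image tag.")
```

For the values in this thread, the log reports Runtime API Version 10.2 while the image tag says cuda11.2, so the check flags a difference worth looking into.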
ceci3 commented 2 years ago

NCCL appears to be the problem; you could try running the program on a single GPU.

Han-YLun commented 2 years ago

How do I run the program on a single GPU?

Han-YLun commented 2 years ago

I restricted the container to a single GPU by creating it with --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1, but the same problem still occurs.
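Besides restricting the container, a common way to limit a process to one GPU is to set the CUDA_VISIBLE_DEVICES environment variable before any CUDA library is loaded. A minimal sketch follows; the helper name `use_single_gpu` is hypothetical, and the Paddle lines are left as comments because they only apply where paddle is installed. Restricting the container to one GPU did not remove the segmentation fault in this thread, so this only illustrates the single-GPU setup itself.

```python
import os


def use_single_gpu(index=0):
    """Expose only one GPU to this process by setting CUDA_VISIBLE_DEVICES.

    Call this before importing paddle (or any other CUDA-using library),
    because the variable is read when CUDA initializes.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = use_single_gpu(0)
print("CUDA_VISIBLE_DEVICES =", visible)

# Only afterwards import the framework, for example:
# import paddle
# paddle.set_device("gpu:0")  # index 0 now refers to the single visible GPU
```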

Han-YLun commented 2 years ago
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::SignalHandle(char const*, int)
1   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1649764241 (unix time) try "date -d @1649764241" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 1038 (TID 0x7fe62d9b3700) from PID 0 ***]
Han-YLun commented 2 years ago

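The paddlepaddle-gpu version is wrong: the number after "post" is the CUDA version. python -m pip install paddlepaddle-gpu==2.1.1.post112 -f https://paddlepaddle.org.cn/whl/mkl/stable.html

If that still does not work, go to https://www.paddlepaddle.org.cn/whl/mkl/stable.html, download version 2.0.2, and install it manually.

Using this solved the problem.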
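The "post" suffix convention mentioned above can be decoded mechanically. The sketch below assumes the suffix digits are the CUDA major version followed by a single-digit minor version (so post112 would mean CUDA 11.2). That matches the example in this thread but is an assumption about the naming scheme in general, and `cuda_from_post_suffix` is a hypothetical helper, not a Paddle or pip API.

```python
import re


def cuda_from_post_suffix(version):
    """Map e.g. '2.1.1.post112' to '11.2'; return None when there is no post suffix.

    Assumes the last digit of the suffix is the CUDA minor version.
    """
    match = re.search(r"\.post(\d+)(\d)$", version)
    if match is None:
        return None
    major, minor = match.groups()
    return f"{major}.{minor}"


print(cuda_from_post_suffix("2.1.1.post112"))
print(cuda_from_post_suffix("2.0.1"))
```

A version string without a post suffix, such as the 2.0.1 from the original report, carries no CUDA information in its name, so the helper returns None for it.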

Han-YLun commented 2 years ago

https://github.com/PaddlePaddle/PaddleOCR/issues/5387