PaddlePaddle / Paddle

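PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)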
http://www.paddlepaddle.org/
Apache License 2.0
21.66k stars 5.44k forks

Error during installation: The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website. #55715

Closed liangshu-code closed 9 months ago

liangshu-code commented 9 months ago

Issue Description

Running paddle.utils.run_check() reports the following:

import paddle
paddle.utils.run_check()
Running verify PaddlePaddle program ...
I0726 14:31:17.682049 212581 interpretercore.cc:237] New Executor is Running.
W0726 14:31:17.682821 212581 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0726 14:31:17.682865 212581 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 12.2, Runtime API Version: 11.7
W0726 14:31:17.839526 212581 gpu_resources.cc:149] device: 0, cuDNN Version: 8.4.
I0726 14:31:27.508836 212581 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

Version & Environment Information

Paddle version: 2.5.0
Paddle With CUDA: True

OS: ubuntu 20.04
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: N/A
Libc version: glibc 2.31
Python version: 3.8.0

CUDA version: 11.7.64 Build cuda_11.7.r11.7/compiler.31294372_0
cuDNN version: N/A
Nvidia driver version: 535.54.03
Nvidia driver List: GPU 0: Quadro P5000

YanhuiDua commented 9 months ago

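Hello, this is only a warning; the installation itself is fine. If you run into problems while using it, feel free to ask again. For reference: https://github.com/PaddlePaddle/Paddle/issues/54713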
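
The warning compares the GPU's compute capability (6.1 in the log above) with the architectures the wheel was built for (70 75 80 86). A minimal sketch of that comparison in plain Python is given here. The function names are hypothetical, and the rule used (same major version, compiled minor version not newer than the device's) is a simplification of CUDA binary compatibility that ignores PTX just-in-time compilation; it is not Paddle's own check.

```python
def split_arch(arch: int) -> tuple:
    """Split an arch number such as 86 into (major, minor) = (8, 6)."""
    return arch // 10, arch % 10


def wheel_has_native_kernels(capability, wheel_archs) -> bool:
    """Return True if any compiled arch can run natively on the device.

    Assumption: a binary built for X.y runs on devices X.z with z >= y.
    PTX forward compatibility is deliberately not modelled here.
    """
    dev_major, dev_minor = capability
    for arch in wheel_archs:
        major, minor = split_arch(arch)
        if major == dev_major and minor <= dev_minor:
            return True
    return False


# Values taken from the log in this thread.
wheel_archs = [70, 75, 80, 86]
print(wheel_has_native_kernels((6, 1), wheel_archs))  # Pascal card from the report
print(wheel_has_native_kernels((8, 6), wheel_archs))  # an Ampere card, for contrast
```

Under these assumptions a Pascal device (6.x) has no matching entry in the wheel's arch list, which is consistent with the warning text.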

chuwang9964 commented 9 months ago

I ran into this problem too; it stopped while running training.

YanhuiDua commented 9 months ago

Hello, what error is reported when training stops?

GM5GM5 commented 9 months ago

Training starts and then simply stops at this step.
W0816 11:37:29.171767 14600 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0816 11:37:29.172734 14600 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 12.2, Runtime API Version: 11.2
W0816 11:37:29.177718 14600 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
[2023/08/16 11:37:29] ppocr INFO: train dataloader has 18 iters
[2023/08/16 11:37:29] ppocr INFO: valid dataloader has 48 iters
[2023/08/16 11:37:29] ppocr INFO: load pretrain successful from ./pretrain_models/ch_ppocr_server_v2.0_det_train/best_accuracy
[2023/08/16 11:37:29] ppocr INFO: During the training process, after the 3000th iteration, an evaluation is run every 2000 iterations

YanhuiDua commented 9 months ago

Hello, this log contains no error message. Could you provide your paddle version and the command you ran?

GM5GM5 commented 9 months ago

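I solved this problem; it was a paddle version issue. I had installed with pip using the command copied directly from the official website. Later I chose to download the package and compile it myself, which solved the problem, and training now runs successfully. The original version was 2.6.1, and the command was python tools/train.py -c configs/det/det_mv3_db.yml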

BaoyuLi12138 commented 8 months ago

[2023-08-31 15:34:06,387] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'uie-senta-base'.
[2023-08-31 15:34:06,387] [ INFO] - Already cached /root/.paddlenlp/models/uie-senta-base/ernie_3.0_base_zh_vocab.txt
[2023-08-31 15:34:06,410] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/uie-senta-base/tokenizer_config.json
[2023-08-31 15:34:06,411] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/uie-senta-base/special_tokens_map.json
[2023-08-31 15:34:06,412] [ INFO] - Already cached /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-08-31 15:34:06,412] [ INFO] - Loading weights file model_state.pdparams from cache at /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-08-31 15:34:06,883] [ INFO] - Loaded weights file from disk, setting weights to model.
W0831 15:34:06.887965 1829614 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0831 15:34:06.887992 1829614 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0831 15:34:06.891219 1829614 gpu_resources.cc:149] device: 0, cuDNN Version: 7.6.

Process finished with exit code 134. As soon as I hit this, the run simply stops. Do I need to update the cuDNN version?
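
An exit code of 134 is easier to read once decoded. Shells conventionally report 128 + N when a process is killed by signal N, and Python's subprocess module reports the same event as -N. The sketch below (the helper name `describe_exit_code` is hypothetical) decodes both conventions using only the standard library.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process exit code into a human-readable description."""
    if code < 0:
        signum = -code          # subprocess-style: negative signal number
    elif code > 128:
        signum = code - 128     # shell-style: 128 + signal number
    else:
        return f"exit status {code}"
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        # Signal number not defined on this platform.
        return f"killed by unknown signal {signum}"


print(describe_exit_code(134))
print(describe_exit_code(1))
```

On Linux, signal 6 is SIGABRT, which is what a C++ program raises when it calls abort(), for example after an uncaught exception. That points at a crash inside native code rather than at a Python-level error.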

BaoyuLi12138 commented 8 months ago

paddle-bfloat 0.1.7
paddle2onnx 1.0.9
paddlefsl 1.1.0
paddlenlp 2.6.0
paddlepaddle 2.5.1
paddlepaddle-gpu 2.5.1.post117
python ==3.9.12
These are my paddle-related versions.

YanhuiDua commented 8 months ago

Try running python -c "import paddle;paddle.utils.run_check()" and see whether the output is normal.

BaoyuLi12138 commented 8 months ago

I am now using the Docker environment from the official website: nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.5.1-gpu-cuda11.7-cudnn8.4-trt8.4

W0831 10:10:01.869431 1333 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0831 10:10:01.869459 1333 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.7
W0831 10:10:02.049255 1333 gpu_resources.cc:149] device: 0, cuDNN Version: 8.4.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
This is the problem I am getting now.

The corresponding environment is:
paddle-bfloat 0.1.7
paddle2onnx 1.0.9
paddlefsl 1.1.0
paddlenlp 2.6.0
paddlepaddle-gpu 2.5.1.post117

The GPU is a 1080 Ti. Could it be that the model is not supported?

YanhuiDua commented 8 months ago

Hello, we have received this issue and will take a look.

YanhuiDua commented 8 months ago

You could first try testing with an image and whl package built for a lower CUDA version.

BaoyuLi12138 commented 8 months ago

[2023-09-01 03:33:44,390] [ INFO] - Loading weights file model_state.pdparams from cache at /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-09-01 03:33:45,324] [ INFO] - Loaded weights file from disk, setting weights to model.
W0901 03:33:45.331705 1222 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0901 03:33:45.331734 1222 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0901 03:33:45.335844 1222 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
Aborted (core dumped)

This is the current problem. I looked it up online; it appears to be a core dump, and the advice was to reduce batch_size. I have now set batch_size to 4, but it still fails. Is there something wrong with the configuration of the machine I am using?

My current command is:
python finetune.py \
    --train_path ./data/train.json \
    --dev_path ./data/dev.json \
    --save_dir ./checkpoint \
    --learning_rate 1e-5 \
    --batch_size 4 \
    --max_seq_len 512 \
    --num_epochs 3 \
    --model uie-senta-base \
    --seed 1000 \
    --logging_steps 10 \
    --valid_steps 100 \
    --device gpu

Current package versions:
paddle-bfloat 0.1.7
paddle2onnx 1.0.9
paddlefsl 1.1.0
paddlenlp 2.6.0
paddlepaddle-gpu 2.5.1.post112

BaoyuLi12138 commented 8 months ago

I changed the command to: python -u -m paddle.distributed.launch --gpus "0" finetune.py --train_path ./data/train.json --dev_path ./data/dev.json --save_dir ./checkpoint --learning_rate 1e-5 --batch_size 4 --max_seq_len 512 --num_epochs 3 --model uie-senta-base --seed 1000 --logging_steps 10 --valid_steps 100 --device gpu

Now I get:
[2023-09-04 05:42:17,598] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'uie-senta-base'.
[2023-09-04 05:42:17,599] [ INFO] - Already cached /root/.paddlenlp/models/uie-senta-base/ernie_3.0_base_zh_vocab.txt
[2023-09-04 05:42:17,630] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/uie-senta-base/tokenizer_config.json
[2023-09-04 05:42:17,630] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/uie-senta-base/special_tokens_map.json
[2023-09-04 05:42:17,631] [ INFO] - Already cached /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-09-04 05:42:17,632] [ INFO] - Loading weights file model_state.pdparams from cache at /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-09-04 05:42:18,228] [ INFO] - Loaded weights file from disk, setting weights to model.
W0904 05:42:18.233268 1611 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 05:42:18.233296 1611 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 05:42:18.236320 1611 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
LAUNCH INFO 2023-09-04 05:42:20,033 Pod failed
LAUNCH ERROR 2023-09-04 05:42:20,034 Container failed !!!
Container rank 0 status failed cmd ['/usr/bin/python', '-u', 'finetune.py'] code -6 log log/workerlog.0
env {'GREP_COLOR': '1;31', 'LC_ALL': 'en_US.UTF-8', 'SSH_CONNECTION': '192.168.4.93 65019 172.17.0.2 22', 'LANG': 'en_US.UTF-8', 'USER': 'root', 'PWD': '/paddle/PaddleNLP/applications/sentiment_analysis/unified_sentiment_extraction', 'HOME': '/root', 'CLICOLOR': '1', 'SSH_CLIENT': '192.168.4.93 65019 22', 'GREP_OPTIONS': '--color=auto', 'SSH_TTY': '/dev/pts/1', 'MAIL': '/var/mail/root', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'SHLVL': '1', 'LANGUAGE': 'enUS.UTF-8', 'LOGNAME': 'root', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'PS1': '[\033[1;33m]λ [\033[1;37m]\h [\033[1;32m]\w [\033[0m]', '': '/usr/bin/python', 'OLDPWD': '/paddle/PaddleNLP/applications/sentiment_analysis', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'sygxoa', 'PADDLE_MASTER': '172.17.0.2:40366', 'PADDLE_GLOBAL_SIZE': '1', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '172.17.0.2:40367', 'PADDLE_CURRENT_ENDPOINT': '172.17.0.2:40367', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '1', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
LAUNCH INFO 2023-09-04 05:42:20,034 ------------------------- ERROR LOG DETAIL -------------------------
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2023-09-04 05:42:17,598] [ INFO] - We are using (<class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'>, False) to load 'uie-senta-base'.
[2023-09-04 05:42:17,599] [ INFO] - Already cached /root/.paddlenlp/models/uie-senta-base/ernie_3.0_base_zh_vocab.txt
[2023-09-04 05:42:17,630] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/uie-senta-base/tokenizer_config.json
[2023-09-04 05:42:17,630] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/uie-senta-base/special_tokens_map.json
[2023-09-04 05:42:17,631] [ INFO] - Already cached /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-09-04 05:42:17,632] [ INFO] - Loading weights file model_state.pdparams from cache at /root/.paddlenlp/models/uie-senta-base/model_state.pdparams
[2023-09-04 05:42:18,228] [ INFO] - Loaded weights file from disk, setting weights to model.
W0904 05:42:18.233268 1611 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 05:42:18.233296 1611 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 05:42:18.236320 1611 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
LAUNCH INFO 2023-09-04 05:42:20,035 Exit code -6

YanhuiDua commented 8 months ago

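Hello, if running only paddle.utils.run_check() already produces the "cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device" error, then it has nothing to do with the run command or the model configuration; it is a paddle installation problem. I suggest lowering the CUDA version or building from source.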
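
When comparing logs like the ones posted in this thread, it can help to pull out the few fields that matter. The following is a rough sketch in plain Python that extracts the reported compute capability and the wheel's arch list and flags the "no kernel image" error. The function name `triage` and the report layout are hypothetical, and the regular expressions are written against the log lines quoted in this thread, so other Paddle versions may word them differently.

```python
import re

# Patterns based on the gpu_resources.cc lines quoted in this thread.
CAPABILITY_RE = re.compile(r"GPU Compute Capability: (\d+)\.(\d+)")
ARCH_RE = re.compile(r"Paddle installation with arch: ([\d ]+)")


def triage(log_text: str) -> dict:
    """Summarise the fields of a Paddle log that relate to arch mismatch."""
    report = {"capability": None, "wheel_archs": [], "no_kernel_image": False}

    match = CAPABILITY_RE.search(log_text)
    if match:
        report["capability"] = (int(match.group(1)), int(match.group(2)))

    match = ARCH_RE.search(log_text)
    if match:
        report["wheel_archs"] = [int(a) for a in match.group(1).split()]

    # This CUDA runtime error means no compiled kernel matches the device.
    report["no_kernel_image"] = "cudaErrorNoKernelImageForDevice" in log_text
    return report


sample = (
    "W0904 05:42:18.233268 1611 gpu_resources.cc:96] The GPU architecture in your "
    "current machine is Pascal, which is not compatible with Paddle installation "
    "with arch: 70 75 80 86 , it is recommended to install the corresponding wheel\n"
    "W0904 05:42:18.233296 1611 gpu_resources.cc:119] Please NOTE: device: 0, "
    "GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2\n"
    "what(): parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image "
    "is available for execution on the device\n"
)
print(triage(sample))
```

A log can contain the arch warning without the kernel-image error, as in the first report in this thread, so the two are reported as separate fields.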

BaoyuLi12138 commented 8 months ago

Hello, following what you told me earlier, I ran python -c "import paddle;paddle.utils.run_check()" on its own.

This is what I get now:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
Running verify PaddlePaddle program ...
I0904 08:32:01.975775 1720 interpretercore.cc:237] New Executor is Running.
W0904 08:32:01.976156 1720 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 08:32:01.976168 1720 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 08:32:01.979440 1720 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
I0904 08:32:04.316855 1720 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
I0904 08:32:05.960580 1767 tcp_utils.cc:107] Retry to connect to 127.0.0.1:34881 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
I0904 08:32:05.967067 1763 tcp_utils.cc:107] Retry to connect to 127.0.0.1:34881 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')
I0904 08:32:05.975247 1765 tcp_utils.cc:107] Retry to connect to 127.0.0.1:34881 while the server is not yet listening.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
I0904 08:32:06.002357 1761 tcp_utils.cc:181] The server starts to listen on IP_ANY:34881
I0904 08:32:06.002689 1761 tcp_utils.cc:130] Successfully connected to 127.0.0.1:34881
I0904 08:32:08.960848 1767 tcp_utils.cc:130] Successfully connected to 127.0.0.1:34881
I0904 08:32:08.967320 1763 tcp_utils.cc:130] Successfully connected to 127.0.0.1:34881
I0904 08:32:08.975486 1765 tcp_utils.cc:130] Successfully connected to 127.0.0.1:34881
W0904 08:32:09.957010 1761 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 08:32:09.957083 1761 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 08:32:09.960577 1761 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
W0904 08:32:09.998554 1763 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 08:32:09.998610 1763 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 08:32:09.998611 1767 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 08:32:09.998682 1767 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 08:32:09.998685 1765 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0904 08:32:09.998782 1765 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 6.1, Driver API Version: 11.7, Runtime API Version: 11.2
W0904 08:32:10.005148 1763 gpu_resources.cc:149] device: 1, cuDNN Version: 8.2.
W0904 08:32:10.005192 1765 gpu_resources.cc:149] device: 2, cuDNN Version: 8.2.
W0904 08:32:10.005231 1767 gpu_resources.cc:149] device: 3, cuDNN Version: 8.2.
Failed, NCCL error ../paddle/fluid/distributed/collective/process_group_nccl.cc:660 'unhandled system error'
Failed, NCCL error ../paddle/fluid/distributed/collective/process_group_nccl.cc:660 'unhandled system error'

C++ Traceback (most recent call last):
0 paddle::distributed::ProcessGroupNCCL::Barrier(paddle::distributed::BarrierOptions const&)
1 paddle::distributed::ProcessGroupNCCL::AllReduce(phi::DenseTensor, phi::DenseTensor const&, paddle::distributed::AllreduceOptions const&, bool, bool)
2 paddle::distributed::ProcessGroupNCCL::RunFnInNCCLEnv(std::function<void (ncclComm, CUstream_st*)>, phi::DenseTensor const&, paddle::distributed::CommType, bool, bool)
3 paddle::distributed::ProcessGroupNCCL::CreateNCCLEnvCache(phi::Place const&, std::string const&)
4 ncclCommInitRank

Error Message Summary:
FatalError: Termination signal is detected by the operating system.
[TimeInfo: Aborted at 1693816330 (unix time) try "date -d @1693816330" if you are using GNU date ]
[SignalInfo: SIGTERM (@0x6b8) received by PID 1761 (TID 0x7f639e388740) from PID 1720 ]

C++ Traceback (most recent call last):
0 paddle::distributed::ProcessGroupNCCL::Barrier(paddle::distributed::BarrierOptions const&)
1 paddle::distributed::ProcessGroupNCCL::AllReduce(phi::DenseTensor, phi::DenseTensor const&, paddle::distributed::AllreduceOptions const&, bool, bool)
2 paddle::distributed::ProcessGroupNCCL::RunFnInNCCLEnv(std::function<void (ncclComm, CUstream_st*)>, phi::DenseTensor const&, paddle::distributed::CommType, bool, bool)
3 paddle::distributed::ProcessGroupNCCL::CreateNCCLEnvCache(phi::Place const&, std::string const&)
4 ncclCommInitRank

Error Message Summary:
FatalError: Termination signal is detected by the operating system.
[TimeInfo: Aborted at 1693816330 (unix time) try "date -d @1693816330" if you are using GNU date ]
[SignalInfo: SIGTERM (@0x6b8) received by PID 1763 (TID 0x7fa368bda740) from PID 1720 ]

WARNING:root:PaddlePaddle meets some problem with 4 GPUs. This may be caused by:
There is not enough GPUs visible on your system
Some GPUs are occupied by other process now
NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
WARNING:root: Original Error is: Process 2 terminated with exit code 1.
PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/paddle/utils/install_check.py", line 282, in run_check
    raise e
  File "/usr/local/lib/python3.7/dist-packages/paddle/utils/install_check.py", line 255, in run_check
    _run_parallel(device_list)
  File "/usr/local/lib/python3.7/dist-packages/paddle/utils/install_check.py", line 206, in _run_parallel
    paddle.distributed.spawn(train_for_run_parallel, nprocs=len(device_list))
  File "/usr/local/lib/python3.7/dist-packages/paddle/distributed/spawn.py", line 595, in spawn
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/paddle/distributed/spawn.py", line 399, in join
    self._throw_exception(error_index)
  File "/usr/local/lib/python3.7/dist-packages/paddle/distributed/spawn.py", line 413, in _throw_exception
    % (error_index, exitcode)
Exception: Process 2 terminated with exit code 1.

That is the complete output of the run. The cudaErrorNoKernelImageForDevice problem did not appear. Could you please take another look and tell me what might be causing this?

BaoyuLi12138 commented 8 months ago

Hello, I found the problem on my side: two pieces of code in the same directory require different environments. Thank you for the guidance!

BaoyuLi12138 commented 8 months ago

For the record, the final working setup:
cuda 10.2
cudnn 7.6
paddlepaddle-gpu 2.5.1-post102
paddlenlp 2.6.0

tianji2018 commented 8 months ago

Although it is reported as a warning, it is completely unusable; every computation is wrong.

import paddle
paddle.ones([3,3])
W0913 16:00:46.068766 51120 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0913 16:00:46.068822 51120 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0913 16:00:46.073894 51120 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
Tensor(shape=[3, 3], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

YanhuiDua commented 8 months ago

Please run python -c "import paddle;paddle.utils.run_check()" and check whether the installation is correct.

tianji2018 commented 8 months ago

python -c "import paddle;paddle.utils.run_check()"

Running verify PaddlePaddle program ...
I0913 16:34:03.777916 66339 interpretercore.cc:237] New Executor is Running.
W0913 16:34:03.778102 66339 gpu_resources.cc:96] The GPU architecture in your current machine is Pascal, which is not compatible with Paddle installation with arch: 70 75 80 86 , it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0913 16:34:03.778110 66339 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0913 16:34:03.779343 66339 gpu_resources.cc:149] device: 0, cuDNN Version: 8.2.
I0913 16:34:03.961378 66339 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

YanhuiDua commented 8 months ago

The installation is correct. Please provide your hardware details and paddle version.

tianji2018 commented 8 months ago

System: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-162-generic x86_64)
GPU: NVIDIA P100 16G
GPU driver version: 11.2; version string: NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2
cudnn 8.2
conda environment: python 3.10.9
paddle version: paddlepaddle-gpu==2.5.1.post112, installed with the command python -m pip install paddlepaddle-gpu==2.5.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html (I also tried the conda command, with the same problem)

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

YanhuiDua commented 8 months ago

The versions all appear to match. Could you test some other APIs?

tianji2018 commented 8 months ago

import paddle
x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3])  # Tensor(shape=[4], dtype=float32, place=Place(gpu:0), stop_gradient=True,[-0.40000001, -0.20000000, 0.10000000, 0.30000001])
paddle.abs(x)  # [0., 0., 0., 0.]
paddle.argmax(x)  # 0
paddle.argmin(x)  # 0
x+1  # [0., 0., 0., 0.]
x = paddle.to_tensor([-4,-2,1,3])  # Tensor(shape=[4], dtype=int64, place=Place(gpu:0), stop_gradient=True,[-4, -2, 1, 3])
x+1  # [-4734183924231779123, 4510805389529107661, 0 ,0 ]
paddle.isnan(x)  # Aborted (core dumped); the process crashes and exits
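
Because run_check() passed on this machine while ordinary operators returned wrong values, a small numerical check against a pure-Python reference may be a more telling smoke test. The sketch below is only an illustration: the helper names (`reference_results`, `mismatches`, `paddle_results`) are hypothetical, only four operators are covered, and `paddle_results` is an untested assumption about how one might collect the values using `paddle.to_tensor`, `paddle.abs`, `paddle.argmax`, `paddle.argmin`, `Tensor.tolist()` and `Tensor.item()`.

```python
import math


def reference_results(values):
    """Compute the expected results in plain Python, without any GPU."""
    indices = range(len(values))
    return {
        "abs": [abs(v) for v in values],
        "add_one": [v + 1 for v in values],
        "argmax": max(indices, key=values.__getitem__),
        "argmin": min(indices, key=values.__getitem__),
    }


def mismatches(actual, expected, tol=1e-5):
    """Return the names of operations whose results differ from the reference."""
    bad = []
    for name, want in expected.items():
        got = actual[name]
        if isinstance(want, list):
            ok = len(got) == len(want) and all(
                math.isclose(g, w, abs_tol=tol) for g, w in zip(got, want)
            )
        else:
            ok = got == want
        if not ok:
            bad.append(name)
    return bad


def paddle_results(values):
    """Collect the same results from Paddle (requires a Paddle install)."""
    import paddle  # imported lazily so the helpers above work without Paddle

    x = paddle.to_tensor(values)
    return {
        "abs": paddle.abs(x).tolist(),
        "add_one": (x + 1).tolist(),
        "argmax": int(paddle.argmax(x).item()),
        "argmin": int(paddle.argmin(x).item()),
    }


values = [-0.4, -0.2, 0.1, 0.3]
# The wrong outputs reported above, fed in by hand instead of calling Paddle.
reported = {"abs": [0.0] * 4, "add_one": [0.0] * 4, "argmax": 0, "argmin": 0}
print(mismatches(reported, reference_results(values)))
```

On a machine with Paddle installed, `mismatches(paddle_results(values), reference_results(values))` should come back empty when the kernels behave correctly. Note that argmin does not show up as a mismatch for the reported values, since the correct answer happens to be 0 as well.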

YanhuiDua commented 8 months ago

OK, got it. We will take a look.

tianji2018 commented 8 months ago

Thank you for your hard work.

YanhuiDua commented 8 months ago

Hello, on the P100 (sm60) architecture, the cuda11.2 package may run into problems. I suggest trying the cuda10.2 whl package or building from source.

neuxys commented 8 months ago

Hello, I ran into the same problem. When selecting the CUDA version, I found that 10.2 does not support ubuntu20.04. What should I do?

YanhuiDua commented 8 months ago

Hello, you will need to build it yourself. I recommend ubuntu18.04; 20.04 may run into problems. For building, see https://www.paddlepaddle.org.cn/documentation/docs/zh/install/compile/linux-compile-by-make.html

neuxys commented 8 months ago

Thank you for your reply! OK, I will give it a try.

tianji2018 commented 8 months ago

Hello, after downgrading paddle to 2.4.0 and to 2.4.2, my tests work correctly with both.

schild commented 7 months ago

The versioning is a complete mess; I have been at this for a week and it still does not work.

YanhuiDua commented 7 months ago

What specific problem did you run into? You can open a new issue to ask.