PaddlePaddle / Paddle

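PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)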
http://www.paddlepaddle.org/
Apache License 2.0
22.3k stars 5.62k forks

Error when running on an A100 GPU: paddle is not compatible with CUDA #65312

Open BITprogramMan opened 5 months ago

BITprogramMan commented 5 months ago

Describe the Bug

Paddle version: paddlepaddle-gpu 1.6.2.post107

CUDA version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Error message:
W0620 12:53:27.994474 149242 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 80, Driver API Version: 12.0, Runtime API Version: 10.0
W0620 12:53:27.995631 149242 device_context.cc:244] device: 0, cuDNN Version: 8.6.
W0620 12:53:28.540676 149242 operator.cc:179] truncated_gaussian_random raises an exception thrust::system::system_error, parallel_for failed: no kernel image is available for execution on the device
/root/paddlejob/workspace/env_run/valuation/python2/lib/python2.7/site-packages/paddle/fluid/executor.py:779: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "run_with_json.py", line 112, in <module>
    trainer = build_trainer(trainer_params_dict, dataset_reader, model, num_train_examples)
  File "run_with_json.py", line 83, in build_trainer
    trainer = trainer_class(params=params_dict, data_set_reader=dataset_reader, model_class=model)
  File "/root/paddlejob/workspace/env_run/valuation/textone/training/custom_trainer.py", line 30, in __init__
    BaseTrainer.__init__(self, params, data_set_reader, model_class)
  File "/root/paddlejob/workspace/env_run/valuation/textone/training/base_trainer.py", line 53, in __init__
    self.executor.run(self.startup_program)
  File "/root/paddlejob/workspace/env_run/valuation/python2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 780, in run
    six.reraise(*sys.exc_info())
  File "/root/paddlejob/workspace/env_run/valuation/python2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 775, in run
    use_program_cache=use_program_cache)
  File "/root/paddlejob/workspace/env_run/valuation/python2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 822, in _run_impl
    use_program_cache=use_program_cache)
  File "/root/paddlejob/workspace/env_run/valuation/python2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 899, in _run_program
    fetch_var_name)
RuntimeError: parallel_for failed: no kernel image is available for execution on the device

Additional Supplementary Information

No response

warrentdrew commented 5 months ago

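Hello. The paddle version in use, paddlepaddle-gpu 1.6.2.post107, is a build for CUDA 10 (the post107 suffix denotes CUDA 10 with cuDNN 7, and the log in the report shows Runtime API Version: 10.0), which does not match the CUDA 11.8 environment. You need to upgrade paddle; see the installation instructions on the official Paddle website:

python -m pip install paddlepaddle-gpu==2.6.1 -i https://pypi.tuna.tsinghua.edu.cn/simple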
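
The mismatch can also be read directly from the first warning line of the log: the A100 reports CUDA Capability 80, while the installed wheel was built against CUDA runtime 10.0, and CUDA toolkits before 11.0 cannot target compute capability 8.0. That is a likely cause of the "no kernel image is available" error. The following is a minimal sketch of that check, not part of Paddle. The function name `check_startup_log` and the lookup table are illustrative assumptions, and the table covers only a few common architectures.

```python
import re

# Minimum CUDA toolkit release able to target each compute capability.
# Partial table for illustration; extend it for other GPUs.
MIN_CUDA_FOR_CAPABILITY = {
    70: (9, 0),   # V100
    75: (10, 0),  # T4
    80: (11, 0),  # A100
    86: (11, 1),
    89: (11, 8),
    90: (11, 8),  # H100
}

# Matches the device_context.cc warning printed by Paddle 1.x at startup.
LOG_PATTERN = re.compile(
    r"CUDA Capability: (\d+), Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def check_startup_log(line):
    """Return a dict describing whether the wheel's CUDA runtime can
    target the GPU named in the log line, or None if the line does not
    match or the capability is not in the table."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    capability = int(match.group(1))
    runtime = (int(match.group(4)), int(match.group(5)))
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if required is None:
        return None
    return {
        "capability": capability,
        "runtime": runtime,
        "required": required,
        "compatible": runtime >= required,
    }


if __name__ == "__main__":
    log_line = (
        "W0620 12:53:27.994474 149242 device_context.cc:236] Please NOTE: "
        "device: 0, CUDA Capability: 80, Driver API Version: 12.0, "
        "Runtime API Version: 10.0"
    )
    print(check_startup_log(log_line))
```

Meeting the minimum runtime version is necessary but not sufficient, since the wheel must also have been compiled with kernels for the target architecture. After upgrading, `paddle.utils.run_check()` in Paddle 2.x can be used to confirm that the GPU build works. Note also that the traceback shows a Python 2.7 environment, while Paddle 2.6 is distributed for Python 3 only, so the upgrade probably also requires moving the training code to Python 3.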