PaddlePaddle / Paddle

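PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)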
http://www.paddlepaddle.org/
Apache License 2.0
22.27k stars 5.61k forks

Running optimizer.minimize(model_dict.loss, model_dict.startup_program) fails with ExternalError: NCCL error(1), unhandled cuda error. #54171

Open hrbxjylxjj opened 1 year ago

hrbxjylxjj commented 1 year ago

Please ask your question

I am training a model on GPU. Stepping through the code, I found that the error is raised when execution reaches the line optimizer.minimize(model_dict.loss, model_dict.startup_program):

C++ Traceback (most recent call last):

0 paddle::framework::StandaloneExecutor::Run(paddle::framework::Scope, std::vector<std::string, std::allocator > const&, std::vector<std::string, std::allocator > const&)
1 paddle::framework::InterpreterCore::Run(std::vector<std::string, std::allocator > const&, bool)
2 paddle::framework::interpreter::BuildOpFuncList(phi::Place const&, paddle::framework::BlockDesc const&, std::set<std::string, std::less, std::allocator > const&, std::vector<paddle::framework::OpFuncNode, std::allocator >, paddle::framework::VariableScope, paddle::framework::interpreter::ExecutionConfig const&, bool)
3 paddle::framework::interpreter::HandleOperatorBase(phi::Place const&, paddle::framework::VariableScope const, std::shared_ptr, paddle::framework::OpFuncNode, paddle::framework::Scope)
4 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
5 paddle::operators::CCommInitAllOp::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
6 paddle::platform::NCCLCommContext::CreateAllNCCLComms(std::vector<int, std::allocator > const&, int)
7 phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
8 phi::enforce::GetCurrentTraceBackString[abi:cxx11]

ExternalError: NCCL error(1), unhandled cuda error. [Hint: Please search for the error code(1) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /root/paddlejob/paddle.jxw/Paddle/paddle/fluid/platform/collective_helper.cc:137) [operator < c_comm_init_all > error] [operator < c_comm_init_all > error]
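The hint in the error message points to NCCL's ncclResult_t table. A minimal sketch of looking up the code from the message is shown here, assuming the message format above. The helper names (describe_nccl_error, debug_env) are hypothetical, and the table is only a partial copy of ncclResult_t; newer NCCL releases define more codes.

```python
import re

# Partial copy of NCCL's ncclResult_t values (assumption: codes 0-5 only;
# newer NCCL releases define additional codes not listed here).
NCCL_RESULT_NAMES = {
    0: "ncclSuccess",
    1: "ncclUnhandledCudaError",
    2: "ncclSystemError",
    3: "ncclInternalError",
    4: "ncclInvalidArgument",
    5: "ncclInvalidUsage",
}


def describe_nccl_error(message):
    """Extract the numeric code from 'NCCL error(N)' and name it.

    Returns (code, name), or None if the message has no NCCL error code.
    """
    match = re.search(r"NCCL error\((\d+)\)", message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, NCCL_RESULT_NAMES.get(code, "unknown")


def debug_env(base_env):
    """Return a copy of base_env with NCCL's own logging turned on.

    NCCL_DEBUG is an environment variable read by NCCL itself; it is set
    only if the caller has not already chosen a value.
    """
    env = dict(base_env)
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


if __name__ == "__main__":
    print(describe_nccl_error("ExternalError: NCCL error(1), unhandled cuda error."))
```

Code 1 is the generic "a CUDA call inside NCCL failed" result, so the message alone does not say which call failed. Re-running the job with the environment from debug_env (for example by passing it as env= to subprocess.run) should make NCCL print its own log lines around c_comm_init_all, which is usually where the underlying CUDA error shows up.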

2742195759 commented 1 year ago

Shouldn't minimize be given the main_program rather than the startup program?

MingqiFANG commented 10 months ago

optimizer.minimize shouldn't need either the main_program or the startup program, should it? Just pass model.loss directly.
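For reference, a minimal static-graph sketch of the call being discussed, assuming the Paddle 2.x API. As far as I can tell, Optimizer.minimize is declared there as minimize(loss, startup_program=None, parameters=None, no_grad_set=None), so passing only the loss works when the call is made inside program_guard, and a second positional argument is read as the startup program. The toy network and the function name build_static_training_program are hypothetical and are not the model from the original post.

```python
def build_static_training_program():
    """Build a toy static-graph program and attach an SGD update to it."""
    import paddle  # imported lazily so the module loads without Paddle

    paddle.enable_static()
    main_program = paddle.static.Program()
    startup_program = paddle.static.Program()

    # program_guard makes these two programs the defaults, so minimize()
    # can pick them up without being told about them explicitly.
    with paddle.static.program_guard(main_program, startup_program):
        x = paddle.static.data(name="x", shape=[None, 4], dtype="float32")
        y = paddle.static.nn.fc(x, size=1)
        loss = paddle.mean(y)

        optimizer = paddle.optimizer.SGD(learning_rate=0.01)
        # Only the loss is passed; startup_program is left at its default.
        optimizer.minimize(loss)

    return main_program, startup_program, loss
```

Note that this sketch is single-device and involves no NCCL at all. In the traceback above, the failing operator is c_comm_init_all, which runs when the executor initialises the NCCL communicators, so the arguments to minimize may not be what triggers the error.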