ExternalError: NCCL error(1), unhandled cuda error.
[Hint: Please search for the error code(1) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /root/paddlejob/paddle.jxw/Paddle/paddle/fluid/platform/collective_helper.cc:137)
[operator < c_comm_init_all > error] [operator < c_comm_init_all > error]
Please ask your question
I am training a model on GPU. Stepping through the code, I found that the error is raised when execution reaches the line optimizer.minimize(model_dict.loss, model_dict.startup_program).
C++ Traceback (most recent call last):
0 paddle::framework::StandaloneExecutor::Run(paddle::framework::Scope, std::vector<std::string, std::allocator > const&, std::vector<std::string, std::allocator > const&)
1 paddle::framework::InterpreterCore::Run(std::vector<std::string, std::allocator > const&, bool)
2 paddle::framework::interpreter::BuildOpFuncList(phi::Place const&, paddle::framework::BlockDesc const&, std::set<std::string, std::less, std::allocator > const&, std::vector<paddle::framework::OpFuncNode, std::allocator > , paddle::framework::VariableScope, paddle::framework::interpreter::ExecutionConfig const&, bool)
3 paddle::framework::interpreter::HandleOperatorBase(phi::Place const&, paddle::framework::VariableScope const, std::shared_ptr, paddle::framework::OpFuncNode, paddle::framework::Scope)
4 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
5 paddle::operators::CCommInitAllOp::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
6 paddle::platform::NCCLCommContext::CreateAllNCCLComms(std::vector<int, std::allocator > const&, int)
7 phi::enforce::EnforceNotMet::EnforceNotMet(phi::ErrorSummary const&, char const*, int)
8 phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)
ExternalError: NCCL error(1), unhandled cuda error. [Hint: Please search for the error code(1) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /root/paddlejob/paddle.jxw/Paddle/paddle/fluid/platform/collective_helper.cc:137) [operator < c_comm_init_all > error] [operator < c_comm_init_all > error]
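
The traceback shows the failure happens inside c_comm_init_all while the NCCL communicators are being created, and NCCL result code 1 is ncclUnhandledCudaError, so the underlying CUDA failure is not visible in the message above. A minimal diagnostic sketch follows, assuming Python 3 and that nvidia-smi is on the PATH. It only uses the standard library plus the documented NCCL_DEBUG environment variable; the helper names nccl_debug_env and list_gpus are hypothetical and not part of Paddle or NCCL.

```python
import os
import shutil
import subprocess


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with verbose NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG is a documented NCCL variable; INFO makes NCCL log its
    # initialization steps, which usually includes the failing CUDA call.
    env["NCCL_DEBUG"] = "INFO"
    return env


def list_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Show which devices the training process would be allowed to see.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
    for line in list_gpus():
        print(line)
    # The returned mapping can be passed as env= to subprocess.run when
    # launching the training script, so the NCCL log appears next to the error.
    print("NCCL_DEBUG =", nccl_debug_env()["NCCL_DEBUG"])
```

Comparing the GPUs listed here with the device IDs the job asks for, and rerunning the training script with the NCCL_DEBUG=INFO environment, should give a more specific message than "unhandled cuda error".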