Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0
390 stars 55 forks

Running gpt2_pretrain.py on a single machine with multiple GPUs hits the following problem #534

Open treestreamymw opened 9 months ago

treestreamymw commented 9 months ago

F20240306 12:52:30.421669 11024 ctrl_client.cpp:54] Check failed: rpcclient.GetStubAt(i)->CallMethod( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
Check failure stack trace:
    @ 0x7fa53f8039ca google::LogMessage::Fail()
    @ 0x7fa53f803cb2 google::LogMessage::SendToLog()
    @ 0x7fa53f803537 google::LogMessage::Flush()
    @ 0x7fa53f8060a9 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fa535118195 _ZZN7oneflow14GrpcCtrlClientC4ERKNS_10ProcessCtxEENKUlvE_clEv
    @ 0x7fa53f81840f execute_native_thread_routine
    @ 0x7fa6292476db start_thread
    @ 0x7fa62882861f clone