Closed: WonderingWJ closed this issue 2 years ago
docker run --gpus=all --rm -it --cap-add SYS_NICE --gpus '"device=0,1,2,3"' -it -u root:root oneflowinc/oneflow:nightly-cuda11.2
Then run sh train.sh under the folder dcn.
Get the error:
*** Check failure stack trace: ***
F20220828 03:05:09.556852 2250 io_event_poller.cpp:95] Check failed: !(cur_event->events & EPOLLERR) fd: 52: Resource temporarily unavailable [11]
*** Check failure stack trace: ***
F20220828 03:05:09.557612 1837 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod<CtrlMethod::kLoadServer>( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 1 lost
*** Check failure stack trace: ***
GPU environment
nvidia-smi
Sun Aug 28 03:09:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   28C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   25C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   26C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
Thanks for your feedback. You can add "--ipc=host" to the docker run command when starting the container.
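For reference, here is a sketch of what the original command might look like with that flag added. It is not a confirmed fix beyond the suggestion above. The repeated --gpus=all and second -it from the original command are dropped here on the assumption that the explicit device list was the intended setting. The RUN variable is a hypothetical dry-run switch added for this example only.

```shell
# Build the docker invocation as positional parameters so quoting is preserved.
# --ipc=host makes the container share the host IPC namespace (including /dev/shm).
set -- docker run --rm -it --ipc=host --cap-add SYS_NICE \
  --gpus '"device=0,1,2,3"' -u root:root \
  oneflowinc/oneflow:nightly-cuda11.2

# RUN is a hypothetical switch for this sketch: by default only print the command.
if [ "${RUN:-0}" = "1" ]; then
  "$@"        # actually start the container (needs Docker and the NVIDIA runtime)
else
  echo "$*"   # dry run: show the command that would be executed
fi
```

Docker gives a container a small private /dev/shm by default, so if sharing the host IPC namespace is not acceptable, raising the limit with --shm-size may be an alternative. That is an assumption and was not tested in this thread.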