Oneflow-Inc / models

Models and examples built with OneFlow

Machine x lost when running DCN benchmark #381

Closed WonderingWJ closed 2 years ago

WonderingWJ commented 2 years ago
  1. Start a Docker container with the nightly image:
    docker run --gpus=all --rm -it --cap-add SYS_NICE \
    --gpus '"device=0,1,2,3"' -it -u root:root oneflowinc/oneflow:nightly-cuda11.2
  2. Run sh train.sh in the dcn folder. It fails with the following error:
    *** Check failure stack trace: ***
    F20220828 03:05:09.556852  2250 io_event_poller.cpp:95] Check failed: !(cur_event->events & EPOLLERR) fd: 52: Resource temporarily unavailable [11]
    *** Check failure stack trace: ***
    F20220828 03:05:09.557612  1837 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod<CtrlMethod::kLoadServer>( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 1 lost
    *** Check failure stack trace: *** 

    GPU environment

    nvidia-smi
    Sun Aug 28 03:09:34 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100 80G...  On   | 00000000:1A:00.0 Off |                    0 |
    | N/A   28C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100 80G...  On   | 00000000:1B:00.0 Off |                    0 |
    | N/A   27C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100 80G...  On   | 00000000:3D:00.0 Off |                    0 |
    | N/A   25C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100 80G...  On   | 00000000:3E:00.0 Off |                    0 |
    | N/A   26C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
shangguanshiyuan commented 2 years ago

Thanks for your feedback. Try passing the --ipc=host option to docker run when starting the container.
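
For reference, the suggestion applied to the launch command from the report might look like the sketch below. It assumes the duplicated --gpus and -it flags in the original command were unintentional and keeps one of each; that cleanup is not part of the reply above. The --ipc=host flag makes the container share the host's IPC namespace, including /dev/shm, instead of Docker's small default shared-memory segment. The thread does not confirm that shared memory was the root cause, so treat this as the suggested workaround rather than a diagnosis.

```shell
# Sketch only: the reporter's command with --ipc=host added.
# Assumption: the repeated --gpus/-it flags are collapsed to a single instance.
docker run --rm -it \
  --ipc=host \
  --cap-add SYS_NICE \
  --gpus '"device=0,1,2,3"' \
  -u root:root \
  oneflowinc/oneflow:nightly-cuda11.2

# Optional check, run inside the container: shows the size of the
# shared-memory mount the training processes will see.
df -h /dev/shm
```

If sharing the host IPC namespace is not acceptable, docker run also accepts --shm-size to enlarge the container's own /dev/shm. Whether that is sufficient for this benchmark is not established in this thread.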