Oneflow-Inc / OneFlow-Benchmark

OneFlow models for benchmarking.

Running sh train.sh hangs in resnet50 benchmark with 4 and 8 GPUs on a single machine #152


wuyujiji commented 3 years ago

Question

Hi, I recently built the OneFlow environment and ran the resnet50 model from OneFlow-Benchmark. It runs successfully with 1 GPU or 2 GPUs on a single machine, but hangs with 4 or 8 GPUs on a single machine.

Environment

gpu: Tesla V100 16 GB
python: 3.6
cuda: 10.0
cudnn: 7
oneflow: 0.2.0
OneFlow-Benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2
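To double-check the versions above, something like the following can be used (only a sketch; the `oneflow.__version__` attribute is an assumption, adjust if the package reports its version differently):

```bash
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv   # expect Tesla V100, ~16 GB
python3 --version                                                       # expect Python 3.6.x
nvcc --version                                                          # expect CUDA 10.0
python3 -c "import oneflow; print(oneflow.__version__)"                 # expect 0.2.0 (assumes __version__ exists)
```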

log info: NUM_EPOCH=2 DATA_ROOT=/workdir/data/mini-imagenet/ofrecord

Running resnet50: num_gpu_per_node = 4, num_nodes = 1.

dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0

Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val

Then it hangs for a long time with no further output.
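One way to get more signal on where it is stuck is to re-launch with NCCL debug logging enabled. This is only a diagnostic sketch using standard NCCL environment variables, not OneFlow-specific options:

```bash
# Re-run the same launch with verbose NCCL logging (standard NCCL env vars).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL   # optional: restrict output to init/collective phases
sh train.sh 2>&1 | tee train_nccl_debug.log
```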

To Reproduce

  1. Build the OneFlow environment: `python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100`
  2. Clone the OneFlow-Benchmark source: `git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git`
  3. Download mini-ImageNet. Note: to run multi-GPU on one machine, I copied part-00000 into 8 pieces of data in the train and validation folders, respectively.
  4. Change the shell script: `cd Classification/cnns/`, `vim train.sh`, set `--train_data_part_num=8`, `--val_data_part_num=8`, and `gpu_num_per_node=4`. (GPU counts tried were 1, 2, 4, and 8; 1 and 2 are normal, but 4 and 8 hang.)
  5. Run the shell script: `sh train.sh` (the steps are consolidated into the sketch after this list)
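For convenience, the steps above consolidate roughly into the sketch below. The mini-ImageNet download is left as a placeholder since no source is given here, and the copy loop only illustrates how part-00000 was duplicated into 8 parts (the `part-%05d` file naming is an assumption to check against the actual folder contents):

```bash
# 1. Build the OneFlow environment (CUDA 10.0 wheel from the nightly index).
python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100

# 2. Clone the benchmark source.
git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
cd OneFlow-Benchmark/Classification/cnns

# 3. Download mini-ImageNet into $DATA_ROOT (source not specified above), then
#    duplicate part-00000 into 8 parts for train and val. Illustrative only;
#    assumes the OFRecord files follow the part-%05d naming pattern.
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord
for split in train val; do
  for i in 1 2 3 4 5 6 7; do
    cp "$DATA_ROOT/$split/part-00000" "$DATA_ROOT/$split/$(printf 'part-%05d' "$i")"
  done
done

# 4. Edit train.sh: set --train_data_part_num=8, --val_data_part_num=8,
#    and gpu_num_per_node=4 (1 and 2 run fine; 4 and 8 hang).

# 5. Run the script.
sh train.sh
```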
wuyujiji commented 3 years ago

This is a new error with 3 GPUs:

F1028 20:39:07.626992 206314 collective_boxing_executor.cpp:452] Check failed: ncclGroupEnd() : unhandled system error (2)
Check failure stack trace:
    @ 0x7f09480a08dd google::LogMessage::Fail()
    @ 0x7f09480a4a1c google::LogMessage::SendToLog()
    @ 0x7f09480a0403 google::LogMessage::Flush()
    @ 0x7f09480a5439 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f0947005ff4 oneflow::boxing::collective::NcclCollectiveBoxingExecutorBackend::Init()
    @ 0x7f0947006e5d oneflow::boxing::collective::CollectiveBoxingExecutor::CollectiveBoxingExecutor()
    @ 0x7f09470adcc9 oneflow::Runtime::NewAllGlobal()
    @ 0x7f09470ae63e oneflow::Runtime::Runtime()
    @ 0x7f0947095bf3 (unknown)
    @ 0x7f0946e24425 (unknown)
    @ 0x7f096f40a04a _PyCFunction_FastCallDict
    @ 0x7f096f475a3f (unknown)
    @ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
    @ 0x7f096f47582a (unknown)
    @ 0x7f096f475b63 (unknown)
    @ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
    @ 0x7f096f47582a (unknown)
    @ 0x7f096f475b63 (unknown)
    @ 0x7f096f475b63 (unknown)
    @ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
    @ 0x7f096f47582a (unknown)
    @ 0x7f096f475b63 (unknown)
    @ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
    @ 0x7f096f474c5a (unknown)
    @ 0x7f096f4758da (unknown)
    @ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
    @ 0x7f096f476c5a _PyFunction_FastCallDict
    @ 0x7f096f3cc6be _PyObject_FastCallDict
    @ 0x7f096f3cc7d1 _PyObject_Call_Prepend
    @ 0x7f096f3cc443 PyObject_Call
    @ 0x7f096f41f555 (unknown)
    @ 0x7f096f41bf12 (unknown)
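For what it's worth, the `ncclGroupEnd() : unhandled system error` is raised by NCCL rather than OneFlow itself, so NCCL's own diagnostics are a reasonable next step. A hedged sketch using only standard NCCL environment variables and nvidia-smi (nothing OneFlow-specific):

```bash
# Show the GPU interconnect topology for the GPUs being used.
nvidia-smi topo -m

# Re-run with verbose NCCL logging to see which transport fails during init.
export NCCL_DEBUG=INFO
sh train.sh 2>&1 | tee train_3gpu_nccl.log

# If the log implicates shared-memory or peer-to-peer setup, these standard NCCL
# toggles can help isolate the cause (at the cost of performance):
# export NCCL_SHM_DISABLE=1
# export NCCL_P2P_DISABLE=1
```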