Oneflow-Inc / OneAutoTest

Auto-Test System
Apache License 2.0

resnet50 ends abruptly partway through the first epoch #82

Open smile0655 opened 1 year ago

smile0655 commented 1 year ago

Problem: the daily ResNet50 job did not run to completion; it stopped abruptly partway through the first epoch.
Script: https://github.com/Oneflow-Inc/OneAutoTest/blob/main/onebench/models/ResNet50/run_week.sh
Error log:


*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
------------------------ arguments ------------------------
batches_per_epoch ............................... 1000
channel_last .................................... True
ddp ............................................. False
fuse_bn_add_relu ................................ True
fuse_bn_relu .................................... True
gpu_stat_file ................................... None
grad_clipping ................................... 0.0
graph ........................................... True
label_smoothing ................................. 0.1
learning_rate ................................... 1.28
legacy_init ..................................... False
load_path ....................................... None
lr_decay_type ................................... cosine
metric_local .................................... True
metric_train_acc ................................ True
momentum ........................................ 0.875
nccl_fusion_max_ops ............................. 24
nccl_fusion_threshold_mb ........................ 16
num_classes ..................................... 1000
num_devices_per_node ............................ 8
num_epochs ...................................... 50
num_nodes ....................................... 1
ofrecord_part_num ............................... 256
ofrecord_path ................................... /ssd/dataset/ImageNet/ofrecord
print_interval .................................. 100
print_timestamp ................................. False
samples_per_epoch ............................... 1281167
save_init ....................................... False
save_path ....................................... None
scale_grad ...................................... True
skip_eval ....................................... False
synthetic_data .................................. False
total_batches ................................... -1
train_batch_size ................................ 40
train_global_batch_size ......................... 1280
use_fp16 ........................................ True
use_gpu_decode .................................. True
val_batch_size .................................. 20
val_batches_per_epoch ........................... 78
val_global_batch_size ........................... 640
val_samples_per_epoch ........................... 50000
warmup_epochs ................................... 5
weight_decay .................................... 3.0517578125e-05
zero_init_residual .............................. True
-------------------- end of arguments ---------------------
***** Model Init *****
W20230526 08:19:08.200799 999559 eager_local_op_interpreter.cpp:272] Casting a local tensor to a global tensor with Broadcast sbp will modify the data of input! If you want to keep the input local tensor unchanged, please set the arg copy to True.
***** Model Init Finish, time escapled: 1.93146 s ***** [rank:2] [train], epoch: 0/50, iter: 100/1000, loss: 0.86197, top1: 0.00300, throughput: 84.68 | 2023-05-26 08:19:55.622 [rank:6] [train], epoch: 0/50, iter: 100/1000, loss: 0.86224, top1: 0.00300, throughput: 84.68 | 2023-05-26 08:19:55.623[rank:5] [train], epoch: 0/50, iter: 100/1000, loss: 0.86172, top1: 0.00262, throughput: 84.68 | 2023-05-26 08:19:55.624 [rank:7] [train], epoch: 0/50, iter: 100/1000, loss: 0.86216, top1: 0.00350, throughput: 84.68 | 2023-05-26 08:19:55.625 [rank:0] [train], epoch: 0/50, iter: 100/1000, loss: 0.86167, top1: 0.00281, throughput: 84.68 | 2023-05-26 08:19:55.625 [rank:4] [train], epoch: 0/50, iter: 100/1000, loss: 0.86215, top1: 0.00256, throughput: 84.68 | 2023-05-26 08:19:55.622 [rank:1] [train], epoch: 0/50, iter: 100/1000, loss: 0.86248, top1: 0.00275, throughput: 84.67 | 2023-05-26 08:19:55.624 [rank:3] [train], epoch: 0/50, iter: 100/1000, loss: 0.86204, top1: 0.00294, throughput: 84.67 | 2023-05-26 08:19:55.624 timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2023/05/26 08:19:55.698, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2023/05/26 08:19:55.699, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2023/05/26 08:19:55.700, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], 
memory.used [MiB] timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2023/05/26 08:19:55.700, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB timestamp, name, driver_version, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB] 2023/05/26 08:19:55.701, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.701, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB 2023/05/26 08:19:55.702, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB 2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB 2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.703, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB 2023/05/26 08:19:55.704, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 50 %, 24576 MiB, 13402 MiB, 10865 MiB 2023/05/26 08:19:55.705, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.706, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.708, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.709, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.709, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.709, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 
2023/05/26 08:19:55.710, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.712, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 62 %, 24576 MiB, 15675 MiB, 8592 MiB 2023/05/26 08:19:55.712, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.713, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.715, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.716, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.716, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.716, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.718, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.719, NVIDIA GeForce RTX 3090, 515.65.01, 100 %, 49 %, 24576 MiB, 15681 MiB, 8586 MiB 2023/05/26 08:19:55.720, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.721, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.723, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.724, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.724, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.725, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.726, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.728, NVIDIA GeForce RTX 3090, 515.65.01, 92 %, 57 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.728, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 
08:19:55.729, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.731, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.732, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.732, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.733, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.733, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.735, NVIDIA GeForce RTX 3090, 515.65.01, 73 %, 41 %, 24576 MiB, 15707 MiB, 8560 MiB 2023/05/26 08:19:55.735, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.736, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.738, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.740, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.740, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.741, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.742, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.744, NVIDIA GeForce RTX 3090, 515.65.01, 99 %, 63 %, 24576 MiB, 15713 MiB, 8554 MiB 2023/05/26 08:19:55.744, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.745, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.748, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.748, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.748, NVIDIA 
GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.749, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.751, NVIDIA GeForce RTX 3090, 515.65.01, 83 %, 50 %, 24576 MiB, 15693 MiB, 8574 MiB 2023/05/26 08:19:55.752, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.754, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.755, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.756, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB 2023/05/26 08:19:55.758, NVIDIA GeForce RTX 3090, 515.65.01, 66 %, 35 %, 24576 MiB, 15699 MiB, 8568 MiB [rank:3] [train], epoch: 0/50, iter: 200/1000, loss: 0.83563, top1: 0.01081, throughput: 264.51 | 2023-05-26 08:20:10.746 [rank:4] [train], epoch: 0/50, iter: 200/1000, loss: 0.83491, top1: 0.01112, throughput: 264.48 | 2023-05-26 08:20:10.746 [rank:5] [train], epoch: 0/50, iter: 200/1000, loss: 0.83465, top1: 0.01087, throughput: 264.50 | 2023-05-26 08:20:10.747 [rank:0] [train], epoch: 0/50, iter: 200/1000, loss: 0.83560, top1: 0.01038, throughput: 264.48 | 2023-05-26 08:20:10.749 [rank:2] [train], epoch: 0/50, iter: 200/1000, loss: 0.83530, top1: 0.00956, throughput: 264.44 | 2023-05-26 08:20:10.748 [rank:1] [train], epoch: 0/50, iter: 200/1000, loss: 0.83483, top1: 0.01100, throughput: 264.50 | 2023-05-26 08:20:10.747 [rank:6] [train], epoch: 0/50, iter: 200/1000, loss: 0.83393, top1: 0.01156, throughput: 264.48 | 2023-05-26 08:20:10.747 [rank:7] [train], epoch: 0/50, iter: 200/1000, loss: 0.83595, top1: 0.01075, throughput: 264.48 | 2023-05-26 08:20:10.749 [rank:0] [train], epoch: 0/50, iter: 300/1000, loss: 0.81221, top1: 0.01681, throughput: 279.81 | 2023-05-26 08:20:25.044 [rank:3] [train], epoch: 0/50, iter: 300/1000, loss: 0.81065, top1: 0.01719, throughput: 279.77 | 
2023-05-26 08:20:25.044 [rank:4] [train], epoch: 0/50, iter: 300/1000, loss: 0.81161, top1: 0.01575, throughput: 279.77 | 2023-05-26 08:20:25.044 [rank:6] [train], epoch: 0/50, iter: 300/1000, loss: 0.81175, top1: 0.01575, throughput: 279.77 | 2023-05-26 08:20:25.045 [rank:7] [train], epoch: 0/50, iter: 300/1000, loss: 0.81004, top1: 0.01738, throughput: 279.80 | 2023-05-26 08:20:25.045 [rank:2] [train], epoch: 0/50, iter: 300/1000, loss: 0.81092, top1: 0.01550, throughput: 279.79 | 2023-05-26 08:20:25.044 [rank:1] [train], epoch: 0/50, iter: 300/1000, loss: 0.81099, top1: 0.01838, throughput: 279.77 | 2023-05-26 08:20:25.045 [rank:5] [train], epoch: 0/50, iter: 300/1000, loss: 0.81109, top1: 0.01731, throughput: 279.77 | 2023-05-26 08:20:25.045 [rank:2] [train], epoch: 0/50, iter: 400/1000, loss: 0.79602, top1: 0.02162, throughput: 285.49 | 2023-05-26 08:20:39.055 [rank:7] [train], epoch: 0/50, iter: 400/1000, loss: 0.79583, top1: 0.02387, throughput: 285.50 | 2023-05-26 08:20:39.055 [rank:4] [train], epoch: 0/50, iter: 400/1000, loss: 0.79490, top1: 0.02044, throughput: 285.47 | 2023-05-26 08:20:39.056 [rank:5] [train], epoch: 0/50, iter: 400/1000, loss: 0.79409, top1: 0.02188, throughput: 285.49 | 2023-05-26 08:20:39.056 [rank:1] [train], epoch: 0/50, iter: 400/1000, loss: 0.79505, top1: 0.02175, throughput: 285.51 | 2023-05-26 08:20:39.055 [rank:3] [train], epoch: 0/50, iter: 400/1000, loss: 0.79605, top1: 0.02013, throughput: 285.48 | 2023-05-26 08:20:39.055 [rank:0] [train], epoch: 0/50, iter: 400/1000, loss: 0.79391, top1: 0.02344, throughput: 285.47 | 2023-05-26 08:20:39.056 [rank:6] [train], epoch: 0/50, iter: 400/1000, loss: 0.79521, top1: 0.02175, throughput: 285.49 | 2023-05-26 08:20:39.056 [rank:3] [train], epoch: 0/50, iter: 500/1000, loss: 0.78024, top1: 0.02750, throughput: 280.28 | 2023-05-26 08:20:53.327 [rank:4] [train], epoch: 0/50, iter: 500/1000, loss: 0.78196, top1: 0.02756, throughput: 280.26 | 2023-05-26 08:20:53.328 [rank:5] [train], 
epoch: 0/50, iter: 500/1000, loss: 0.78153, top1: 0.02837, throughput: 280.25 | 2023-05-26 08:20:53.329 [rank:7] [train], epoch: 0/50, iter: 500/1000, loss: 0.78232, top1: 0.02688, throughput: 280.24 | 2023-05-26 08:20:53.329 [rank:0] [train], epoch: 0/50, iter: 500/1000, loss: 0.78100, top1: 0.02794, throughput: 280.24 | 2023-05-26 08:20:53.329 [rank:2] [train], epoch: 0/50, iter: 500/1000, loss: 0.78063, top1: 0.02869, throughput: 280.26 | 2023-05-26 08:20:53.328 [rank:6] [train], epoch: 0/50, iter: 500/1000, loss: 0.78208, top1: 0.02794, throughput: 280.24 | 2023-05-26 08:20:53.329 [rank:1] [train], epoch: 0/50, iter: 500/1000, loss: 0.78039, top1: 0.02725, throughput: 280.22 | 2023-05-26 08:20:53.330 [rank:4] [train], epoch: 0/50, iter: 600/1000, loss: 0.76739, top1: 0.03381, throughput: 285.81 | 2023-05-26 08:21:07.323 [rank:6] [train], epoch: 0/50, iter: 600/1000, loss: 0.76635, top1: 0.03394, throughput: 285.83 | 2023-05-26 08:21:07.324 [rank:1] [train], epoch: 0/50, iter: 600/1000, loss: 0.76907, top1: 0.03556, throughput: 285.83 | 2023-05-26 08:21:07.324 [rank:7] [train], epoch: 0/50, iter: 600/1000, loss: 0.76877, top1: 0.03425, throughput: 285.80 | 2023-05-26 08:21:07.324 [rank:0] [train], epoch: 0/50, iter: 600/1000, loss: 0.76683, top1: 0.03475, throughput: 285.82[rank:2] [train], epoch: 0/50, iter: 600/1000, loss: 0.76919, top1: 0.03287, throughput: 285.79[rank:5] [train], epoch: 0/50, iter: 600/1000, loss: 0.76843, top1: 0.03237, throughput: 285.79 | 2023-05-26 08:21:07.325| 2023-05-26 08:21:07.324 | 2023-05-26 08:21:07.324 [rank:3] [train], epoch: 0/50, iter: 600/1000, loss: 0.76834, top1: 0.03563, throughput: 285.78 | 2023-05-26 08:21:07.324 [rank:0] [train], epoch: 0/50, iter: 700/1000, loss: 0.75627, top1: 0.04188, throughput: 281.29 | 2023-05-26 08:21:21.544 [rank:6] [train], epoch: 0/50, iter: 700/1000, loss: 0.75528, top1: 0.04062, throughput: 281.28 | 2023-05-26 08:21:21.544 [rank:2] [train], epoch: 0/50, iter: 700/1000, loss: 0.75596, top1: 
0.03844, throughput: 281.28 | 2023-05-26 08:21:21.545 [rank:3] [train], epoch: 0/50, iter: 700/1000, loss: 0.75508, top1: 0.04425, throughput: 281.27 | 2023-05-26 08:21:21.545 [rank:4] [train], epoch: 0/50, iter: 700/1000, loss: 0.75610, top1: 0.04125, throughput: 281.25 | 2023-05-26 08:21:21.546 [rank:7] [train], epoch: 0/50, iter: 700/1000, loss: 0.75725, top1: 0.03919, throughput: 281.27 | 2023-05-26 08:21:21.546 [rank:5] [train], epoch: 0/50, iter: 700/1000, loss: 0.75407, top1: 0.04281, throughput: 281.24 | 2023-05-26 08:21:21.548 [rank:1] [train], epoch: 0/50, iter: 700/1000, loss: 0.75588, top1: 0.04056, throughput: 281.21 | 2023-05-26 08:21:21.548 [rank:7] [train], epoch: 0/50, iter: 800/1000, loss: 0.74315, top1: 0.04856, throughput: 280.02 | 2023-05-26 08:21:35.830 [rank:2] [train], epoch: 0/50, iter: 800/1000, loss: 0.74185, top1: 0.04681, throughput: 279.99 | 2023-05-26 08:21:35.831 [rank:0] [train], epoch: 0/50, iter: 800/1000, loss: 0.74356, top1: 0.04856, throughput: 279.97 | 2023-05-26 08:21:35.831 [rank:1] [train], epoch: 0/50, iter: 800/1000, loss: 0.74202, top1: 0.04931, throughput: 280.04 | 2023-05-26 08:21:35.832 [rank:5] [train], epoch: 0/50, iter: 800/1000, loss: 0.74417, top1: 0.04938, throughput: 280.04 | 2023-05-26 08:21:35.832 [rank:3] [train], epoch: 0/50, iter: 800/1000, loss: 0.74433, top1: 0.04813, throughput: 279.98 | 2023-05-26 08:21:35.832 [rank:4] [train], epoch: 0/50, iter: 800/1000, loss: 0.74303, top1: 0.04869, throughput: 279.95 | 2023-05-26 08:21:35.834 [rank:6] [train], epoch: 0/50, iter: 800/1000, loss: 0.74371, top1: 0.04662, throughput: 279.90 | 2023-05-26 08:21:35.835 [rank:6] [train], epoch: 0/50, iter: 900/1000, loss: 0.72924, top1: 0.05775, throughput: 282.84 | 2023-05-26 08:21:49.977 [rank:2] [train], epoch: 0/50, iter: 900/1000, loss: 0.73020, top1: 0.05494, throughput: 282.76 | 2023-05-26 08:21:49.977 [rank:3] [train], epoch: 0/50, iter: 900/1000, loss: 0.72997, top1: 0.05850, throughput: 282.80 | 2023-05-26 
08:21:49.976 [rank:7] [train], epoch: 0/50, iter: 900/1000, loss: 0.73131, top1: 0.05394, throughput: 282.74 | 2023-05-26 08:21:49.978 [rank:0] [train], epoch: 0/50, iter: 900/1000, loss: 0.73049, top1: 0.05706, throughput: 282.77 | 2023-05-26 08:21:49.977 [rank:1] [train], epoch: 0/50, iter: 900/1000, loss: 0.72931, top1: 0.06069, throughput: 282.78 | 2023-05-26 08:21:49.977 [rank:5] [train], epoch: 0/50, iter: 900/1000, loss: 0.72920, top1: 0.05713, throughput: 282.73 | 2023-05-26 08:21:49.979 [rank:4] [train], epoch: 0/50, iter: 900/1000, loss: 0.73110, top1: 0.05444, throughput: 282.84 | 2023-05-26 08:21:49.976 [rank:6] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71598, top1: 0.06744, throughput: 279.09 | 2023-05-26 08:22:04.309 [rank:1] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71727, top1: 0.06263, throughput: 279.08 | 2023-05-26 08:22:04.310 [rank:7] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71911, top1: 0.06381, throughput: 279.09 | 2023-05-26 08:22:04.310 [rank:3] [train], epoch: 0/50, iter: 1000/1000, loss: 0.72109, top1: 0.06081, throughput: 279.06 | 2023-05-26 08:22:04.310 [rank:4] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71921, top1: 0.06088, throughput: 279.06 | 2023-05-26 08:22:04.310 [rank:0] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71580, top1: 0.06688, throughput: 279.08 | 2023-05-26 08:22:04.310 [rank:2] [train], epoch: 0/50, iter: 1000/1000, loss: 0.72062, top1: 0.06500, throughput: 279.05 | 2023-05-26 08:22:04.312 [rank:5] [train], epoch: 0/50, iter: 1000/1000, loss: 0.71976, top1: 0.06400, throughput: 279.11 | 2023-05-26 08:22:04.311 F20230526 08:22:06.753672 1001567 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** F20230526 08:22:06.776768 1001614 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL F20230526 08:22:06.776928 1001622 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** *** Check 
failure stack trace: *** F20230526 08:22:06.778252 1001596 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** F20230526 08:22:06.781836 1001541 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** @ 0x7f16b72b6e9a google::LogMessage::Fail() F20230526 08:22:06.793205 1001690 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** F20230526 08:22:06.797328 1001585 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** F20230526 08:22:06.797571 1001620 normalization_kernel.cu:113] Check failed: 'tensor' Must be non NULL *** Check failure stack trace: *** @ 0x7fac48c13e9a google::LogMessage::Fail() @ 0x7f1aa6fade9a google::LogMessage::Fail() @ 0x7fa4773f0e9a google::LogMessage::Fail() @ 0x7f459cc5ce9a google::LogMessage::Fail() @ 0x7f16b72b9bd1 google::LogMessage::SendToLog() @ 0x7f054fe5fe9a google::LogMessage::Fail() @ 0x7f91a3ae9e9a google::LogMessage::Fail() @ 0x7f1db80fae9a google::LogMessage::Fail() @ 0x7f1aa6fb0bd1 google::LogMessage::SendToLog() @ 0x7fac48c16bd1 google::LogMessage::SendToLog() @ 0x7fa4773f3bd1 google::LogMessage::SendToLog() @ 0x7f459cc5fbd1 google::LogMessage::SendToLog() @ 0x7f16b72b6998 google::LogMessage::Flush() @ 0x7f054fe62bd1 google::LogMessage::SendToLog() @ 0x7f1db80fdbd1 google::LogMessage::SendToLog() @ 0x7f91a3aecbd1 google::LogMessage::SendToLog() @ 0x7f1aa6fad998 google::LogMessage::Flush() @ 0x7fac48c13998 google::LogMessage::Flush() @ 0x7f459cc5c998 google::LogMessage::Flush() @ 0x7fa4773f0998 google::LogMessage::Flush() @ 0x7f16b72ba259 google::LogMessageFatal::~LogMessageFatal() @ 0x7f054fe5f998 google::LogMessage::Flush() @ 0x7f1db80fa998 google::LogMessage::Flush() @ 0x7f91a3ae9998 google::LogMessage::Flush() @ 0x7f1aa6fb1259 google::LogMessageFatal::~LogMessageFatal() @ 0x7fac48c17259 
google::LogMessageFatal::~LogMessageFatal() @ 0x7fa4773f4259 google::LogMessageFatal::~LogMessageFatal() @ 0x7f459cc60259 google::LogMessageFatal::~LogMessageFatal() @ 0x7f054fe63259 google::LogMessageFatal::~LogMessageFatal() @ 0x7f1db80fe259 google::LogMessageFatal::~LogMessageFatal() @ 0x7f16b20c9aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7f91a3aed259 google::LogMessageFatal::~LogMessageFatal() @ 0x7fa472203aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7f1aa1dc0aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7f4597a6faef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7fac43a26aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7f1db2f0daef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7f919e8fcaef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7f054ac72aef oneflow::(anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor() @ 0x7fa47220a242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7f1aa1dc7242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7f16b20d0242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7fac43a2d242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7f4597a76242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7f1db2f14242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7f919e903242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7fa46f9034ad oneflow::UserKernel::ForwardUserKernel() @ 0x7f1a9f4c04ad oneflow::UserKernel::ForwardUserKernel() @ 0x7f054ac79242 oneflow::(anonymous namespace)::FusedNormalizationAddReluKernel::Compute() @ 0x7fac411264ad oneflow::UserKernel::ForwardUserKernel() @ 
0x7f16af7c94ad oneflow::UserKernel::ForwardUserKernel() @ 0x7f459516f4ad oneflow::UserKernel::ForwardUserKernel() @ 0x7f1db060d4ad oneflow::UserKernel::ForwardUserKernel() @ 0x7f919bffc4ad oneflow::UserKernel::ForwardUserKernel() @ 0x7fa46f90369b oneflow::UserKernel::ForwardDataContent() @ 0x7f1a9f4c069b oneflow::UserKernel::ForwardDataContent() @ 0x7f05483724ad oneflow::UserKernel::ForwardUserKernel() @ 0x7fac4112669b oneflow::UserKernel::ForwardDataContent() @ 0x7f459516f69b oneflow::UserKernel::ForwardDataContent() @ 0x7f16af7c969b oneflow::UserKernel::ForwardDataContent() @ 0x7f1db060d69b oneflow::UserKernel::ForwardDataContent() @ 0x7f919bffc69b oneflow::UserKernel::ForwardDataContent() @ 0x7f054837269b oneflow::UserKernel::ForwardDataContent() @ 0x7f1a9f481c53 oneflow::Kernel::Forward() @ 0x7fa46f8c4c53 oneflow::Kernel::Forward() @ 0x7fac410e7c53 oneflow::Kernel::Forward() @ 0x7f4595130c53 oneflow::Kernel::Forward() @ 0x7f16af78ac53 oneflow::Kernel::Forward() @ 0x7f1db05cec53 oneflow::Kernel::Forward() @ 0x7f919bfbdc53 oneflow::Kernel::Forward() @ 0x7f0548333c53 oneflow::Kernel::Forward() @ 0x7fa46f8c5229 oneflow::Kernel::Launch() @ 0x7f1a9f482229 oneflow::Kernel::Launch() @ 0x7fac410e8229 oneflow::Kernel::Launch() @ 0x7f4595131229 oneflow::Kernel::Launch() @ 0x7fa46fc1c4e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f1a9f7d94e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7fac4143f4e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f45954884e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f16af78b229 oneflow::Kernel::Launch() @ 0x7f16afae24e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f1a9fef9b58 oneflow::Thread::PollMsgChannel() @ 0x7fa47033cb58 oneflow::Thread::PollMsgChannel() @ 0x7f1db05cf229 oneflow::Kernel::Launch() @ 0x7f919bfbe229 oneflow::Kernel::Launch() @ 0x7f4595ba8b58 oneflow::Thread::PollMsgChannel() @ 0x7fac41b5fb58 oneflow::Thread::PollMsgChannel() @ 
0x7f0548334229 oneflow::Kernel::Launch() @ 0x7f1a9fefb00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7fa47033e00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7f1db09264e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f919c3154e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f4595baa00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7fac41b6100e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7f16b0202b58 oneflow::Thread::PollMsgChannel() @ 0x7f054868b4e4 oneflow::(anonymous namespace)::LightActor<>::ProcessMsg() @ 0x7f16b020400e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7f1aa6fc0a70 execute_native_thread_routine @ 0x7fa477403a70 execute_native_thread_routine @ 0x7f1bb0a06609 start_thread @ 0x7fa580e49609 start_thread @ 0x7f1bb07d1133 clone Stack trace (most recent call last) in thread 1001614: @ 0x7fa580c14133 clone Stack trace (most recent call last) in thread 1001596: @ 0x7f459cc6fa70 execute_native_thread_routine @ 0x7fac48c26a70 execute_native_thread_routine @ 0x7f46a66b5609 start_thread @ 0x7fad5266c609 start_thread @ 0x7f46a6480133 clone Stack trace (most recent call last) in thread 1001541: @ 0x7fad52437133 clone Stack trace (most recent call last) in thread 1001622: @ 0x7f16b72c9a70 execute_native_thread_routine @ 0x7f17c0d0f609 start_thread @ 0x7f1db1046b58 oneflow::Thread::PollMsgChannel() @ 0x7f17c0ada133 clone Stack trace (most recent call last) in thread 1001567: @ 0x7f919ca35b58 oneflow::Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1aa6fc0a6f, in Object 
"/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so0x7fa477403a6f", at , in 0x7f1a9fefb00d , in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa47033e00d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a9fef9b57, in Thread::PollMsgChannel() Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at ", at 0x7f1a9f7d94e3, in 0x7fa47033cb57 , in Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa46fc1c4e3 Object ", in /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so ", at 0x7f1a9f482228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a9f481c52 Object ", in /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.soKernel::Forward(KernelContext*) const", at 0x7fa46f8c5228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so Object "", at 
/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so0x7f1a9f4c069a", at , in 0x7fa46f8c4c52UserKernel::ForwardDataContent(KernelContext*) const, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa46f90369a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a9f4c04ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at Object "0x7f1aa1dc7241/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so, in ", at (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const0x7fa46f9034ac , in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at Object "0x7f1aa1dc0aee/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so, in ", at (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const0x7fa47220a241 , in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1aa6fb1258, in Object " Object 
"/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at ", at 0x7fa472203aee0x7f1aa6fad997, in , in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at ", at 0x7f1aa6fb0bd00x7fa4773f4258, in , in Object " Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at /data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so0x7f1aa6fade99", at , in 0x7fa4773f0997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1a988ddebe, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa4773f3bd0, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa4773f0e99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fa468d20ebe, in Aborted (Signal sent by tkill() 999564 1017) Aborted (Signal sent by tkill() 999563 1017) @ 0x7f0548dabb58 oneflow::Thread::PollMsgChannel() @ 0x7f1db104800e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7f919ca3700e 
_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc6fa6f, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595baa00d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595ba8b57, in Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f45954884e3, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595131228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4595130c52, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459516f69a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459516f4ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4597a76241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object 
"/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f4597a6faee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc60258, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc5c997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc5fbd0, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f459cc5ce99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f458e58cebe, in Aborted (Signal sent by tkill() 999566 1017) Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c26a6f, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac41b6100d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac41b5fb57, in Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac4143f4e3, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac410e8228, in Kernel::Launch(KernelContext*) const Object 
"/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac410e7c52, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac4112669a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac411264ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac43a2d241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac43a26aee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c17258, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c13997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c16bd0, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac48c13e99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7fac3a543ebe, in Aborted (Signal sent by tkill() 999568 1017) @ 
0x7f0548dad00e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv @ 0x7f1db810da70 execute_native_thread_routine @ 0x7f91a3afca70 execute_native_thread_routine @ 0x7f1ec1b53609 start_thread @ 0x7f1ec191e133 clone Stack trace (most recent call last) in thread 1001585: @ 0x7f92ad542609 start_thread Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72c9a6f, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b020400d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b0202b57, in Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16afae24e3, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af78b228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af78ac52, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af7c969a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16af7c94ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 
0x7f16b20d0241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b20c9aee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72ba258, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72b6997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72b9bd0, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16b72b6e99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f16a8be6ebe, in Aborted (Signal sent by tkill() 999559 1017) @ 0x7f92ad30d133 clone Stack trace (most recent call last) in thread 1001620: @ 0x7f054fe72a70 execute_native_thread_routine @ 0x7f06598b8609 start_thread @ 0x7f0659683133 clone Stack trace (most recent call last) in thread 1001690: Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db810da6f, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db104800d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db1046b57, in Thread::PollMsgChannel() Object 
"/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db09264e3, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db05cf228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db05cec52, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db060d69a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db060d4ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db2f14241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db2f0daee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fe258, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fa997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fdbd0, in Object 
"/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1db80fae99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f1da9a2aebe, in Aborted (Signal sent by tkill() 999560 1017) Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3afca6f, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919ca3700d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919ca35b57, in Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919c3154e3, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bfbe228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bfbdc52, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bffc69a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919bffc4ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 
0x7f919e903241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f919e8fcaee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3aed258, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3ae9997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3aecbd0, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f91a3ae9e99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f9195419ebe, in Aborted (Signal sent by tkill() 999562 1017) Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe72a6f, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548dad00d, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548dabb57, in Thread::PollMsgChannel() Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054868b4e3, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", 
at 0x7f0548334228, in Kernel::Launch(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f0548333c52, in Kernel::Forward(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054837269a, in UserKernel::ForwardDataContent(KernelContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f05483724ac, in UserKernel::ForwardUserKernel(std::function const&, user_op::OpKernelState*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054ac79241, in (anonymous namespace)::FusedNormalizationAddReluKernel::Compute(user_op::KernelComputeContext*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054ac72aee, in (anonymous namespace)::CudnnTensorDescHelper::CheckParamTensor(user_op::Tensor const*) const Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe63258, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe5f997, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe62bd0, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 0x7f054fe5fe99, in Object "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/../oneflow.libs/liboneflow-4c797c43.so", at 
0x7f054178febe, in
Aborted (Signal sent by tkill() 999561 1017)
Killing subprocess 999559
Killing subprocess 999560
Killing subprocess 999561
Killing subprocess 999562
Killing subprocess 999563
Killing subprocess 999564
Killing subprocess 999566
Killing subprocess 999568
Traceback (most recent call last):
  File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 240, in
    main()
  File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/data/home/zhouhongjun/miniconda3/envs/week_resnet/lib/python3.8/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/data/home/zhouhongjun/miniconda3/envs/week_resnet/bin/python3', '-u', '/data/home/zhouhongjun/week_test/models/Vision/classification/image/resnet50/train.py', '--ofrecord-path', '/ssd/dataset/ImageNet/ofrecord', '--ofrecord-part-num', '256', '--num-devices-per-node', '8', '--lr', '1.28', '--momentum', '0.875', '--num-epochs', '50', '--train-batch-size', '40', '--train-global-batch-size', '1280', '--val-batch-size', '20', '--val-global-batch-size', '640', '--print-interval', '100', '--use-fp16', '--channel-last', '--scale-grad', '--graph', '--fuse-bn-relu', '--fuse-bn-add-relu', '--use-gpu-decode']' died with .

oneflow-version(git_commit)=0.9.1.dev20230525+cu117
oneflow-commit(git_commit)=08ded68
oneflow-models(git_commit)=fc7cbf8da9b2ee21fa0e9613dd0668c3b45dad4d
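
Because the eight worker processes write their stack traces concurrently, the log above is interleaved and hard to read. One way to summarize it is to count the symbolized C++ frame names that appear. The following is a minimal sketch, not part of OneFlow or OneAutoTest: the helper name `count_frames` and the regular expression are assumptions made for illustration, and the sample string is a shortened excerpt in the same shape as the log.

```python
import re
from collections import Counter

# Assumption: a symbolized frame looks like a C++ qualified name
# (optionally prefixed by "(anonymous namespace)::") followed by "(".
FRAME_RE = re.compile(
    r"((?:\(anonymous namespace\)::)?(?:[A-Za-z_]\w*::)+[A-Za-z_]\w*)\("
)


def count_frames(log_text):
    """Count how often each symbolized frame name occurs in the log text."""
    return Counter(FRAME_RE.findall(log_text))


# Shortened sample in the same shape as the traces in this issue.
sample = (
    'at 0x1, in Kernel::Launch(KernelContext*) const Object "x.so", '
    "at 0x2, in (anonymous namespace)::CudnnTensorDescHelper::"
    "CheckParamTensor(user_op::Tensor const*) const "
    "at 0x3, in Kernel::Launch(KernelContext*) const"
)

for name, n in count_frames(sample).most_common():
    print(n, name)
```

On the full log, this kind of count makes it easier to see that every process aborts on the same path, through `FusedNormalizationAddReluKernel::Compute` into `CudnnTensorDescHelper::CheckParamTensor`. Where output from two processes is fused mid-token (for example an address or `.so` suffix running directly into a symbol name), the pattern can return a name with stray leading characters, so the counts should be treated as approximate.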