NCCL communication aborted: unhandled system error

When I was running distributed training use DeepRec+HybridBackend，after training some steps， encountered the following error: NCCL communication aborted: unhandled system error

System info

GPU model and memory: V100 (PCIe), GPU mem 16GB NCCL version: 2.12.10-1 Docker CUDA/cuDNN version: 11.6.55-1 Machine Driver/CUDA version: Driver Version: 515.65.01 CUDA Version: 11.7 Machine CPU：44 cores Machine Memory：240G Python version: 3.8.10 TensorFlow + DeepRec version: 1.15.5+deeprec2306 HybridBackend hybridbackend-deeprec-cu116 1.0.0

GPU topo info

       GPU0   GPU1  GPU2  GPU3  CPU Affinity    NUMA Affinity
GPU0     X    NODE  SYS  SYS   0,2,4,6,8,10 0
GPU1    NODE   X    SYS  SYS   0,2,4,6,8,10 0
GPU2    SYS   SYS   X    NODE   1,3,5,7,9,11    1
GPU3    SYS   SYS  NODE  X     1,3,5,7,9,11 1

ifconfig info

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1440 inet 10.251.101.11 netmask 255.255.255.255 broadcast 0.0.0.0 ether 72:cd:5c:94:78:fd txqueuelen 0 (Ethernet) RX packets 27019 bytes 107148969 (107.1 MB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 23625 bytes 2888340 (2.8 MB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 loop txqueuelen 1 (Local Loopback) RX packets 153724 bytes 1167724791 (1.1 GB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 153724 bytes 1167724791 (1.1 GB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Trainging info

data-parallel distributed synchronous training，1chief 3workers，use k8s create 4 pods，4gpus in one gpu machine，800GB of data，batch_size 4096

Log info

error throws from https://github.com/DeepRec-AI/HybridBackend/blob/main/hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc chief:

2023/10/26 09:16:30 INFO:tensorflow:Done calling model_fn. 2023/10/26 09:16:30 INFO:tensorflow:Create CheckpointSaverHook. 2023/10/26 09:16:30 INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:../sug_ple_model/.incremental_checkpoint/incremental_model.ckpt 2023/10/26 09:16:49 INFO:tensorflow:Graph was finalized. 2023/10/26 09:16:49 INFO:tensorflow:run without loading checkpoint 2023/10/26 09:17:03 job-1111-2222-0:293:489 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.90<0> 2023/10/26 09:17:03 job-1111-2222-0:293:489 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so). 2023/10/26 09:17:03 job-1111-2222-0:293:489 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:04 2023-10-26 09:17:04.082979: I ./tensorflow/core/framework/ev_allocator.h:555] EVAllocator set arena array size: 128 2023/10/26 09:17:24 INFO:tensorflow:Running local_init_op. 2023/10/26 09:17:26 job-1111-2222-0:293:467 [0] NCCL INFO Bootstrap : Using eth0:10.241.120.90<0> 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.90<0> 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO Using network Socket 2023/10/26 09:17:29 NCCL version 2.12.10+cuda11.6 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/02 : 0 1 2 3 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/02 : 0 1 2 3 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->1 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 3[b1000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/0 : 3[b1000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/0 : 0[d9000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Connected all rings 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 2[3b000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Connected all trees 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO comm 0x7fda2c015e20 rank 0 nranks 4 cudaDev 0 busId d9000 - Init COMPLETE 2023/10/26 09:17:29 INFO:tensorflow:Done running local_init_op. 2023/10/26 09:18:02 INFO:tensorflow:Saving checkpoints for 0 into ../sug_ple_model/model.ckpt. 2023/10/26 09:18:11 INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None 2023/10/26 09:19:04 2023-10-26 09:19:04.046545: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.049190: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.049300: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.049833: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 job-1111-2222-0:293:762 [0] NCCL INFO Launch mode Parallel 2023/10/26 09:19:04 2023-10-26 09:19:04.059969: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.063592: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.086145: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.088319: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.093784: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.094269: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.282315: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublas.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.176644: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublasLt.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.176751: I tensorflow/stream_executor/cuda/cuda_blas.cc:239] Create cuBLASLt success. 2023/10/26 09:19:05 INFO:tensorflow:loss = 0.1913051, step = 0 2023/10/26 09:19:05 INFO:tensorflow:ctctr_loss = 0.024605626, ctcvr_loss = 0.0008409239, ctr_loss = 0.16585855, loss = 0.1913051 2023/10/26 09:25:41 2023-10-26 09:25:41.450087: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled 2023/10/26 09:33:05 INFO:tensorflow:global_step/sec: 0.119048 2023/10/26 09:33:05 INFO:tensorflow:loss = 0.046956144, step = 100 (839.986 sec) 2023/10/26 09:33:05 INFO:tensorflow:ctctr_loss = 0.016574726, ctcvr_loss = 0.00023758807, ctr_loss = 0.03014383, loss = 0.046956144 (839.985 sec) 2023/10/26 09:46:52 INFO:tensorflow:global_step/sec: 0.120919 2023/10/26 09:46:52 INFO:tensorflow:loss = 0.03203261, step = 200 (827.000 sec) 2023/10/26 09:46:52 INFO:tensorflow:ctctr_loss = 0.013954056, ctcvr_loss = 0.00028846407, ctr_loss = 0.01779009, loss = 0.03203261 (827.000 sec) 2023/10/26 10:00:37 INFO:tensorflow:global_step/sec: 0.12121 2023/10/26 10:00:37 INFO:tensorflow:loss = 0.031852365, step = 300 (825.013 sec) 2023/10/26 10:00:37 INFO:tensorflow:ctctr_loss = 0.014548324, ctcvr_loss = 5.2825588e-05, ctr_loss = 0.017251218, loss = 0.031852365 (825.014 sec) 2023/10/26 10:15:51 INFO:tensorflow:global_step/sec: 0.109466 2023/10/26 10:15:51 INFO:tensorflow:loss = 0.024153532, step = 400 (913.527 sec) 2023/10/26 10:15:51 INFO:tensorflow:ctctr_loss = 0.010566762, ctcvr_loss = 0.0004321469, ctr_loss = 0.013154624, loss = 0.024153532 (913.527 sec) 2023/10/26 10:29:32 INFO:tensorflow:global_step/sec: 0.121771 2023/10/26 10:29:32 INFO:tensorflow:loss = 0.03749844, step = 500 (821.217 sec) 2023/10/26 10:29:32 INFO:tensorflow:ctctr_loss = 0.017164528, ctcvr_loss = 0.00012922783, ctr_loss = 0.020204684, loss = 0.03749844 (821.217 sec) 2023/10/26 10:43:21 INFO:tensorflow:global_step/sec: 0.120586 2023/10/26 10:43:21 INFO:tensorflow:loss = 0.031007526, step = 600 (829.286 sec) 2023/10/26 10:43:21 INFO:tensorflow:ctctr_loss = 0.014418589, ctcvr_loss = 0.00030601368, ctr_loss = 0.016282924, loss = 0.031007526 (829.286 sec) 2023/10/26 10:57:14 INFO:tensorflow:global_step/sec: 0.120088 2023/10/26 10:57:14 INFO:tensorflow:loss = 0.036426634, step = 700 (832.720 sec) 2023/10/26 10:57:14 INFO:tensorflow:ctctr_loss = 0.016776908, ctcvr_loss = 0.000521396, ctr_loss = 0.01912833, loss = 0.036426634 (832.720 sec) 2023/10/26 11:10:51 INFO:tensorflow:global_step/sec: 0.122373 2023/10/26 11:10:51 INFO:tensorflow:loss = 0.03228343, step = 800 (817.172 sec) 2023/10/26 11:10:51 INFO:tensorflow:ctctr_loss = 0.014936367, ctcvr_loss = 0.00016124125, ctr_loss = 0.017185822, loss = 0.03228343 (817.172 sec) 2023/10/26 11:24:24 INFO:tensorflow:global_step/sec: 0.123052 2023/10/26 11:24:24 INFO:tensorflow:loss = 0.032250684, step = 900 (812.666 sec) 2023/10/26 11:24:24 INFO:tensorflow:ctctr_loss = 0.014788659, ctcvr_loss = 0.0004015968, ctr_loss = 0.017060427, loss = 0.032250684 (812.665 sec) 2023/10/26 11:38:02 INFO:tensorflow:Saving checkpoints for 1000 into ../sug_ple_model/model.ckpt. 2023/10/26 11:38:20 INFO:tensorflow:global_step/sec: 0.119588 2023/10/26 11:38:20 INFO:tensorflow:loss = 0.033876635, step = 1000 (836.209 sec) 2023/10/26 11:38:20 INFO:tensorflow:ctctr_loss = 0.015491962, ctcvr_loss = 0.0004009509, ctr_loss = 0.017983723, loss = 0.033876635 (836.209 sec) 2023/10/26 11:52:15 INFO:tensorflow:global_step/sec: 0.119757 2023/10/26 11:52:15 INFO:tensorflow:loss = 0.03441868, step = 1100 (835.023 sec) 2023/10/26 11:52:15 INFO:tensorflow:ctctr_loss = 0.015090618, ctcvr_loss = 0.0005589067, ctr_loss = 0.018769156, loss = 0.03441868 (835.023 sec) 2023/10/26 12:05:50 INFO:tensorflow:global_step/sec: 0.122696 2023/10/26 12:05:50 INFO:tensorflow:loss = 0.03474028, step = 1200 (815.024 sec) 2023/10/26 12:05:50 INFO:tensorflow:ctctr_loss = 0.015631478, ctcvr_loss = 0.0003598568, ctr_loss = 0.018748946, loss = 0.03474028 (815.023 sec) 2023/10/26 12:12:05 2023/10/26 12:12:05 job-1111-2222-0:293:765 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer job-1111-2222-3.service-1111-2222.autopai.svc.cluster.local<53572> 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO transport/net_socket.cc:493 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO include/net.h:32 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO transport/net.cc:996 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO proxy.cc:494 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread] 2023/10/26 12:12:06 2023-10-26 12:12:06.704113: E hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc:461] NCCL communication aborted: unhandled system error 2023/10/26 12:12:07 job-1111-2222-0:293:763 [0] NCCL INFO comm 0x7fda2c015e20 rank 0 nranks 4 cudaDev 0 busId d9000 - Abort COMPLETE 2023/10/26 12:12:07 2023-10-26 12:12:07.399692: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_create.cc:112 : Internal: unhandled system error 2023/10/26 12:12:20 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour 2023/10/26 12:12:20 warn(RuntimeWarning(msg)) 2023/10/26 12:12:20 Traceback (most recent call last): 2023/10/26 12:12:20 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main 2023/10/26 12:12:20 return _run_code(code, main_globals, None, 2023/10/26 12:12:20 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code 2023/10/26 12:12:20 exec(code, run_globals) 2023/10/26 12:12:20 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in 2023/10/26 12:12:20 launch([args.command] + args.args) 2023/10/26 12:12:20 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch 2023/10/26 12:12:20 subprocess.check_call(command) 2023/10/26 12:12:20 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call 2023/10/26 12:12:20 raise CalledProcessError(retcode, cmd) 2023/10/26 12:12:20 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=0', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>. 2023/10/26 12:12:20 start launching tensorflow job 2023/10/26 12:12:20 {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"chief"}} 2023/10/26 12:12:20 tf_args: ['--worker_index=0', '--worker_size=4', '--protocol=grpc'] 2023/10/26 12:12:20 TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"chief"}} 2023/10/26 12:12:20 Start to gen endpoint_map file. 2023/10/26 12:12:20 node_list: ['job-1111-2222-0.service-1111-2222:2222', 'job-1111-2222-1.service-1111-2222:2222', 'job-1111-2222-2.service-1111-2222:2222', 'job-1111-2222-3.service-1111-2222:2222'] 2023/10/26 12:12:20 run tensorflow command: cd ./model; COLLECTIVE_STRATEGY=hb $(which python) -m tensorflow.python.distribute.launch $(which python) -u run.py --worker_index=0 --worker_size=4 --protocol=grpc 2023/10/26 12:12:20 ret_code: 1 2023/10/26 12:12:21 train error code:0 2023/10/26 12:12:21 exec result:1

worker:

2023/10/26 09:16:39 INFO:tensorflow:Done calling model_fn. 2023/10/26 09:16:39 INFO:tensorflow:Create CheckpointSaverHook. 2023/10/26 09:16:39 INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:/tmp/tmpac24awf3/.incremental_checkpoint/incremental_model.ckpt 2023/10/26 09:16:58 INFO:tensorflow:Graph was finalized. 2023/10/26 09:16:58 INFO:tensorflow:run without loading checkpoint 2023/10/26 09:17:10 2023-10-26 09:17:10.584288: I ./tensorflow/core/framework/ev_allocator.h:555] EVAllocator set arena array size: 128 2023/10/26 09:17:25 INFO:tensorflow:Running local_init_op. 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Bootstrap : Using eth0:10.241.120.67<0> 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.67<0> 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Using network Socket 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Trees [0] 1/3/-1->2->0 [1] -1/-1/-1->2->1 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 1[5e000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 3[b1000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 01/0 : 2[3b000] -> 3[b1000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Connected all rings 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 0[d9000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 3[b1000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 01/0 : 2[3b000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Connected all trees 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO comm 0x7f0e1b123750 rank 2 nranks 4 cudaDev 0 busId 3b000 - Init COMPLETE 2023/10/26 09:17:29 INFO:tensorflow:Done running local_init_op. 2023/10/26 09:18:00 INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpac24awf3/model.ckpt. 2023/10/26 09:18:06 INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None 2023/10/26 09:19:04 2023-10-26 09:19:04.065744: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.065939: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.066213: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.066313: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.084363: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.084621: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.088631: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.111988: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.112148: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.112225: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.289962: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublas.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.170070: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublasLt.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.170152: I tensorflow/stream_executor/cuda/cuda_blas.cc:239] Create cuBLASLt success. 2023/10/26 09:19:05 INFO:tensorflow:loss = 0.19766834, step = 0 2023/10/26 09:19:05 INFO:tensorflow:ctctr_loss = 0.035726123, ctcvr_loss = 0.0010907989, ctr_loss = 0.16085143, loss = 0.19766834 2023/10/26 09:25:51 2023-10-26 09:25:51.287298: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled 2023/10/26 09:33:05 INFO:tensorflow:global_step/sec: 0.119053 2023/10/26 09:33:05 INFO:tensorflow:loss = 0.046693888, step = 100 (839.960 sec) 2023/10/26 09:33:05 INFO:tensorflow:ctctr_loss = 0.016436137, ctcvr_loss = 0.0005084026, ctr_loss = 0.029749347, loss = 0.046693888 (839.960 sec) 2023/10/26 09:46:52 INFO:tensorflow:global_step/sec: 0.120922 2023/10/26 09:46:52 INFO:tensorflow:loss = 0.029226346, step = 200 (826.981 sec) 2023/10/26 09:46:52 INFO:tensorflow:ctctr_loss = 0.012963342, ctcvr_loss = 0.00040178455, ctr_loss = 0.01586122, loss = 0.029226346 (826.981 sec) 2023/10/26 10:00:37 INFO:tensorflow:global_step/sec: 0.12121 2023/10/26 10:00:37 INFO:tensorflow:loss = 0.042071566, step = 300 (825.018 sec) 2023/10/26 10:00:37 INFO:tensorflow:ctctr_loss = 0.019438155, ctcvr_loss = 0.0003417362, ctr_loss = 0.022291675, loss = 0.042071566 (825.018 sec) 2023/10/26 10:15:51 INFO:tensorflow:global_step/sec: 0.109466 2023/10/26 10:15:51 INFO:tensorflow:loss = 0.034424614, step = 400 (913.527 sec) 2023/10/26 10:15:51 INFO:tensorflow:ctctr_loss = 0.015548481, ctcvr_loss = 0.0001899354, ctr_loss = 0.018686198, loss = 0.034424614 (913.527 sec) 2023/10/26 10:29:32 INFO:tensorflow:global_step/sec: 0.121767 2023/10/26 10:29:32 INFO:tensorflow:loss = 0.030513769, step = 500 (821.242 sec) 2023/10/26 10:29:32 INFO:tensorflow:ctctr_loss = 0.013715225, ctcvr_loss = 0.00022438697, ctr_loss = 0.016574156, loss = 0.030513769 (821.242 sec) 2023/10/26 10:43:21 INFO:tensorflow:global_step/sec: 0.120589 2023/10/26 10:43:21 INFO:tensorflow:loss = 0.02965854, step = 600 (829.263 sec) 2023/10/26 10:43:21 INFO:tensorflow:ctctr_loss = 0.013955852, ctcvr_loss = 0.00019742409, ctr_loss = 0.015505264, loss = 0.02965854 (829.263 sec) 2023/10/26 10:57:14 INFO:tensorflow:global_step/sec: 0.120092 2023/10/26 10:57:14 INFO:tensorflow:loss = 0.028629225, step = 700 (832.693 sec) 2023/10/26 10:57:14 INFO:tensorflow:ctctr_loss = 0.012916515, ctcvr_loss = 0.00010687506, ctr_loss = 0.015605834, loss = 0.028629225 (832.693 sec) 2023/10/26 11:10:51 INFO:tensorflow:global_step/sec: 0.12237 2023/10/26 11:10:51 INFO:tensorflow:loss = 0.03414993, step = 800 (817.193 sec) 2023/10/26 11:10:51 INFO:tensorflow:ctctr_loss = 0.015847603, ctcvr_loss = 0.00011763776, ctr_loss = 0.01818469, loss = 0.03414993 (817.193 sec) 2023/10/26 11:24:24 INFO:tensorflow:global_step/sec: 0.123054 2023/10/26 11:24:24 INFO:tensorflow:loss = 0.035816602, step = 900 (812.650 sec) 2023/10/26 11:24:24 INFO:tensorflow:ctctr_loss = 0.016396968, ctcvr_loss = 0.00031063452, ctr_loss = 0.019109001, loss = 0.035816602 (812.650 sec) 2023/10/26 11:38:02 INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmpac24awf3/model.ckpt. 2023/10/26 11:38:20 INFO:tensorflow:global_step/sec: 0.119587 2023/10/26 11:38:20 INFO:tensorflow:loss = 0.03361716, step = 1000 (836.215 sec) 2023/10/26 11:38:20 INFO:tensorflow:ctctr_loss = 0.015267236, ctcvr_loss = 0.00011550254, ctr_loss = 0.018234422, loss = 0.03361716 (836.215 sec) 2023/10/26 11:52:15 INFO:tensorflow:global_step/sec: 0.119753 2023/10/26 11:52:15 INFO:tensorflow:loss = 0.02796927, step = 1100 (835.049 sec) 2023/10/26 11:52:15 INFO:tensorflow:ctctr_loss = 0.012550943, ctcvr_loss = 0.00030419877, ctr_loss = 0.0151141295, loss = 0.02796927 (835.049 sec) 2023/10/26 12:05:50 INFO:tensorflow:global_step/sec: 0.122703 2023/10/26 12:05:50 INFO:tensorflow:loss = 0.034918174, step = 1200 (814.979 sec) 2023/10/26 12:05:50 INFO:tensorflow:ctctr_loss = 0.015425372, ctcvr_loss = 0.00069285027, ctr_loss = 0.018799951, loss = 0.034918174 (814.979 sec) 2023/10/26 12:12:04 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer job-1111-2222-1.service-1111-2222.autopai.svc.cluster.local<34106> 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO transport/net_socket.cc:493 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO include/net.h:32 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO transport/net.cc:996 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO proxy.cc:494 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread] 2023/10/26 12:12:05 2023-10-26 12:12:05.022824: E hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc:461] NCCL communication aborted: unhandled system error 2023/10/26 12:12:05 job-1111-2222-2:292:698 [0] NCCL INFO comm 0x7f0e1b123750 rank 2 nranks 4 cudaDev 0 busId 3b000 - Abort COMPLETE 2023/10/26 12:12:05 2023/10/26 12:12:05 job-1111-2222-2:292:696 [0] misc/argcheck.cc:39 NCCL WARN AllGather : invalid root 0 (root should be in the 0..-1 range) 2023/10/26 12:12:05 job-1111-2222-2:292:696 [0] NCCL INFO enqueue.cc:1342 -> 4 2023/10/26 12:12:05 2023-10-26 12:12:05.655401: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_create.cc:112 : Internal: unhandled system error 2023/10/26 12:12:05 2023-10-26 12:12:05.655431: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_allgatherv.cc:130 : Internal: invalid argument 2023/10/26 12:12:18 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour 2023/10/26 12:12:18 warn(RuntimeWarning(msg)) 2023/10/26 12:12:18 Traceback (most recent call last): 2023/10/26 12:12:18 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main 2023/10/26 12:12:18 return _run_code(code, main_globals, None, 2023/10/26 12:12:18 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code 2023/10/26 12:12:18 exec(code, run_globals) 2023/10/26 12:12:18 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in 2023/10/26 12:12:18 launch([args.command] + args.args) 2023/10/26 12:12:18 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch 2023/10/26 12:12:18 subprocess.check_call(command) 2023/10/26 12:12:18 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call 2023/10/26 12:12:18 raise CalledProcessError(retcode, cmd) 2023/10/26 12:12:18 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=2', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>. 2023/10/26 12:12:19 start launching tensorflow job 2023/10/26 12:12:19 {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":1,"type":"worker"}} 2023/10/26 12:12:19 tf_args: ['--worker_index=2', '--worker_size=4', '--protocol=grpc'] 2023/10/26 12:12:19 TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":1,"type":"worker"}} 2023/10/26 12:12:19 Start to gen endpoint_map file. 2023/10/26 12:12:19 node_list: ['job-1111-2222-0.service-1111-2222:2222', 'job-1111-2222-1.service-1111-2222:2222', 'job-1111-2222-2.service-1111-2222:2222', 'job-1111-2222-3.service-1111-2222:2222'] 2023/10/26 12:12:19 run tensorflow command: cd ./model; COLLECTIVE_STRATEGY=hb $(which python) -m tensorflow.python.distribute.launch $(which python) -u run.py --worker_index=2 --worker_size=4 --protocol=grpc 2023/10/26 12:12:19 ret_code: 1 2023/10/26 12:12:19 train error code:0 2023/10/26 12:12:19 exec result:1 2023/10/26 12:12:19 fail\t2222\t1111\t-1\t-1 2023/10/26 12:12:19 updateStatusForSirius param:{"jobId":"1111","creator":"cmp","historyId":"2222","podIndex":"2","status":"fail"}

nccl-tests

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   3423 on job-1111-2222-0 device  0 [0x3b] Tesla V100-PCIE-16GB
#  Rank  1 Group  0 Pid   3423 on job-1111-2222-0 device  1 [0x5e] Tesla V100-PCIE-16GB
#  Rank  2 Group  0 Pid   3423 on job-1111-2222-0 device  2 [0xb1] Tesla V100-PCIE-16GB
#  Rank  3 Group  0 Pid   3423 on job-1111-2222-0 device  3 [0xd9] Tesla V100-PCIE-16GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    19.12    0.00    0.00      0    18.26    0.00    0.00      0
          16             4     float     sum      -1    17.72    0.00    0.00      0    19.53    0.00    0.00      0
          32             8     float     sum      -1    18.08    0.00    0.00      0    17.74    0.00    0.00      0
          64            16     float     sum      -1    18.59    0.00    0.01      0    17.65    0.00    0.01      0
         128            32     float     sum      -1    18.15    0.01    0.01      0    17.58    0.01    0.01      0
         256            64     float     sum      -1    19.89    0.01    0.02      0    18.57    0.01    0.02      0
         512           128     float     sum      -1    17.70    0.03    0.04      0    19.14    0.03    0.04      0
        1024           256     float     sum      -1    17.78    0.06    0.09      0    17.46    0.06    0.09      0
        2048           512     float     sum      -1    18.99    0.11    0.16      0    17.43    0.12    0.18      0
        4096          1024     float     sum      -1    18.63    0.22    0.33      0    17.65    0.23    0.35      0
        8192          2048     float     sum      -1    20.03    0.41    0.61      0    18.99    0.43    0.65      0
       16384          4096     float     sum      -1    22.03    0.74    1.12      0    21.80    0.75    1.13      0
       32768          8192     float     sum      -1    27.30    1.20    1.80      0    27.11    1.21    1.81      0
       65536         16384     float     sum      -1    39.13    1.67    2.51      0    38.73    1.69    2.54      0
      131072         32768     float     sum      -1    63.23    2.07    3.11      0    63.13    2.08    3.11      0
      262144         65536     float     sum      -1    84.84    3.09    4.63      0    84.94    3.09    4.63      0
      524288        131072     float     sum      -1    131.6    3.98    5.97      0    131.6    3.98    5.97      0
     1048576        262144     float     sum      -1    215.1    4.87    7.31      0    214.7    4.88    7.33      0
     2097152        524288     float     sum      -1    382.1    5.49    8.23      0    381.7    5.49    8.24      0
     4194304       1048576     float     sum      -1    718.0    5.84    8.76      0    717.5    5.85    8.77      0
     8388608       2097152     float     sum      -1   1400.6    5.99    8.98      0   1397.2    6.00    9.01      0
    16777216       4194304     float     sum      -1   2757.2    6.08    9.13      0   2758.8    6.08    9.12      0
    33554432       8388608     float     sum      -1   5530.7    6.07    9.10      0   5499.9    6.10    9.15      0
    67108864      16777216     float     sum      -1    10969    6.12    9.18      0    10978    6.11    9.17      0
   134217728      33554432     float     sum      -1    21873    6.14    9.20      0    21910    6.13    9.19      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.61665

gdb core file info

#0  0x00007fea58623a11 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::PropagatorState>::ProcessAsync(tensorflow::NodeItem const&, tensorflow::OpKernelContext::Params const&, tensorflow::PropagatorState::TaggedNode const&, tensorflow::Entry*, tensorflow::NodeExecStatsInterface*)::{lambda()#1}::operator()() const ()
   from /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#1  0x00007fe9d30b1da9 in std::function<void ()>::operator()() const (this=0x7fe7f800ee78) at /usr/include/c++/9/bits/std_function.h:683
#2  tensorflow::hybridbackend::CreateNcclCollectiveOp::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const (__closure=0x7fe7f800ee70) at hybridbackend/tensorflow/distribute/nccl/nccl_create.cc:112
#3  std::_Function_handler<void (), tensorflow::hybridbackend::CreateNcclCollectiveOp::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#4  0x00007fe9d30dc7f5 in std::function<void ()>::operator()() const (this=0x7fe800028380) at /usr/include/c++/9/bits/std_function.h:683
#5  tensorflow::hybridbackend::Stream::<lambda()>::operator() (__closure=0x7fe800028370) at hybridbackend/tensorflow/common/stream.cc:79
#6  std::_Function_handler<void(), tensorflow::hybridbackend::Stream::Launch(tensorflow::OpKernelContext*, std::function<void()>)::<lambda()> >::_M_invoke(const std::_Any_data &) (
    __functor=...) at /usr/include/c++/9/bits/std_function.h:300
#7  0x00007fea5873078d in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
   from /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#8  0x00007fea5872b74c in std::_Function_handler<void (), tensorflow::

Some additional information below.

The error is raised from here: https://github.com/DeepRec-AI/HybridBackend/blob/4486ba138515a1dbdb6f7d542d7ad23a27476524/hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc#L456

it is a ncclSystemError according to https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclResult_t

already asked the customer

turn on the debug: NCCL_DEBUG=INFO
set NCCL_SOCKET_IFNAME to eth
turn off firewall

@sjeaugey would you please help review and give some directions? thank you.

Somehow NCCL comm is aborted by the application. Do you know in which case Tensorflow will abort communicators (e.g. when time is out)?

Can you provide me the log with NCCL_DEBUG=INFO?

Somehow NCCL comm is aborted by the application. Do you know in which case Tensorflow will abort communicators (e.g. when time is out)?

Can you provide me the log with NCCL_DEBUG=INFO?

thanks for your reply the error will occur when rerun，every time it happens in a different step，cpu gpu and memory in the low utilization rate error code from：

train_spec = tf.estimator.TrainSpec(
            input_fn=make_input_fn(train_files, batch_size=Configs['batch_size'], num_epochs=Configs['epochs'],
                                   shuffle=Configs['shuffle']), max_steps=Configs['steps'])
eval_spec = tf.estimator.EvalSpec(
            input_fn=make_input_fn(eval_files, batch_size=Configs['batch_size'], num_epochs=1, shuffle=Configs['shuffle']))
estimator.train_and_evaluate(train_spec, eval_spec)

log with NCCL_DEBUG=INFO： chief:

2023/10/26 09:16:30 INFO:tensorflow:Done calling model_fn. 2023/10/26 09:16:30 INFO:tensorflow:Create CheckpointSaverHook. 2023/10/26 09:16:30 INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:../sug_ple_model/.incremental_checkpoint/incremental_model.ckpt 2023/10/26 09:16:49 INFO:tensorflow:Graph was finalized. 2023/10/26 09:16:49 INFO:tensorflow:run without loading checkpoint 2023/10/26 09:17:03 job-1111-2222-0:293:489 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.90<0> 2023/10/26 09:17:03 job-1111-2222-0:293:489 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so). 2023/10/26 09:17:03 job-1111-2222-0:293:489 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:04 2023-10-26 09:17:04.082979: I ./tensorflow/core/framework/ev_allocator.h:555] EVAllocator set arena array size: 128 2023/10/26 09:17:24 INFO:tensorflow:Running local_init_op. 2023/10/26 09:17:26 job-1111-2222-0:293:467 [0] NCCL INFO Bootstrap : Using eth0:10.241.120.90<0> 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.90<0> 2023/10/26 09:17:27 job-1111-2222-0:293:467 [0] NCCL INFO Using network Socket 2023/10/26 09:17:29 NCCL version 2.12.10+cuda11.6 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/02 : 0 1 2 3 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/02 : 0 1 2 3 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->1 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 3[b1000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/0 : 3[b1000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/0 : 0[d9000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Connected all rings 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 2[3b000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 0[d9000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO Connected all trees 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer 2023/10/26 09:17:29 job-1111-2222-0:293:763 [0] NCCL INFO comm 0x7fda2c015e20 rank 0 nranks 4 cudaDev 0 busId d9000 - Init COMPLETE 2023/10/26 09:17:29 INFO:tensorflow:Done running local_init_op. 2023/10/26 09:18:02 INFO:tensorflow:Saving checkpoints for 0 into ../sug_ple_model/model.ckpt. 2023/10/26 09:18:11 INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None 2023/10/26 09:19:04 2023-10-26 09:19:04.046545: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.049190: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.049300: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.049833: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 job-1111-2222-0:293:762 [0] NCCL INFO Launch mode Parallel 2023/10/26 09:19:04 2023-10-26 09:19:04.059969: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.063592: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.086145: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.088319: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.093784: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.094269: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.282315: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublas.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.176644: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublasLt.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.176751: I tensorflow/stream_executor/cuda/cuda_blas.cc:239] Create cuBLASLt success. 2023/10/26 09:19:05 INFO:tensorflow:loss = 0.1913051, step = 0 2023/10/26 09:19:05 INFO:tensorflow:ctctr_loss = 0.024605626, ctcvr_loss = 0.0008409239, ctr_loss = 0.16585855, loss = 0.1913051 2023/10/26 09:25:41 2023-10-26 09:25:41.450087: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled 2023/10/26 09:33:05 INFO:tensorflow:global_step/sec: 0.119048 2023/10/26 09:33:05 INFO:tensorflow:loss = 0.046956144, step = 100 (839.986 sec) 2023/10/26 09:33:05 INFO:tensorflow:ctctr_loss = 0.016574726, ctcvr_loss = 0.00023758807, ctr_loss = 0.03014383, loss = 0.046956144 (839.985 sec) 2023/10/26 09:46:52 INFO:tensorflow:global_step/sec: 0.120919 2023/10/26 09:46:52 INFO:tensorflow:loss = 0.03203261, step = 200 (827.000 sec) 2023/10/26 09:46:52 INFO:tensorflow:ctctr_loss = 0.013954056, ctcvr_loss = 0.00028846407, ctr_loss = 0.01779009, loss = 0.03203261 (827.000 sec) 2023/10/26 10:00:37 INFO:tensorflow:global_step/sec: 0.12121 2023/10/26 10:00:37 INFO:tensorflow:loss = 0.031852365, step = 300 (825.013 sec) 2023/10/26 10:00:37 INFO:tensorflow:ctctr_loss = 0.014548324, ctcvr_loss = 5.2825588e-05, ctr_loss = 0.017251218, loss = 0.031852365 (825.014 sec) 2023/10/26 10:15:51 INFO:tensorflow:global_step/sec: 0.109466 2023/10/26 10:15:51 INFO:tensorflow:loss = 0.024153532, step = 400 (913.527 sec) 2023/10/26 10:15:51 INFO:tensorflow:ctctr_loss = 0.010566762, ctcvr_loss = 0.0004321469, ctr_loss = 0.013154624, loss = 0.024153532 (913.527 sec) 2023/10/26 10:29:32 INFO:tensorflow:global_step/sec: 0.121771 2023/10/26 10:29:32 INFO:tensorflow:loss = 0.03749844, step = 500 (821.217 sec) 2023/10/26 10:29:32 INFO:tensorflow:ctctr_loss = 0.017164528, ctcvr_loss = 0.00012922783, ctr_loss = 0.020204684, loss = 0.03749844 (821.217 sec) 2023/10/26 10:43:21 INFO:tensorflow:global_step/sec: 0.120586 2023/10/26 10:43:21 INFO:tensorflow:loss = 0.031007526, step = 600 (829.286 sec) 2023/10/26 10:43:21 INFO:tensorflow:ctctr_loss = 0.014418589, ctcvr_loss = 0.00030601368, ctr_loss = 0.016282924, loss = 0.031007526 (829.286 sec) 2023/10/26 10:57:14 INFO:tensorflow:global_step/sec: 0.120088 2023/10/26 10:57:14 INFO:tensorflow:loss = 0.036426634, step = 700 (832.720 sec) 2023/10/26 10:57:14 INFO:tensorflow:ctctr_loss = 0.016776908, ctcvr_loss = 0.000521396, ctr_loss = 0.01912833, loss = 0.036426634 (832.720 sec) 2023/10/26 11:10:51 INFO:tensorflow:global_step/sec: 0.122373 2023/10/26 11:10:51 INFO:tensorflow:loss = 0.03228343, step = 800 (817.172 sec) 2023/10/26 11:10:51 INFO:tensorflow:ctctr_loss = 0.014936367, ctcvr_loss = 0.00016124125, ctr_loss = 0.017185822, loss = 0.03228343 (817.172 sec) 2023/10/26 11:24:24 INFO:tensorflow:global_step/sec: 0.123052 2023/10/26 11:24:24 INFO:tensorflow:loss = 0.032250684, step = 900 (812.666 sec) 2023/10/26 11:24:24 INFO:tensorflow:ctctr_loss = 0.014788659, ctcvr_loss = 0.0004015968, ctr_loss = 0.017060427, loss = 0.032250684 (812.665 sec) 2023/10/26 11:38:02 INFO:tensorflow:Saving checkpoints for 1000 into ../sug_ple_model/model.ckpt. 2023/10/26 11:38:20 INFO:tensorflow:global_step/sec: 0.119588 2023/10/26 11:38:20 INFO:tensorflow:loss = 0.033876635, step = 1000 (836.209 sec) 2023/10/26 11:38:20 INFO:tensorflow:ctctr_loss = 0.015491962, ctcvr_loss = 0.0004009509, ctr_loss = 0.017983723, loss = 0.033876635 (836.209 sec) 2023/10/26 11:52:15 INFO:tensorflow:global_step/sec: 0.119757 2023/10/26 11:52:15 INFO:tensorflow:loss = 0.03441868, step = 1100 (835.023 sec) 2023/10/26 11:52:15 INFO:tensorflow:ctctr_loss = 0.015090618, ctcvr_loss = 0.0005589067, ctr_loss = 0.018769156, loss = 0.03441868 (835.023 sec) 2023/10/26 12:05:50 INFO:tensorflow:global_step/sec: 0.122696 2023/10/26 12:05:50 INFO:tensorflow:loss = 0.03474028, step = 1200 (815.024 sec) 2023/10/26 12:05:50 INFO:tensorflow:ctctr_loss = 0.015631478, ctcvr_loss = 0.0003598568, ctr_loss = 0.018748946, loss = 0.03474028 (815.023 sec) 2023/10/26 12:12:05 2023/10/26 12:12:05 job-1111-2222-0:293:765 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer job-1111-2222-3.service-1111-2222.autopai.svc.cluster.local<53572> 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO transport/net_socket.cc:493 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO include/net.h:32 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO transport/net.cc:996 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO proxy.cc:494 -> 2 2023/10/26 12:12:06 job-1111-2222-0:293:765 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread] 2023/10/26 12:12:06 2023-10-26 12:12:06.704113: E hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc:461] NCCL communication aborted: unhandled system error 2023/10/26 12:12:07 job-1111-2222-0:293:763 [0] NCCL INFO comm 0x7fda2c015e20 rank 0 nranks 4 cudaDev 0 busId d9000 - Abort COMPLETE 2023/10/26 12:12:07 2023-10-26 12:12:07.399692: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_create.cc:112 : Internal: unhandled system error 2023/10/26 12:12:20 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour 2023/10/26 12:12:20 warn(RuntimeWarning(msg)) 2023/10/26 12:12:20 Traceback (most recent call last): 2023/10/26 12:12:20 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main 2023/10/26 12:12:20 return _run_code(code, main_globals, None, 2023/10/26 12:12:20 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code 2023/10/26 12:12:20 exec(code, run_globals) 2023/10/26 12:12:20 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in 2023/10/26 12:12:20 launch([args.command] + args.args) 2023/10/26 12:12:20 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch 2023/10/26 12:12:20 subprocess.check_call(command) 2023/10/26 12:12:20 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call 2023/10/26 12:12:20 raise CalledProcessError(retcode, cmd) 2023/10/26 12:12:20 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=0', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>. 2023/10/26 12:12:20 start launching tensorflow job 2023/10/26 12:12:20 {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"chief"}} 2023/10/26 12:12:20 tf_args: ['--worker_index=0', '--worker_size=4', '--protocol=grpc'] 2023/10/26 12:12:20 TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"chief"}} 2023/10/26 12:12:20 Start to gen endpoint_map file. 2023/10/26 12:12:20 node_list: ['job-1111-2222-0.service-1111-2222:2222', 'job-1111-2222-1.service-1111-2222:2222', 'job-1111-2222-2.service-1111-2222:2222', 'job-1111-2222-3.service-1111-2222:2222'] 2023/10/26 12:12:20 run tensorflow command: cd ./model; COLLECTIVE_STRATEGY=hb (which python) -u run.py --worker_index=0 --worker_size=4 --protocol=grpc 2023/10/26 12:12:20 ret_code: 1 2023/10/26 12:12:21 train error code:0 2023/10/26 12:12:21 exec result:1

worker:

2023/10/26 09:16:39 INFO:tensorflow:Done calling model_fn. 2023/10/26 09:16:39 INFO:tensorflow:Create CheckpointSaverHook. 2023/10/26 09:16:39 INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:/tmp/tmpac24awf3/.incremental_checkpoint/incremental_model.ckpt 2023/10/26 09:16:58 INFO:tensorflow:Graph was finalized. 2023/10/26 09:16:58 INFO:tensorflow:run without loading checkpoint 2023/10/26 09:17:10 2023-10-26 09:17:10.584288: I ./tensorflow/core/framework/ev_allocator.h:555] EVAllocator set arena array size: 128 2023/10/26 09:17:25 INFO:tensorflow:Running local_init_op. 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Bootstrap : Using eth0:10.241.120.67<0> 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.67<0> 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Using network Socket 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Trees [0] 1/3/-1->2->0 [1] -1/-1/-1->2->1 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 1[5e000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 3[b1000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 01/0 : 2[3b000] -> 3[b1000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Connected all rings 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 0[d9000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 3[b1000] -> 2[3b000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Channel 01/0 : 2[3b000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO Connected all trees 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer 2023/10/26 09:17:29 job-1111-2222-2:292:698 [0] NCCL INFO comm 0x7f0e1b123750 rank 2 nranks 4 cudaDev 0 busId 3b000 - Init COMPLETE 2023/10/26 09:17:29 INFO:tensorflow:Done running local_init_op. 2023/10/26 09:18:00 INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpac24awf3/model.ckpt. 2023/10/26 09:18:06 INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None 2023/10/26 09:19:04 2023-10-26 09:19:04.065744: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.065939: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.066213: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.066313: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.084363: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.084621: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.088631: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.111988: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.112148: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.112225: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.289962: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublas.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.170070: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublasLt.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.170152: I tensorflow/stream_executor/cuda/cuda_blas.cc:239] Create cuBLASLt success. 2023/10/26 09:19:05 INFO:tensorflow:loss = 0.19766834, step = 0 2023/10/26 09:19:05 INFO:tensorflow:ctctr_loss = 0.035726123, ctcvr_loss = 0.0010907989, ctr_loss = 0.16085143, loss = 0.19766834 2023/10/26 09:25:51 2023-10-26 09:25:51.287298: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled 2023/10/26 09:33:05 INFO:tensorflow:global_step/sec: 0.119053 2023/10/26 09:33:05 INFO:tensorflow:loss = 0.046693888, step = 100 (839.960 sec) 2023/10/26 09:33:05 INFO:tensorflow:ctctr_loss = 0.016436137, ctcvr_loss = 0.0005084026, ctr_loss = 0.029749347, loss = 0.046693888 (839.960 sec) 2023/10/26 09:46:52 INFO:tensorflow:global_step/sec: 0.120922 2023/10/26 09:46:52 INFO:tensorflow:loss = 0.029226346, step = 200 (826.981 sec) 2023/10/26 09:46:52 INFO:tensorflow:ctctr_loss = 0.012963342, ctcvr_loss = 0.00040178455, ctr_loss = 0.01586122, loss = 0.029226346 (826.981 sec) 2023/10/26 10:00:37 INFO:tensorflow:global_step/sec: 0.12121 2023/10/26 10:00:37 INFO:tensorflow:loss = 0.042071566, step = 300 (825.018 sec) 2023/10/26 10:00:37 INFO:tensorflow:ctctr_loss = 0.019438155, ctcvr_loss = 0.0003417362, ctr_loss = 0.022291675, loss = 0.042071566 (825.018 sec) 2023/10/26 10:15:51 INFO:tensorflow:global_step/sec: 0.109466 2023/10/26 10:15:51 INFO:tensorflow:loss = 0.034424614, step = 400 (913.527 sec) 2023/10/26 10:15:51 INFO:tensorflow:ctctr_loss = 0.015548481, ctcvr_loss = 0.0001899354, ctr_loss = 0.018686198, loss = 0.034424614 (913.527 sec) 2023/10/26 10:29:32 INFO:tensorflow:global_step/sec: 0.121767 2023/10/26 10:29:32 INFO:tensorflow:loss = 0.030513769, step = 500 (821.242 sec) 2023/10/26 10:29:32 INFO:tensorflow:ctctr_loss = 0.013715225, ctcvr_loss = 0.00022438697, ctr_loss = 0.016574156, loss = 0.030513769 (821.242 sec) 2023/10/26 10:43:21 INFO:tensorflow:global_step/sec: 0.120589 2023/10/26 10:43:21 INFO:tensorflow:loss = 0.02965854, step = 600 (829.263 sec) 2023/10/26 10:43:21 INFO:tensorflow:ctctr_loss = 0.013955852, ctcvr_loss = 0.00019742409, ctr_loss = 0.015505264, loss = 0.02965854 (829.263 sec) 2023/10/26 10:57:14 INFO:tensorflow:global_step/sec: 0.120092 2023/10/26 10:57:14 INFO:tensorflow:loss = 0.028629225, step = 700 (832.693 sec) 2023/10/26 10:57:14 INFO:tensorflow:ctctr_loss = 0.012916515, ctcvr_loss = 0.00010687506, ctr_loss = 0.015605834, loss = 0.028629225 (832.693 sec) 2023/10/26 11:10:51 INFO:tensorflow:global_step/sec: 0.12237 2023/10/26 11:10:51 INFO:tensorflow:loss = 0.03414993, step = 800 (817.193 sec) 2023/10/26 11:10:51 INFO:tensorflow:ctctr_loss = 0.015847603, ctcvr_loss = 0.00011763776, ctr_loss = 0.01818469, loss = 0.03414993 (817.193 sec) 2023/10/26 11:24:24 INFO:tensorflow:global_step/sec: 0.123054 2023/10/26 11:24:24 INFO:tensorflow:loss = 0.035816602, step = 900 (812.650 sec) 2023/10/26 11:24:24 INFO:tensorflow:ctctr_loss = 0.016396968, ctcvr_loss = 0.00031063452, ctr_loss = 0.019109001, loss = 0.035816602 (812.650 sec) 2023/10/26 11:38:02 INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmpac24awf3/model.ckpt. 2023/10/26 11:38:20 INFO:tensorflow:global_step/sec: 0.119587 2023/10/26 11:38:20 INFO:tensorflow:loss = 0.03361716, step = 1000 (836.215 sec) 2023/10/26 11:38:20 INFO:tensorflow:ctctr_loss = 0.015267236, ctcvr_loss = 0.00011550254, ctr_loss = 0.018234422, loss = 0.03361716 (836.215 sec) 2023/10/26 11:52:15 INFO:tensorflow:global_step/sec: 0.119753 2023/10/26 11:52:15 INFO:tensorflow:loss = 0.02796927, step = 1100 (835.049 sec) 2023/10/26 11:52:15 INFO:tensorflow:ctctr_loss = 0.012550943, ctcvr_loss = 0.00030419877, ctr_loss = 0.0151141295, loss = 0.02796927 (835.049 sec) 2023/10/26 12:05:50 INFO:tensorflow:global_step/sec: 0.122703 2023/10/26 12:05:50 INFO:tensorflow:loss = 0.034918174, step = 1200 (814.979 sec) 2023/10/26 12:05:50 INFO:tensorflow:ctctr_loss = 0.015425372, ctcvr_loss = 0.00069285027, ctr_loss = 0.018799951, loss = 0.034918174 (814.979 sec) 2023/10/26 12:12:04 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer job-1111-2222-1.service-1111-2222.autopai.svc.cluster.local<34106> 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO transport/net_socket.cc:493 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO include/net.h:32 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO transport/net.cc:996 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO proxy.cc:494 -> 2 2023/10/26 12:12:05 job-1111-2222-2:292:700 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread] 2023/10/26 12:12:05 2023-10-26 12:12:05.022824: E hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc:461] NCCL communication aborted: unhandled system error 2023/10/26 12:12:05 job-1111-2222-2:292:698 [0] NCCL INFO comm 0x7f0e1b123750 rank 2 nranks 4 cudaDev 0 busId 3b000 - Abort COMPLETE 2023/10/26 12:12:05 2023/10/26 12:12:05 job-1111-2222-2:292:696 [0] misc/argcheck.cc:39 NCCL WARN AllGather : invalid root 0 (root should be in the 0..-1 range) 2023/10/26 12:12:05 job-1111-2222-2:292:696 [0] NCCL INFO enqueue.cc:1342 -> 4 2023/10/26 12:12:05 2023-10-26 12:12:05.655401: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_create.cc:112 : Internal: unhandled system error 2023/10/26 12:12:05 2023-10-26 12:12:05.655431: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_allgatherv.cc:130 : Internal: invalid argument 2023/10/26 12:12:18 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour 2023/10/26 12:12:18 warn(RuntimeWarning(msg)) 2023/10/26 12:12:18 Traceback (most recent call last): 2023/10/26 12:12:18 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main 2023/10/26 12:12:18 return _run_code(code, main_globals, None, 2023/10/26 12:12:18 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code 2023/10/26 12:12:18 exec(code, run_globals) 2023/10/26 12:12:18 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in 2023/10/26 12:12:18 launch([args.command] + args.args) 2023/10/26 12:12:18 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch 2023/10/26 12:12:18 subprocess.check_call(command) 2023/10/26 12:12:18 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call 2023/10/26 12:12:18 raise CalledProcessError(retcode, cmd) 2023/10/26 12:12:18 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=2', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>. 2023/10/26 12:12:19 start launching tensorflow job 2023/10/26 12:12:19 {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":1,"type":"worker"}} 2023/10/26 12:12:19 tf_args: ['--worker_index=2', '--worker_size=4', '--protocol=grpc'] 2023/10/26 12:12:19 TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":1,"type":"worker"}} 2023/10/26 12:12:19 Start to gen endpoint_map file. 2023/10/26 12:12:19 node_list: ['job-1111-2222-0.service-1111-2222:2222', 'job-1111-2222-1.service-1111-2222:2222', 'job-1111-2222-2.service-1111-2222:2222', 'job-1111-2222-3.service-1111-2222:2222'] 2023/10/26 12:12:19 run tensorflow command: cd ./model; COLLECTIVE_STRATEGY=hb (which python) -u run.py --worker_index=2 --worker_size=4 --protocol=grpc 2023/10/26 12:12:19 ret_code: 1 2023/10/26 12:12:19 train error code:0 2023/10/26 12:12:19 exec result:1 2023/10/26 12:12:19 fail\t2222\t1111\t-1\t-1 2023/10/26 12:12:19 updateStatusForSirius param:{"jobId":"1111","creator":"cmp","historyId":"2222","podIndex":"2","status":"fail"}

It seems you are not showing me the full log. I only see the rank 0 and rank 2. Can you show me the other two as well? Are you running on a single node?

In addition, can you let me know the relation between chief and worker?

It seems you are not showing me the full log. I only see the rank 0 and rank 2. Can you show me the other two as well? Are you running on a single node?

In addition, can you let me know the relation between chief and worker?

training with 1chief and 3worker，use k8s create 4 pods on a single node，chief and worker is from the Tensorflow TF_CONFIG，

chief：TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"chief"}}

worker1：TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"worker"}}

worker2：TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":1,"type":"worker"}}

worker3：TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":2,"type":"worker"}}

other tow workers log:

worker 1:

10/26 09:16:36 INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:/tmp/tmp2j7svan9/.incremental_checkpoint/incremental_model.ckpt 2023/10/26 09:16:56 INFO:tensorflow:Graph was finalized. 2023/10/26 09:16:56 INFO:tensorflow:run without loading checkpoint 2023/10/26 09:17:10 2023-10-26 09:17:10.122664: I ./tensorflow/core/framework/ev_allocator.h:555] EVAllocator set arena array size: 128 2023/10/26 09:17:26 INFO:tensorflow:Running local_init_op. 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Bootstrap : Using eth0:10.241.120.101<0> 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.101<0> 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Using network Socket 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/0/-1->1->3 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 00/0 : 0[d9000] -> 1[5e000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 01/0 : 0[d9000] -> 1[5e000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 00/0 : 1[5e000] -> 2[3b000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 2[3b000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Connected all rings 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 01/0 : 3[b1000] -> 1[5e000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 3[b1000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 1[5e000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 01/0 : 2[3b000] -> 1[5e000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 0[d9000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO Connected all trees 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer 2023/10/26 09:17:29 job-1111-2222-1:292:748 [0] NCCL INFO comm 0x7fb80b123160 rank 1 nranks 4 cudaDev 0 busId 5e000 - Init COMPLETE 2023/10/26 09:17:29 INFO:tensorflow:Done running local_init_op. 2023/10/26 09:18:02 INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp2j7svan9/model.ckpt. 2023/10/26 09:18:06 INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None 2023/10/26 09:18:59 2023-10-26 09:18:59.116200: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:18:59 2023-10-26 09:18:59.117244: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:18:59 2023-10-26 09:18:59.117618: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:18:59 2023-10-26 09:18:59.118129: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.124608: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.127684: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.134733: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.135744: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.135877: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.136090: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.288254: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublas.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.176121: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublasLt.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.176228: I tensorflow/stream_executor/cuda/cuda_blas.cc:239] Create cuBLASLt success. 2023/10/26 09:19:05 INFO:tensorflow:loss = 0.1984115, step = 0 2023/10/26 09:19:05 INFO:tensorflow:ctctr_loss = 0.030469809, ctcvr_loss = 0.0011749854, ctr_loss = 0.1667667, loss = 0.1984115 2023/10/26 09:25:51 2023-10-26 09:25:51.291537: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled 2023/10/26 09:33:05 INFO:tensorflow:global_step/sec: 0.119055 2023/10/26 09:33:05 INFO:tensorflow:loss = 0.046763625, step = 100 (839.939 sec) 2023/10/26 09:33:05 INFO:tensorflow:ctctr_loss = 0.016671129, ctcvr_loss = 0.00032188167, ctr_loss = 0.029770616, loss = 0.046763625 (839.938 sec) 2023/10/26 09:46:52 INFO:tensorflow:global_step/sec: 0.120917 2023/10/26 09:46:52 INFO:tensorflow:loss = 0.033281263, step = 200 (827.017 sec) 2023/10/26 09:46:52 INFO:tensorflow:ctctr_loss = 0.014697228, ctcvr_loss = 0.0004189724, ctr_loss = 0.018165065, loss = 0.033281263 (827.017 sec) 2023/10/26 10:00:37 INFO:tensorflow:global_step/sec: 0.121212 2023/10/26 10:00:37 INFO:tensorflow:loss = 0.033239283, step = 300 (824.998 sec) 2023/10/26 10:00:37 INFO:tensorflow:ctctr_loss = 0.015475099, ctcvr_loss = 0.00031236862, ctr_loss = 0.017451813, loss = 0.033239283 (824.998 sec) 2023/10/26 10:15:51 INFO:tensorflow:global_step/sec: 0.109467 2023/10/26 10:15:51 INFO:tensorflow:loss = 0.028706813, step = 400 (913.520 sec) 2023/10/26 10:15:51 INFO:tensorflow:ctctr_loss = 0.012859647, ctcvr_loss = 0.00019824652, ctr_loss = 0.015648918, loss = 0.028706813 (913.520 sec) 2023/10/26 10:29:32 INFO:tensorflow:global_step/sec: 0.121768 2023/10/26 10:29:32 INFO:tensorflow:loss = 0.03453713, step = 500 (821.233 sec) 2023/10/26 10:29:32 INFO:tensorflow:ctctr_loss = 0.015618616, ctcvr_loss = 0.00038022772, ctr_loss = 0.018538283, loss = 0.03453713 (821.232 sec) 2023/10/26 10:43:21 INFO:tensorflow:global_step/sec: 0.120586 2023/10/26 10:43:21 INFO:tensorflow:loss = 0.035473514, step = 600 (829.286 sec) 2023/10/26 10:43:21 INFO:tensorflow:ctctr_loss = 0.015827775, ctcvr_loss = 0.0005388999, ctr_loss = 0.019106839, loss = 0.035473514 (829.286 sec) 2023/10/26 10:57:14 INFO:tensorflow:global_step/sec: 0.120092 2023/10/26 10:57:14 INFO:tensorflow:loss = 0.029336259, step = 700 (832.691 sec) 2023/10/26 10:57:14 INFO:tensorflow:ctctr_loss = 0.013747658, ctcvr_loss = 0.00010347301, ctr_loss = 0.0154851265, loss = 0.029336259 (832.691 sec) 2023/10/26 11:10:51 INFO:tensorflow:global_step/sec: 0.12237 2023/10/26 11:10:51 INFO:tensorflow:loss = 0.036820434, step = 800 (817.190 sec) 2023/10/26 11:10:51 INFO:tensorflow:ctctr_loss = 0.016582247, ctcvr_loss = 0.0004719695, ctr_loss = 0.019766219, loss = 0.036820434 (817.190 sec) 2023/10/26 11:24:24 INFO:tensorflow:global_step/sec: 0.12305 2023/10/26 11:24:24 INFO:tensorflow:loss = 0.035816178, step = 900 (812.681 sec) 2023/10/26 11:24:24 INFO:tensorflow:ctctr_loss = 0.01643834, ctcvr_loss = 0.00053072796, ctr_loss = 0.018847108, loss = 0.035816178 (812.681 sec) 2023/10/26 11:38:02 INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmp2j7svan9/model.ckpt. 2023/10/26 11:38:20 INFO:tensorflow:global_step/sec: 0.119591 2023/10/26 11:38:20 INFO:tensorflow:loss = 0.035530917, step = 1000 (836.183 sec) 2023/10/26 11:38:20 INFO:tensorflow:ctctr_loss = 0.016622433, ctcvr_loss = 0.00038243624, ctr_loss = 0.01852605, loss = 0.035530917 (836.183 sec) 2023/10/26 11:52:15 INFO:tensorflow:global_step/sec: 0.119754 2023/10/26 11:52:15 INFO:tensorflow:loss = 0.025787458, step = 1100 (835.042 sec) 2023/10/26 11:52:15 INFO:tensorflow:ctctr_loss = 0.011589902, ctcvr_loss = 0.00016119736, ctr_loss = 0.014036359, loss = 0.025787458 (835.042 sec) 2023/10/26 12:05:50 INFO:tensorflow:global_step/sec: 0.122703 2023/10/26 12:05:50 INFO:tensorflow:loss = 0.031997748, step = 1200 (814.978 sec) 2023/10/26 12:05:50 INFO:tensorflow:ctctr_loss = 0.014961975, ctcvr_loss = 0.00012216032, ctr_loss = 0.016913611, loss = 0.031997748 (814.978 sec) 2023/10/26 12:12:05 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour 2023/10/26 12:12:05 warn(RuntimeWarning(msg)) 2023/10/26 12:12:05 Traceback (most recent call last): 2023/10/26 12:12:05 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main 2023/10/26 12:12:05 return _run_code(code, main_globals, None, 2023/10/26 12:12:05 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code 2023/10/26 12:12:05 exec(code, run_globals) 2023/10/26 12:12:05 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in 2023/10/26 12:12:05 launch([args.command] + args.args) 2023/10/26 12:12:05 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch 2023/10/26 12:12:05 subprocess.check_call(command) 2023/10/26 12:12:05 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call 2023/10/26 12:12:05 raise CalledProcessError(retcode, cmd) 2023/10/26 12:12:05 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=1', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>. 2023/10/26 12:12:05 start launching tensorflow job 2023/10/26 12:12:05 {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"worker"}} 2023/10/26 12:12:05 tf_args: ['--worker_index=1', '--worker_size=4', '--protocol=grpc'] 2023/10/26 12:12:05 TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":0,"type":"worker"}} 2023/10/26 12:12:05 Start to gen endpoint_map file. 2023/10/26 12:12:05 node_list: ['job-1111-2222-0.service-1111-2222:2222', 'job-1111-2222-1.service-1111-2222:2222', 'job-1111-2222-2.service-1111-2222:2222', 'job-1111-2222-3.service-1111-2222:2222'] 2023/10/26 12:12:05 run tensorflow command: cd ./model; COLLECTIVE_STRATEGY=hb $(which python) -m tensorflow.python.distribute.launch $(which python) -u run.py --worker_index=1 --worker_size=4 --protocol=grpc 2023/10/26 12:12:05 ret_code: 1 2023/10/26 12:12:05 train error code:0 2023/10/26 12:12:05 exec result:1 2023/10/26 12:12:06 fail\t2222\t1111\t-1\t-1 2023/10/26 12:12:06 updateStatusForSirius param:{"jobId":"1111","creator":"cmp","historyId":"2222","podIndex":"1","status":"fail"}

worker3:

2023/10/26 09:16:45 INFO:tensorflow:Done calling model_fn. 2023/10/26 09:16:45 INFO:tensorflow:Create CheckpointSaverHook. 2023/10/26 09:16:46 INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:/tmp/tmp2x6s5j1p/.incremental_checkpoint/incremental_model.ckpt 2023/10/26 09:17:05 INFO:tensorflow:Graph was finalized. 2023/10/26 09:17:05 INFO:tensorflow:run without loading checkpoint 2023/10/26 09:17:17 2023-10-26 09:17:17.509419: I ./tensorflow/core/framework/ev_allocator.h:555] EVAllocator set arena array size: 128 2023/10/26 09:17:27 INFO:tensorflow:Running local_init_op. 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Bootstrap : Using eth0:10.241.120.83<0> 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.120.83<0> 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Using network Socket 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 1/-1/-1->3->-1 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 00/0 : 2[3b000] -> 3[b1000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 01/0 : 2[3b000] -> 3[b1000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 00/0 : 3[b1000] -> 0[d9000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 01/0 : 3[b1000] -> 0[d9000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Connected all rings 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 01/0 : 1[5e000] -> 3[b1000] [receive] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 01/0 : 3[b1000] -> 1[5e000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Channel 00/0 : 3[b1000] -> 2[3b000] [send] via NET/Socket/0 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO Connected all trees 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer 2023/10/26 09:17:29 job-1111-2222-3:292:749 [0] NCCL INFO comm 0x7f86b31237c0 rank 3 nranks 4 cudaDev 0 busId b1000 - Init COMPLETE 2023/10/26 09:17:29 INFO:tensorflow:Done running local_init_op. 2023/10/26 09:18:01 INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp2x6s5j1p/model.ckpt. 2023/10/26 09:18:06 INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None 2023/10/26 09:19:04 2023-10-26 09:19:04.111420: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.112510: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.114952: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.117856: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.119961: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.120083: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.122583: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.133342: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.133449: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.135369: W tensorflow/core/common_runtime/bfc_allocator.cc:242] Allocator (GPU_0_bfc) ran out of memory trying to allocate 0B with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2023/10/26 09:19:04 2023-10-26 09:19:04.280576: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublas.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.192753: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcublasLt.so.11 2023/10/26 09:19:05 2023-10-26 09:19:05.192856: I tensorflow/stream_executor/cuda/cuda_blas.cc:239] Create cuBLASLt success. 2023/10/26 09:19:05 INFO:tensorflow:loss = 0.2018823, step = 0 2023/10/26 09:19:05 INFO:tensorflow:ctctr_loss = 0.034875683, ctcvr_loss = 0.0012742024, ctr_loss = 0.16573241, loss = 0.2018823 2023/10/26 09:25:51 2023-10-26 09:25:51.291072: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled 2023/10/26 09:33:05 INFO:tensorflow:global_step/sec: 0.119056 2023/10/26 09:33:05 INFO:tensorflow:loss = 0.050051894, step = 100 (839.937 sec) 2023/10/26 09:33:05 INFO:tensorflow:ctctr_loss = 0.018538006, ctcvr_loss = 0.00054236903, ctr_loss = 0.030971522, loss = 0.050051894 (839.937 sec) 2023/10/26 09:46:52 INFO:tensorflow:global_step/sec: 0.120918 2023/10/26 09:46:52 INFO:tensorflow:loss = 0.0343645, step = 200 (827.009 sec) 2023/10/26 09:46:52 INFO:tensorflow:ctctr_loss = 0.015402321, ctcvr_loss = 0.00022376279, ctr_loss = 0.018738417, loss = 0.0343645 (827.009 sec) 2023/10/26 10:00:37 INFO:tensorflow:global_step/sec: 0.121208 2023/10/26 10:00:37 INFO:tensorflow:loss = 0.033952434, step = 300 (825.028 sec) 2023/10/26 10:00:37 INFO:tensorflow:ctctr_loss = 0.01584262, ctcvr_loss = 0.00032934497, ctr_loss = 0.017780468, loss = 0.033952434 (825.028 sec) 2023/10/26 10:15:51 INFO:tensorflow:global_step/sec: 0.109467 2023/10/26 10:15:51 INFO:tensorflow:loss = 0.029901836, step = 400 (913.519 sec) 2023/10/26 10:15:51 INFO:tensorflow:ctctr_loss = 0.013402802, ctcvr_loss = 0.00027564124, ctr_loss = 0.016223392, loss = 0.029901836 (913.519 sec) 2023/10/26 10:29:32 INFO:tensorflow:global_step/sec: 0.121769 2023/10/26 10:29:32 INFO:tensorflow:loss = 0.03641034, step = 500 (821.225 sec) 2023/10/26 10:29:32 INFO:tensorflow:ctctr_loss = 0.017152058, ctcvr_loss = 0.0003508015, ctr_loss = 0.018907476, loss = 0.03641034 (821.225 sec) 2023/10/26 10:43:21 INFO:tensorflow:global_step/sec: 0.120587 2023/10/26 10:43:21 INFO:tensorflow:loss = 0.02814669, step = 600 (829.274 sec) 2023/10/26 10:43:21 INFO:tensorflow:ctctr_loss = 0.012826523, ctcvr_loss = 5.248749e-05, ctr_loss = 0.01526768, loss = 0.02814669 (829.274 sec) 2023/10/26 10:57:14 INFO:tensorflow:global_step/sec: 0.12009 2023/10/26 10:57:14 INFO:tensorflow:loss = 0.03927169, step = 700 (832.709 sec) 2023/10/26 10:57:14 INFO:tensorflow:ctctr_loss = 0.017803224, ctcvr_loss = 0.0006114731, ctr_loss = 0.020856993, loss = 0.03927169 (832.709 sec) 2023/10/26 11:10:51 INFO:tensorflow:global_step/sec: 0.122369 2023/10/26 11:10:51 INFO:tensorflow:loss = 0.03692754, step = 800 (817.198 sec) 2023/10/26 11:10:51 INFO:tensorflow:ctctr_loss = 0.01682096, ctcvr_loss = 0.00046136754, ctr_loss = 0.019645214, loss = 0.03692754 (817.198 sec) 2023/10/26 11:24:24 INFO:tensorflow:global_step/sec: 0.123054 2023/10/26 11:24:24 INFO:tensorflow:loss = 0.036028706, step = 900 (812.648 sec) 2023/10/26 11:24:24 INFO:tensorflow:ctctr_loss = 0.016415253, ctcvr_loss = 0.00017042492, ctr_loss = 0.019443028, loss = 0.036028706 (812.648 sec) 2023/10/26 11:38:02 INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmp2x6s5j1p/model.ckpt. 2023/10/26 11:38:20 INFO:tensorflow:global_step/sec: 0.119583 2023/10/26 11:38:20 INFO:tensorflow:loss = 0.029550092, step = 1000 (836.239 sec) 2023/10/26 11:38:20 INFO:tensorflow:ctctr_loss = 0.013028747, ctcvr_loss = 0.00042268087, ctr_loss = 0.016098663, loss = 0.029550092 (836.239 sec) 2023/10/26 11:52:15 INFO:tensorflow:global_step/sec: 0.11976 2023/10/26 11:52:15 INFO:tensorflow:loss = 0.030254513, step = 1100 (835.002 sec) 2023/10/26 11:52:15 INFO:tensorflow:ctctr_loss = 0.013663508, ctcvr_loss = 0.00033414547, ctr_loss = 0.01625686, loss = 0.030254513 (835.002 sec) 2023/10/26 12:05:50 INFO:tensorflow:global_step/sec: 0.122703 2023/10/26 12:05:50 INFO:tensorflow:loss = 0.030352065, step = 1200 (814.975 sec) 2023/10/26 12:05:50 INFO:tensorflow:ctctr_loss = 0.013639426, ctcvr_loss = 0.00025419748, ctr_loss = 0.016458442, loss = 0.030352065 (814.975 sec) 2023/10/26 12:12:05 2023/10/26 12:12:05 job-1111-2222-3:292:751 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer job-1111-2222-2.service-1111-2222.autopai.svc.cluster.local<38422> 2023/10/26 12:12:05 job-1111-2222-3:292:751 [0] NCCL INFO transport/net_socket.cc:493 -> 2 2023/10/26 12:12:05 job-1111-2222-3:292:751 [0] NCCL INFO include/net.h:32 -> 2 2023/10/26 12:12:05 job-1111-2222-3:292:751 [0] NCCL INFO transport/net.cc:996 -> 2 2023/10/26 12:12:05 job-1111-2222-3:292:751 [0] NCCL INFO proxy.cc:494 -> 2 2023/10/26 12:12:05 job-1111-2222-3:292:751 [0] NCCL INFO proxy.cc:614 -> 2 [Proxy Thread] 2023/10/26 12:12:05 2023-10-26 12:12:05.828010: E hybridbackend/tensorflow/distribute/nccl/nccl_collective.cc:461] NCCL communication aborted: unhandled system error 2023/10/26 12:12:05 job-1111-2222-3:292:749 [0] NCCL INFO comm 0x7f86b31237c0 rank 3 nranks 4 cudaDev 0 busId b1000 - Abort COMPLETE 2023/10/26 12:12:06 2023-10-26 12:12:05.919497: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_create.cc:112 : Internal: unhandled system error 2023/10/26 12:12:19 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour 2023/10/26 12:12:19 warn(RuntimeWarning(msg)) 2023/10/26 12:12:19 Traceback (most recent call last): 2023/10/26 12:12:19 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main 2023/10/26 12:12:19 return _run_code(code, main_globals, None, 2023/10/26 12:12:19 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code 2023/10/26 12:12:19 exec(code, run_globals) 2023/10/26 12:12:19 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in 2023/10/26 12:12:19 launch([args.command] + args.args) 2023/10/26 12:12:19 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch 2023/10/26 12:12:19 subprocess.check_call(command) 2023/10/26 12:12:19 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call 2023/10/26 12:12:19 raise CalledProcessError(retcode, cmd) 2023/10/26 12:12:19 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=3', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>. 2023/10/26 12:12:19 start launching tensorflow job 2023/10/26 12:12:19 {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":2,"type":"worker"}} 2023/10/26 12:12:19 tf_args: ['--worker_index=3', '--worker_size=4', '--protocol=grpc'] 2023/10/26 12:12:19 TF_CONFIG= {"cluster":{"chief":["job-1111-2222-0.service-1111-2222:2222"],"worker":["job-1111-2222-1.service-1111-2222:2222","job-1111-2222-2.service-1111-2222:2222","job-1111-2222-3.service-1111-2222:2222"]},"task":{"index":2,"type":"worker"}} 2023/10/26 12:12:19 Start to gen endpoint_map file. 2023/10/26 12:12:19 node_list: ['job-1111-2222-0.service-1111-2222:2222', 'job-1111-2222-1.service-1111-2222:2222', 'job-1111-2222-2.service-1111-2222:2222', 'job-1111-2222-3.service-1111-2222:2222'] 2023/10/26 12:12:19 run tensorflow command: cd ./model; COLLECTIVE_STRATEGY=hb $(which python) -m tensorflow.python.distribute.launch $(which python) -u run.py --worker_index=3 --worker_size=4 --protocol=grpc 2023/10/26 12:12:19 ret_code: 1 2023/10/26 12:12:19 train error code:0 2023/10/26 12:12:19 exec result:1 2023/10/26 12:12:19 fail\t2222\t1111\t-1\t-1 2023/10/26 12:12:19 updateStatusForSirius param:{"jobId":"1111","creator":"cmp","historyId":"2222","podIndex":"3","status":"fail"}

See this log from worker1

2023/10/26 12:12:05 /usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'tensorflow.python.distribute.launch' found in sys.modules after import of package 'tensorflow.python.distribute', but prior to execution of 'tensorflow.python.distribute.launch'; this may result in unpredictable behaviour
2023/10/26 12:12:05 warn(RuntimeWarning(msg))
2023/10/26 12:12:05 Traceback (most recent call last):
2023/10/26 12:12:05 File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
2023/10/26 12:12:05 return _run_code(code, main_globals, None,
2023/10/26 12:12:05 File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
2023/10/26 12:12:05 exec(code, run_globals)
2023/10/26 12:12:05 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 319, in
2023/10/26 12:12:05 launch([args.command] + args.args)
2023/10/26 12:12:05 File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/distribute/launch.py", line 97, in launch
2023/10/26 12:12:05 subprocess.check_call(command)
2023/10/26 12:12:05 File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
2023/10/26 12:12:05 raise CalledProcessError(retcode, cmd)
2023/10/26 12:12:05 subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'run.py', '--worker_index=1', '--worker_size=4', '--protocol=grpc']' died with <Signals.SIGSEGV: 11>.

It is the error from tensorflow, which causes the process exits unexpectedly. I don't think this is NCCL issue.

NVIDIA / nccl