2.6.1版本下paddle.utils.run_check()的全量日志，最后是卡住了 Running verify PaddlePaddle program ... I1107 11:51:30.232810 564606 op_desc.cc:1108] CompileTime infer shape on fill_constant I1107 11:51:30.233856 564606 op_desc.cc:1108] CompileTime infer shape on uniform_random I1107 11:51:30.233916 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: uniform; inputs: ; attributes: shape, dtype, min, max, seed; outputs: Out I1107 11:51:30.235167 564606 op_desc.cc:1108] CompileTime infer shape on matmul_v2 I1107 11:51:30.235188 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: matmul; inputs: X, Y; attributes: trans_x, trans_y; outputs: Out I1107 11:51:30.235697 564606 op_desc.cc:1108] CompileTime infer shape on elementwise_add I1107 11:51:30.243238 564606 op_desc.cc:1108] CompileTime infer shape on reduce_sum I1107 11:51:30.243304 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: sum_raw; inputs: X; attributes: dim, keep_dim, reduce_all, out_dtype; outputs: Out I1107 11:51:30.244683 564606 pybind.cc:1530] need skip: 0 I1107 11:51:30.244801 564606 pybind.cc:1530] need skip: 0 I1107 11:51:30.244917 564606 pybind.cc:1530] need skip: 1 I1107 11:51:30.245568 564606 op_desc.cc:1108] CompileTime infer shape on fill_constant I1107 11:51:30.245656 564606 op_desc.cc:1108] CompileTime infer shape on reduce_sum_grad I1107 11:51:30.245669 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: sum_grad; inputs: X, Out@GRAD; attributes: dim, keep_dim, reduce_all; outputs: X@GRAD I1107 11:51:30.245771 564606 op_desc.cc:1108] CompileTime infer shape on elementwise_add_grad I1107 11:51:30.245865 564606 op_desc.cc:1108] CompileTime infer shape on matmul_v2_grad I1107 11:51:30.245873 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: matmul_grad; inputs: X, Y, Out@GRAD; attributes: trans_x, trans_y; outputs: X@GRAD, Y@GRAD I1107 11:51:30.248023 564606 conditional_block_op_helper.cc:108] Found conditional_block op num: 0, conditional_block_grad op num: 0 I1107 11:51:30.248044 564606 pylayer_op_helper.cc:106] Found pylayer op num: 0, pylayer_grad op num: 0 I1107 11:51:30.248056 564606 while_op_helper.cc:154] Found while op num: 0, while grad op num: 0 I1107 11:51:30.248067 564606 recurrent_op_helper.cc:259] Found recurrent op num: 0, recurrent grad op num: 0 I1107 11:51:30.416949 564606 program_interpreter.cc:212] New Executor is Running. I1107 11:51:30.416982 564606 interpreter_util.cc:1109] Creating Variables I1107 11:51:30.417013 564606 scope.cc:203] Create variable create_parameter_0.w_0 I1107 11:51:30.417055 564606 interpreter_util.cc:1141] Create Variable create_parameter_0.w_0 global, which pointer is 0x59b60a20e3e0 type is 7 I1107 11:51:30.417074 564606 scope.cc:203] Create variable create_parameter_1.w_0 I1107 11:51:30.417083 564606 interpreter_util.cc:1141] Create Variable create_parameter_1.w_0 global, which pointer is 0x59b609564400 type is 7 I1107 11:51:30.417090 564606 scope.cc:203] Create variable feed I1107 11:51:30.417099 564606 interpreter_util.cc:1141] Create Variable feed global, which pointer is 0x59b60a20e3c0 type is 9 I1107 11:51:30.417105 564606 scope.cc:203] Create variable fetch I1107 11:51:30.417109 564606 interpreter_util.cc:1141] Create Variable fetch global, which pointer is 0x59b60a215670 type is 10 I1107 11:51:30.417187 564606 interpreter_util.cc:572] Static build: 0 I1107 11:51:30.417196 564606 conditional_block_op_helper.cc:108] Found conditional_block op num: 0, conditional_block_grad op num: 0 I1107 11:51:30.417212 564606 pylayer_op_helper.cc:106] Found pylayer op num: 0, pylayer_grad op num: 0 I1107 11:51:30.417222 564606 while_op_helper.cc:154] Found while op num: 0, while grad op num: 0 I1107 11:51:30.417232 564606 recurrent_op_helper.cc:259] Found recurrent op num: 0, recurrent grad op num: 0 W1107 11:51:30.418675 564606 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0 I1107 11:51:30.419036 564606 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path. W1107 11:51:30.419503 564606 gpu_resources.cc:164] device: 0, cuDNN Version: 9.5. I1107 11:51:30.431505 564606 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path. I1107 11:51:30.431710 564606 operator.cc:2229] op type:fill_constant, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.431735 564606 interpreter_util.cc:821] fill_constant : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.432092 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee00000), and remaining 0 I1107 11:51:30.432133 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=6, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.433130 564606 operator.cc:2229] op type:uniform_random, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.433151 564606 interpreter_util.cc:821] uniform_random : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.433174 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: uniform; inputs: ; attributes: shape, dtype, min, max, seed; outputs: Out I1107 11:51:30.433239 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee00200), and remaining 0 I1107 11:51:30.433261 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=3, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.437081 564606 op_desc.cc:1108] CompileTime infer shape on fetch_v2 I1107 11:51:30.437336 564606 op_desc.cc:1108] CompileTime infer shape on fetch_v2 I1107 11:51:30.437868 564606 conditional_block_op_helper.cc:108] Found conditional_block op num: 0, conditional_block_grad op num: 0 I1107 11:51:30.437878 564606 pylayer_op_helper.cc:106] Found pylayer op num: 0, pylayer_grad op num: 0 I1107 11:51:30.437882 564606 while_op_helper.cc:154] Found while op num: 0, while grad op num: 0 I1107 11:51:30.437889 564606 recurrent_op_helper.cc:259] Found recurrent op num: 0, recurrent grad op num: 0 I1107 11:51:30.442024 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee00400), and remaining 0 I1107 11:51:30.442127 564606 feed_fetch_method.cc:54] SetFeedVariable name=feed index=0 I1107 11:51:30.442168 564606 interpreter_util.cc:1109] Creating Variables I1107 11:51:30.442178 564606 interpreter_util.cc:1141] Create Variable create_parameter_0.w_0 global, which pointer is 0x59b60a20e3e0 type is 7 I1107 11:51:30.442186 564606 scope.cc:203] Create variable create_parameter_0.w_0@GRAD I1107 11:51:30.442193 564606 interpreter_util.cc:1146] Create Variable create_parameter_0.w_0@GRAD locally, which pointer is 0x59b60aec6ff0 type is 7 I1107 11:51:30.442196 564606 interpreter_util.cc:1141] Create Variable create_parameter_1.w_0 global, which pointer is 0x59b609564400 type is 7 I1107 11:51:30.442214 564606 scope.cc:203] Create variable create_parameter_1.w_0@GRAD I1107 11:51:30.442216 564606 interpreter_util.cc:1146] Create Variable create_parameter_1.w_0@GRAD locally, which pointer is 0x59b60aedceb0 type is 7 I1107 11:51:30.442220 564606 interpreter_util.cc:1141] Create Variable feed global, which pointer is 0x59b60a20e3c0 type is 9 I1107 11:51:30.442224 564606 interpreter_util.cc:1141] Create Variable fetch global, which pointer is 0x59b60a215670 type is 10 I1107 11:51:30.442226 564606 scope.cc:203] Create variable input I1107 11:51:30.442231 564606 interpreter_util.cc:1146] Create Variable input locally, which pointer is 0x59b60aec1360 type is 7 I1107 11:51:30.442237 564606 scope.cc:203] Create variable linear_0.tmp_0 I1107 11:51:30.442242 564606 interpreter_util.cc:1146] Create Variable linear_0.tmp_0 locally, which pointer is 0x59b6086c2b00 type is 7 I1107 11:51:30.442250 564606 scope.cc:203] Create variable linear_0.tmp_0@GRAD I1107 11:51:30.442253 564606 interpreter_util.cc:1146] Create Variable linear_0.tmp_0@GRAD locally, which pointer is 0x59b60aec3f30 type is 7 I1107 11:51:30.442258 564606 scope.cc:203] Create variable linear_0.tmp_1 I1107 11:51:30.442261 564606 interpreter_util.cc:1146] Create Variable linear_0.tmp_1 locally, which pointer is 0x59b60ae9fb30 type is 7 I1107 11:51:30.442266 564606 scope.cc:203] Create variable linear_0.tmp_1@GRAD I1107 11:51:30.442270 564606 interpreter_util.cc:1146] Create Variable linear_0.tmp_1@GRAD locally, which pointer is 0x59b609553230 type is 7 I1107 11:51:30.442276 564606 scope.cc:203] Create variable sum_0.tmp_0 I1107 11:51:30.442278 564606 interpreter_util.cc:1146] Create Variable sum_0.tmp_0 locally, which pointer is 0x59b60aeb4610 type is 7 I1107 11:51:30.442283 564606 scope.cc:203] Create variable sum_0.tmp_0@GRAD I1107 11:51:30.442286 564606 interpreter_util.cc:1146] Create Variable sum_0.tmp_0@GRAD locally, which pointer is 0x59b60aec5e70 type is 7 I1107 11:51:30.442420 564606 interpreter_util.cc:572] Static build: 0 I1107 11:51:30.442426 564606 conditional_block_op_helper.cc:108] Found conditional_block op num: 0, conditional_block_grad op num: 0 I1107 11:51:30.442430 564606 pylayer_op_helper.cc:106] Found pylayer op num: 0, pylayer_grad op num: 0 I1107 11:51:30.442436 564606 while_op_helper.cc:154] Found while op num: 0, while grad op num: 0 I1107 11:51:30.442442 564606 recurrent_op_helper.cc:259] Found recurrent op num: 0, recurrent grad op num: 0 I1107 11:51:30.442524 564606 operator.cc:2229] op type:feed, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.442539 564606 interpreter_util.cc:821] feed : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.442620 564606 operator.cc:2229] op type:matmul_v2, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.442629 564606 interpreter_util.cc:821] matmul_v2 : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.442643 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: matmul; inputs: X, Y; attributes: trans_x, trans_y; outputs: Out I1107 11:51:30.442726 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee00600), and remaining 0 I1107 11:51:30.442739 564606 matmul_kernel_impl.h:374] MatMul's case 8 I1107 11:51:30.491533 564606 dynamic_loader.cc:227] Try to find library: libcublas.so from default system path. I1107 11:51:30.575994 564606 operator.cc:2229] op type:share_buffer, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.576023 564606 interpreter_util.cc:821] share_buffer : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.576045 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: share_buffer; inputs: X; attributes: share_dims_and_dtype; outputs: Out, XOut I1107 11:51:30.576133 564606 operator.cc:2229] op type:elementwise_add, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.576138 564606 interpreter_util.cc:821] elementwise_add : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.576208 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=6, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.576766 564606 operator.cc:2229] op type:reduce_sum, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.576774 564606 interpreter_util.cc:821] reduce_sum : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.576784 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: sum_raw; inputs: X; attributes: dim, keep_dim, reduce_all, out_dtype; outputs: Out I1107 11:51:30.576845 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee61400), and remaining 0 I1107 11:51:30.580919 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee61600), and remaining 0 I1107 11:51:30.582314 564606 operator.cc:2229] op type:fill_constant, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.582326 564606 interpreter_util.cc:821] fill_constant : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.582355 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.582370 564606 interpreter_util.cc:624] Standalone Executor is Used. I1107 11:51:30.582404 564606 operator.cc:2229] op type:reduce_sum_grad, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.582412 564606 interpreter_util.cc:821] reduce_sum_grad : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.582422 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: sum_grad; inputs: X, Out@GRAD; attributes: dim, keep_dim, reduce_all; outputs: X@GRAD I1107 11:51:30.582471 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=6, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.583714 564606 operator.cc:2229] op type:share_buffer, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.583725 564606 interpreter_util.cc:821] share_buffer : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.583736 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: share_buffer; inputs: X; attributes: share_dims_and_dtype; outputs: Out, XOut I1107 11:51:30.583791 564606 operator.cc:2229] op type:elementwise_add_grad, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.583796 564606 interpreter_util.cc:821] elementwise_add_grad : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.583912 564606 operator.cc:2229] op type:matmul_v2_grad, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.583920 564606 interpreter_util.cc:821] matmul_v2_grad : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.583928 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: matmul_grad; inputs: X, Y, Out@GRAD; attributes: trans_x, trans_y; outputs: X@GRAD, Y@GRAD I1107 11:51:30.584054 564606 operator.cc:2229] op type:fetch_v2, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(cpu)]; library_type[PLAIN]} I1107 11:51:30.584081 564606 interpreter_util.cc:821] fetch_v2 : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(cpu)]; library_type[PLAIN]} I1107 11:51:30.584095 564606 scope.cc:203] Create variable sum_0.tmp_0_device_Place(gpu:0)_Place(cpu) I1107 11:51:30.584102 564606 data_transfer.cc:398] Create Variable sum_0.tmp_0_device_Place(gpu:0)_Place(cpu) locally, which pointer is 0x59b60b0dea20Variable Type 7 I1107 11:51:30.584127 564606 data_transfer.cc:441] Insert memcpy_d2h with sum_0.tmp_0(Place(gpu:0)) -> sum_0.tmp_0_device_Place(gpu:0)_Place(cpu)(Place(cpu)). I1107 11:51:30.584151 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: memcpy_d2h; inputs: X; attributes: dst_place_type; outputs: Out I1107 11:51:30.584187 564606 operator.cc:2229] op type:memcpy_d2h, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.584208 564606 tensor_utils.cc:57] TensorCopy from Place(gpu:0) to Place(cpu) I1107 11:51:30.584283 564606 data_transfer.cc:234] Run memcpy_d2h done. I1107 11:51:30.584309 564606 fetch_v2_op.cc:143] Fetch variable sum_0.tmp_0_device_Place(gpu:0)_Place(cpu)'s 0 column. I1107 11:51:30.584329 564606 operator.cc:2229] op type:fetch_v2, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(cpu)]; library_type[PLAIN]} I1107 11:51:30.584334 564606 interpreter_util.cc:821] fetch_v2 : finally selected kernel_key: {data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(cpu)]; library_type[PLAIN]} I1107 11:51:30.584342 564606 scope.cc:203] Create variable create_parameter_0.w_0@GRAD_device_Place(gpu:0)_Place(cpu) I1107 11:51:30.584347 564606 data_transfer.cc:398] Create Variable create_parameter_0.w_0@GRAD_device_Place(gpu:0)_Place(cpu) locally, which pointer is 0x59b60b0e5b30Variable Type 7 I1107 11:51:30.584355 564606 data_transfer.cc:441] Insert memcpy_d2h with create_parameter_0.w_0@GRAD(Place(gpu:0)) -> create_parameter_0.w_0@GRAD_device_Place(gpu:0)_Place(cpu)(Place(cpu)). I1107 11:51:30.584363 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: memcpy_d2h; inputs: X; attributes: dst_place_type; outputs: Out I1107 11:51:30.584372 564606 operator.cc:2229] op type:memcpy_d2h, expected_kernel_key:{data_type[float]; data_layout[Undefined(AnyLayout)]; place[Place(gpu:0)]; library_type[PLAIN]} I1107 11:51:30.584381 564606 tensor_utils.cc:57] TensorCopy 2, 3 from Place(gpu:0) to Place(cpu) I1107 11:51:30.584401 564606 data_transfer.cc:234] Run memcpy_d2h done. I1107 11:51:30.584411 564606 fetch_v2_op.cc:143] Fetch variable create_parameter_0.w_0@GRAD_device_Place(gpu:0)_Place(cpu)'s 1 column. I1107 11:51:30.585007 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: matmul; inputs: X, Y; attributes: trans_x, trans_y; outputs: Out I1107 11:51:30.585031 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: share_buffer; inputs: X; attributes: share_dims_and_dtype; outputs: Out, XOut I1107 11:51:30.585059 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: share_buffer; inputs: X; attributes: share_dims_and_dtype; outputs: Out, XOut I1107 11:51:30.585079 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: memcpy_d2h; inputs: X; attributes: dst_place_type; outputs: Out I1107 11:51:30.585090 564606 infershape_utils.cc:546] BuildInferMetaContext: op kernel signature - Kernel Signature - name: memcpy_d2h; inputs: X; attributes: dst_place_type; outputs: Out I1107 11:51:30.585881 564606 pybind.cc:1791] Cannot use get_all_custom_device_type because you have installedCPU/GPU version PaddlePaddle. If you want to use get_all_custom_device_type, please try to install CustomDevice version PaddlePaddle by: pip install paddlepaddle I1107 11:51:30.586541 564606 eager.cc:119] Tensor(weight) have not GradNode, add GradNodeAccumulation0x59b60a27f680 for it. I1107 11:51:30.589195 564606 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144 I1107 11:51:30.589262 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.589282 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.589336 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=8, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.589560 564606 eager.cc:119] Tensor(bias) have not GradNode, add GradNodeAccumulation0x59b60d80d0f0 for it. I1107 11:51:30.589594 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.589602 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.589612 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=4, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.589725 564606 eager.cc:119] Tensor(generated_tensor_0) have not GradNode, add GradNodeAccumulation0x59b60ae9b1e0 for it. I1107 11:51:30.589814 564606 dygraph_functions.cc:62568] Running AD API: matmul I1107 11:51:30.589834 564606 dygraph_functions.cc:62630] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.589897 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee61800), and remaining 0 I1107 11:51:30.589905 564606 matmul_kernel_impl.h:374] MatMul's case 8 I1107 11:51:30.589951 564606 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from MatmulGradNode (addr: 0x59b60aeb9860) to GradNodeAccumulation (addr: 0x59b60a27f680) I1107 11:51:30.589967 564606 dygraph_functions.cc:52623] Running AD API: add I1107 11:51:30.589977 564606 dygraph_functions.cc:52695] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.590013 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee61a00), and remaining 0 I1107 11:51:30.590024 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=8, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.590035 564606 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from AddGradNode (addr: 0x59b609550510) to MatmulGradNode (addr: 0x59b60aeb9860) I1107 11:51:30.590041 564606 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from AddGradNode (addr: 0x59b609550510) to GradNodeAccumulation (addr: 0x59b60d80d0f0) I1107 11:51:30.590086 564606 dygraph_functions.cc:68459] Running AD API: sum I1107 11:51:30.590094 564606 dygraph_functions.cc:68515] { Input: [ ( x , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.590134 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee61c00), and remaining 0 I1107 11:51:30.590148 564606 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from SumGradNode (addr: 0x59b60950ac60) to AddGradNode (addr: 0x59b609550510) I1107 11:51:30.590214 564606 backward.cc:431] Run in Backward I1107 11:51:30.590219 564606 backward.cc:113] Start Backward I1107 11:51:30.590229 564606 backward.cc:196] Fill grad input tensor 0 with 1.0 I1107 11:51:30.590252 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.590271 564606 backward.cc:254] Preparing GradNode:SumGradNode addr:0x59b60950ac60 I1107 11:51:30.590279 564606 nodes.cc:39935] Running AD API GRAD: sum_grad I1107 11:51:30.590309 564606 nodes.cc:39985] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.590335 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee61e00), and remaining 0 I1107 11:51:30.590345 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=8, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.590358 564606 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:30.590364 564606 backward.cc:323] Node: SumGradNode addr:0x59b60950ac60, Found pending node: AddGradNode addr: 0x59b609550510 I1107 11:51:30.590370 564606 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:30.590379 564606 backward.cc:254] Preparing GradNode:AddGradNode addr:0x59b609550510 I1107 11:51:30.590389 564606 nodes.cc:31050] Running AD API GRAD: add_grad I1107 11:51:30.590404 564606 nodes.cc:31126] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.590433 564606 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:30.590438 564606 backward.cc:323] Node: AddGradNode addr:0x59b609550510, Found pending node: MatmulGradNode addr: 0x59b60aeb9860 I1107 11:51:30.590441 564606 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:30.590446 564606 backward.cc:323] Node: AddGradNode addr:0x59b609550510, Found pending node: GradNodeAccumulation addr: 0x59b60d80d0f0 I1107 11:51:30.590451 564606 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:30.590453 564606 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x59b60d80d0f0 I1107 11:51:30.590459 564606 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation I1107 11:51:30.590466 564606 accumulation_node.cc:40] Move Tensor ptr: 0x59b60d810630 I1107 11:51:30.590467 564606 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation I1107 11:51:30.590472 564606 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:30.590477 564606 backward.cc:254] Preparing GradNode:MatmulGradNode addr:0x59b60aeb9860 I1107 11:51:30.590487 564606 nodes.cc:35691] Running AD API GRAD: matmul_grad I1107 11:51:30.590498 564606 nodes.cc:35748] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.590523 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee62000), and remaining 0 I1107 11:51:30.590556 564606 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:30.590561 564606 backward.cc:323] Node: MatmulGradNode addr:0x59b60aeb9860, Found pending node: GradNodeAccumulation addr: 0x59b60a27f680 I1107 11:51:30.590564 564606 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:30.590569 564606 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x59b60a27f680 I1107 11:51:30.590574 564606 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation I1107 11:51:30.590579 564606 accumulation_node.cc:40] Move Tensor ptr: 0x59b60922e540 I1107 11:51:30.590582 564606 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation I1107 11:51:30.590586 564606 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:30.591015 564606 eager.cc:119] Tensor(learning_rate_0) have not GradNode, add GradNodeAccumulation0x59b60aede100 for it. I1107 11:51:30.591050 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591058 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591069 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.591156 564606 eager.cc:119] Tensor(weight_moment1_0) have not GradNode, add GradNodeAccumulation0x59b60aee09a0 for it. I1107 11:51:30.591197 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591210 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591224 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee62200), and remaining 0 I1107 11:51:30.591231 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=8, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.591259 564606 eager.cc:119] Tensor(weight_moment2_0) have not GradNode, add GradNodeAccumulation0x59b60ae9d1f0 for it. I1107 11:51:30.591292 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591298 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591310 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee62400), and remaining 0 I1107 11:51:30.591315 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=8, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.591343 564606 eager.cc:119] Tensor(weight_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x59b60d4a6080 for it. I1107 11:51:30.591379 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591387 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591426 564606 eager.cc:119] Tensor(weight_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x59b60ae9b890 for it. I1107 11:51:30.591440 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591445 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591480 564606 eager.cc:119] Tensor(bias_moment1_0) have not GradNode, add GradNodeAccumulation0x59b6095587b0 for it. I1107 11:51:30.591517 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591523 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591534 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee62600), and remaining 0 I1107 11:51:30.591542 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=4, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.591567 564606 eager.cc:119] Tensor(bias_moment2_0) have not GradNode, add GradNodeAccumulation0x59b609553940 for it. I1107 11:51:30.591599 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591605 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591616 564606 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7778fee62800), and remaining 0 I1107 11:51:30.591622 564606 gpu_launch_config.h:156] Get 1-D launch config: numel=4, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:30.591648 564606 eager.cc:119] Tensor(bias_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x59b60954b750 for it. I1107 11:51:30.591660 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591667 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591687 564606 eager.cc:119] Tensor(bias_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x59b60c932a30 for it. I1107 11:51:30.591698 564606 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:30.591704 564606 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:30.591753 564606 dygraphfunctions.cc:2692] Running AD API: adam I1107 11:51:30.591773 564606 dygraph_functions.cc:2777] { Input: [ ( param , [[ Not specified tensor log level ]]),
( grad , [[ Not specified tensor log level ]]),
( learning_rate , [[ Not specified tensor log level ]]),
( moment1 , [[ Not specified tensor log level ]]),
( moment2 , [[ Not specified tensor log level ]]),
( beta1_pow , [[ Not specified tensor log level ]]),
( beta2_pow , [[ Not specified tensor log level ]]),
( master_param , [{ UnDefinedTensor }]),
( skip_update , [{ UnDefinedTensor }]), ]} I1107 11:51:30.591802 564606 multiary.cc:184] dims of Beta1Pow : [1] I1107 11:51:30.591807 564606 multiary.cc:192] dims of Beta2Pow : [1] I1107 11:51:30.591814 564606 adam_kernel.cu:187] beta1_pow.numel() : 1beta2_pow.numel() : 1 I1107 11:51:30.591820 564606 adam_kernel.cu:189] param.numel(): 8 I1107 11:51:30.592195 564606 dygraphfunctions.cc:2692] Running AD API: adam I1107 11:51:30.592226 564606 dygraph_functions.cc:2777] { Input: [ ( param , [[ Not specified tensor log level ]]),
( grad , [[ Not specified tensor log level ]]),
( learning_rate , [[ Not specified tensor log level ]]),
( moment1 , [[ Not specified tensor log level ]]),
( moment2 , [[ Not specified tensor log level ]]),
( beta1_pow , [[ Not specified tensor log level ]]),
( beta2_pow , [[ Not specified tensor log level ]]),
( master_param , [{ UnDefinedTensor }]),
( skip_update , [{ UnDefinedTensor }]), ]} I1107 11:51:30.592237 564606 multiary.cc:184] dims of Beta1Pow : [1] I1107 11:51:30.592242 564606 multiary.cc:192] dims of Beta2Pow : [1] I1107 11:51:30.592245 564606 adam_kernel.cu:187] beta1_pow.numel() : 1beta2_pow.numel() : 1 I1107 11:51:30.592249 564606 adam_kernel.cu:189] param.numel(): 4 PaddlePaddle works well on 1 GPU. WARNING: Logging before InitGoogleLogging() is written to STDERR I1107 11:51:31.068277 564864 dynamic_loader.cc:205] Set paddle lib path : /root/.pyenv/versions/3.9.20/envs/gzc_paddle/lib/python3.9/site-packages/paddle/libs WARNING: Logging before InitGoogleLogging() is written to STDERR I1107 11:51:31.068718 564862 dynamic_loader.cc:205] Set paddle lib path : /root/.pyenv/versions/3.9.20/envs/gzc_paddle/lib/python3.9/site-packages/paddle/libs WARNING: Logging before InitGoogleLogging() is written to STDERR I1107 11:51:31.083002 564858 dynamic_loader.cc:205] Set paddle lib path : /root/.pyenv/versions/3.9.20/envs/gzc_paddle/lib/python3.9/site-packages/paddle/libs WARNING: Logging before InitGoogleLogging() is written to STDERR I1107 11:51:31.093663 564860 dynamic_loader.cc:205] Set paddle lib path : /root/.pyenv/versions/3.9.20/envs/gzc_paddle/lib/python3.9/site-packages/paddle/libs I1107 11:51:31.418996 564862 init.cc:97] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=gpu_allocator_retry_time,max_inplace_grad_add,benchmark,sync_nccl_allreduce,sort_sum_gradient,pir_apply_inplace_pass,tensor_operants_mode,inner_op_parallelism,enable_graph_multi_node_sampling,use_fast_math,enable_pir_with_pt_in_dy2st,initial_cpu_memory_in_mb,rocksdb_path,enable_all2all_use_fp16,gpugraph_parallel_stream_num,gpu_memory_limit_mb,print_ir,multiple_of_cupti_buffer_size,gpugraph_enable_hbm_table_collision_stat,query_dest_rank_by_multi_node,use_virtual_memory_auto_growth,enable_adjust_op_order,local_exe_sub_scope_limit,graph_get_neighbor_id,allocator_strategy,graph_edges_split_debug,fuse_parameter_groups_size,sync_after_alloc,apply_pass_to_program,call_stack_level,graph_neighbor_size_percent,cache_inference_while_scope,gpugraph_offload_gather_copy_maxsize,enable_auto_rdma_trans,multi_node_sample_use_gpu_table,cudnn_exhaustive_search,communicator_max_merge_var_num,use_autotune,paddle_num_threads,cudnn_exhaustive_search_times,use_pinned_memory,fast_eager_deletion_mode,new_executor_serial_run,dynamic_static_unified_comm,get_host_by_name_time,gpugraph_storage_mode,gpugraph_force_device_batch_num_equal,benchmark_nccl,tracer_mkldnn_ops_on,ir_inplace_kernel_blacklist,run_kp_kernel,check_nan_inf_level,enable_auto_detect_gpu_topo,use_stride_kernel,cudnn_batchnorm_spatial_persistent,einsum_opt,memory_fraction_of_eager_deletion,fraction_of_cpu_memory_to_use,enable_gpu_memory_usage_log,graph_edges_split_only_by_src_id,gpugraph_hbm_table_load_factor,use_stream_safe_cuda_allocator,new_executor_use_cuda_graph,gpugraph_parallel_copyer_split_maxsize,selected_gpus,enable_opt_get_features,fraction_of_gpu_memory_to_use,communicator_send_queue_size,gpugraph_sparse_table_storage_mode,use_auto_growth_pinned_allocator,enable_api_kernel_fallback,enable_neighbor_list_use_uva,jit_engine_type,pir_subgraph_saving_dir,dist_threadpool_size,npu_storage_format,enable_async_trace,enable_cublas_tensor_op_math,nccl_blocking_wait,gpugraph_enable_segment_merge_grads,check_kernel_launch,fleet_executor_with_standalone,log_memory_stats,static_executor_perfstat_filepath,gpugraph_offload_param_stat,search_cache_max_number,executor_log_deps_every_microseconds,reader_queue_speed_test_mode,eager_delete_scope,new_executor_sequential_run,convert_all_blocks,set_to_1d,graph_metapath_split_opt,graph_edges_debug_node_num,enable_record_memory,auto_growth_chunk_size_in_mb,communicator_is_sgd_optimizer,enable_tracker_all2all,embedding_deterministic,new_executor_use_inplace,new_executor_use_local_scope,prim_enabled,free_when_no_cache_hit,async_trace_count,enable_sparse_inner_gather,low_precision_op_list,free_idle_chunk,enable_pir_in_executor_trace_run,graph_load_in_parallel,use_cuda_managed_memory,cudnn_deterministic,gpugraph_offload_param_extends,tracer_mkldnn_ops_off,init_allocated_mem,enable_dependency_builder_debug_info,enable_dump_main_program,reallocate_gpu_memory_in_mb,gpugraph_slot_feasign_max_num,pe_profile_fname,use_mkldnn,enable_gpu_memory_usage_log_mb,gpugraph_load_node_list_into_hbm,gemm_use_half_precision_compute_type,enable_exit_when_partial_worker,use_system_allocator,print_allocator_trace_info,conv_workspace_size_limit,trt_ibuilder_cache,gpugraph_enable_gpu_direct_access,check_nan_inf,gpugraph_dedup_pull_push_mode,use_shm_cache,gpugraph_debug_gpu_memory,add_dependency_for_communication_op,fuse_parameter_memory_size,tracer_profile_fname,fraction_of_cuda_pinned_memory_to_use,initial_gpu_memory_in_mb,alloc_fill_value,conv2d_disable_cudnn,host_trace_level,allreduce_record_one_event,cpu_deterministic,graph_edges_split_mode,rpc_send_thread_num,gpugraph_enable_print_op_debug,enable_unused_var_check,cublaslt_exhaustive_search_times,enable_pir_api,dygraph_debug,graph_edges_debug_node_id,graph_embedding_split_infer_mode,enable_pir_in_executor,gpugraph_merge_grads_segment_size,eager_delete_tensor_gb,new_executor_static_build,print_sub_graph_dir I1107 11:51:31.419076 564862 init.cc:105] After Parse: argc is 2 I1107 11:51:31.421170 564864 init.cc:97] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=log_memory_stats,pir_subgraph_saving_dir,tensor_operants_mode,enable_dump_main_program,enable_opt_get_features,print_sub_graph_dir,gpugraph_slot_feasign_max_num,gpugraph_enable_print_op_debug,get_host_by_name_time,check_nan_inf,async_trace_count,gpugraph_enable_segment_merge_grads,new_executor_use_inplace,use_auto_growth_pinned_allocator,enable_pir_with_pt_in_dy2st,enable_neighbor_list_use_uva,gpugraph_offload_gather_copy_maxsize,print_allocator_trace_info,gpugraph_force_device_batch_num_equal,enable_exit_when_partial_worker,graph_embedding_split_infer_mode,eager_delete_tensor_gb,fuse_parameter_memory_size,fraction_of_gpu_memory_to_use,use_system_allocator,eager_delete_scope,graph_edges_split_mode,gpugraph_storage_mode,use_stream_safe_cuda_allocator,cudnn_exhaustive_search_times,multiple_of_cupti_buffer_size,cudnn_exhaustive_search,gpugraph_load_node_list_into_hbm,new_executor_sequential_run,cudnn_batchnorm_spatial_persistent,sync_after_alloc,conv_workspace_size_limit,enable_pir_in_executor_trace_run,alloc_fill_value,fraction_of_cpu_memory_to_use,enable_adjust_op_order,auto_growth_chunk_size_in_mb,enable_sparse_inner_gather,gpugraph_parallel_copyer_split_maxsize,run_kp_kernel,tracer_mkldnn_ops_off,set_to_1d,selected_gpus,allreduce_record_one_event,enable_all2all_use_fp16,fleet_executor_with_standalone,enable_auto_detect_gpu_topo,search_cache_max_number,pe_profile_fname,jit_engine_type,gpugraph_dedup_pull_push_mode,new_executor_use_cuda_graph,ir_inplace_kernel_blacklist,communicator_max_merge_var_num,reallocate_gpu_memory_in_mb,convert_all_blocks,free_when_no_cache_hit,graph_edges_split_only_by_src_id,communicator_send_queue_size,enable_pir_api,apply_pass_to_program,max_inplace_grad_add,tracer_profile_fname,paddle_num_threads,add_dependency_for_communication_op,gemm_use_half_precision_compute_type,gpugraph_enable_gpu_direct_access,benchmark_nccl,use_autotune,fraction_of_cuda_pinned_memory_to_use,graph_load_in_parallel,use_pinned_memory,local_exe_sub_scope_limit,gpu_memory_limit_mb,fuse_parameter_groups_size,enable_api_kernel_fallback,graph_edges_debug_node_num,enable_dependency_builder_debug_info,inner_op_parallelism,enable_gpu_memory_usage_log,gpugraph_offload_param_extends,conv2d_disable_cudnn,cublaslt_exhaustive_search_times,cudnn_deterministic,gpugraph_merge_grads_segment_size,prim_enabled,enable_tracker_all2all,check_kernel_launch,enable_unused_var_check,executor_log_deps_every_microseconds,fast_eager_deletion_mode,gpugraph_sparse_table_storage_mode,graph_edges_split_debug,gpugraph_hbm_table_load_factor,sort_sum_gradient,dist_threadpool_size,cache_inference_while_scope,dygraph_debug,low_precision_op_list,new_executor_use_local_scope,enable_gpu_memory_usage_log_mb,enable_cublas_tensor_op_math,enable_async_trace,new_executor_static_build,use_cuda_managed_memory,static_executor_perfstat_filepath,nccl_blocking_wait,new_executor_serial_run,cpu_deterministic,npu_storage_format,use_shm_cache,sync_nccl_allreduce,use_virtual_memory_auto_growth,multi_node_sample_use_gpu_table,graph_get_neighbor_id,memory_fraction_of_eager_deletion,reader_queue_speed_test_mode,init_allocated_mem,gpugraph_offload_param_stat,benchmark,graph_neighbor_size_percent,pir_apply_inplace_pass,host_trace_level,initial_gpu_memory_in_mb,enable_auto_rdma_trans,check_nan_inf_level,enable_graph_multi_node_sampling,free_idle_chunk,graph_edges_debug_node_id,rpc_send_thread_num,print_ir,communicator_is_sgd_optimizer,enable_record_memory,dynamic_static_unified_comm,gpugraph_debug_gpu_memory,enable_pir_in_executor,allocator_strategy,embedding_deterministic,tracer_mkldnn_ops_on,use_mkldnn,gpu_allocator_retry_time,trt_ibuilder_cache,gpugraph_parallel_stream_num,use_fast_math,call_stack_level,initial_cpu_memory_in_mb,gpugraph_enable_hbm_table_collision_stat,graph_metapath_split_opt,use_stride_kernel,rocksdb_path,query_dest_rank_by_multi_node,einsum_opt I1107 11:51:31.421259 564864 init.cc:105] After Parse: argc is 2 I1107 11:51:31.429437 564858 init.cc:97] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=use_fast_math,free_idle_chunk,convert_all_blocks,enable_pir_in_executor_trace_run,enable_all2all_use_fp16,eager_delete_scope,use_mkldnn,gpugraph_slot_feasign_max_num,use_autotune,use_system_allocator,new_executor_use_inplace,apply_pass_to_program,sync_nccl_allreduce,enable_auto_rdma_trans,use_auto_growth_pinned_allocator,dygraph_debug,cudnn_deterministic,graph_neighbor_size_percent,communicator_max_merge_var_num,enable_opt_get_features,communicator_send_queue_size,tracer_mkldnn_ops_on,max_inplace_grad_add,graph_get_neighbor_id,async_trace_count,tensor_operants_mode,einsum_opt,print_allocator_trace_info,enable_dump_main_program,gpugraph_force_device_batch_num_equal,embedding_deterministic,query_dest_rank_by_multi_node,graph_edges_split_only_by_src_id,free_when_no_cache_hit,gpugraph_enable_gpu_direct_access,executor_log_deps_every_microseconds,npu_storage_format,init_allocated_mem,gpugraph_dedup_pull_push_mode,new_executor_serial_run,gpugraph_offload_param_extends,gpugraph_parallel_copyer_split_maxsize,add_dependency_for_communication_op,enable_exit_when_partial_worker,fleet_executor_with_standalone,enable_api_kernel_fallback,paddle_num_threads,use_shm_cache,new_executor_use_cuda_graph,static_executor_perfstat_filepath,new_executor_sequential_run,search_cache_max_number,cpu_deterministic,rpc_send_thread_num,graph_metapath_split_opt,ir_inplace_kernel_blacklist,gpugraph_debug_gpu_memory,host_trace_level,allreduce_record_one_event,gpugraph_enable_hbm_table_collision_stat,gpugraph_enable_print_op_debug,multiple_of_cupti_buffer_size,enable_graph_multi_node_sampling,fraction_of_cpu_memory_to_use,run_kp_kernel,enable_sparse_inner_gather,eager_delete_tensor_gb,get_host_by_name_time,gpugraph_hbm_table_load_factor,enable_gpu_memory_usage_log,fuse_parameter_memory_size,new_executor_static_build,reallocate_gpu_memory_in_mb,enable_pir_in_executor,gpugraph_offload_param_stat,graph_edges_split_mode,cudnn_exhaustive_search_times,enable_async_trace,set_to_1d,fuse_parameter_groups_size,nccl_blocking_wait,cudnn_batchnorm_spatial_persistent,reader_queue_speed_test_mode,conv2d_disable_cudnn,memory_fraction_of_eager_deletion,enable_unused_var_check,trt_ibuilder_cache,gemm_use_half_precision_compute_type,initial_gpu_memory_in_mb,conv_workspace_size_limit,enable_cublas_tensor_op_math,pir_subgraph_saving_dir,gpugraph_parallel_stream_num,tracer_profile_fname,use_stream_safe_cuda_allocator,gpugraph_storage_mode,prim_enabled,use_pinned_memory,gpu_allocator_retry_time,use_cuda_managed_memory,check_nan_inf,auto_growth_chunk_size_in_mb,enable_pir_api,dist_threadpool_size,fraction_of_gpu_memory_to_use,use_virtual_memory_auto_growth,gpugraph_enable_segment_merge_grads,tracer_mkldnn_ops_off,graph_embedding_split_infer_mode,fast_eager_deletion_mode,inner_op_parallelism,communicator_is_sgd_optimizer,check_kernel_launch,graph_edges_split_debug,gpu_memory_limit_mb,fraction_of_cuda_pinned_memory_to_use,check_nan_inf_level,graph_edges_debug_node_id,cache_inference_while_scope,cudnn_exhaustive_search,enable_dependency_builder_debug_info,alloc_fill_value,gpugraph_offload_gather_copy_maxsize,rocksdb_path,gpugraph_merge_grads_segment_size,benchmark_nccl,log_memory_stats,dynamic_static_unified_comm,enable_auto_detect_gpu_topo,call_stack_level,pe_profile_fname,enable_gpu_memory_usage_log_mb,low_precision_op_list,sort_sum_gradient,enable_tracker_all2all,enable_pir_with_pt_in_dy2st,print_sub_graph_dir,enable_record_memory,cublaslt_exhaustive_search_times,pir_apply_inplace_pass,local_exe_sub_scope_limit,enable_neighbor_list_use_uva,use_stride_kernel,gpugraph_load_node_list_into_hbm,new_executor_use_local_scope,jit_engine_type,sync_after_alloc,enable_adjust_op_order,selected_gpus,initial_cpu_memory_in_mb,multi_node_sample_use_gpu_table,gpugraph_sparse_table_storage_mode,allocator_strategy,graph_load_in_parallel,benchmark,graph_edges_debug_node_num,print_ir I1107 11:51:31.429502 564858 init.cc:105] After Parse: argc is 2 I1107 11:51:31.442871 564860 init.cc:97] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=enable_gpu_memory_usage_log_mb,search_cache_max_number,print_allocator_trace_info,npu_storage_format,gpu_allocator_retry_time,pe_profile_fname,graph_embedding_split_infer_mode,gpugraph_merge_grads_segment_size,graph_edges_split_only_by_src_id,memory_fraction_of_eager_deletion,tracer_mkldnn_ops_on,benchmark,sort_sum_gradient,cudnn_exhaustive_search,check_nan_inf,convert_all_blocks,gpugraph_storage_mode,prim_enabled,enable_pir_with_pt_in_dy2st,graph_get_neighbor_id,enable_record_memory,embedding_deterministic,enable_pir_api,use_cuda_managed_memory,static_executor_perfstat_filepath,sync_after_alloc,enable_api_kernel_fallback,enable_pir_in_executor_trace_run,reader_queue_speed_test_mode,gpugraph_parallel_stream_num,paddle_num_threads,allreduce_record_one_event,apply_pass_to_program,call_stack_level,print_sub_graph_dir,dygraph_debug,rpc_send_thread_num,gpugraph_enable_gpu_direct_access,alloc_fill_value,fraction_of_cuda_pinned_memory_to_use,cublaslt_exhaustive_search_times,graph_load_in_parallel,use_fast_math,gpugraph_offload_param_stat,executor_log_deps_every_microseconds,run_kp_kernel,gpu_memory_limit_mb,graph_edges_split_debug,get_host_by_name_time,fleet_executor_with_standalone,gpugraph_slot_feasign_max_num,new_executor_use_cuda_graph,dist_threadpool_size,gpugraph_offload_gather_copy_maxsize,use_mkldnn,eager_delete_tensor_gb,log_memory_stats,initial_gpu_memory_in_mb,einsum_opt,enable_tracker_all2all,gpugraph_force_device_batch_num_equal,initial_cpu_memory_in_mb,async_trace_count,new_executor_static_build,enable_pir_in_executor,enable_dump_main_program,ir_inplace_kernel_blacklist,enable_all2all_use_fp16,eager_delete_scope,graph_edges_debug_node_id,cpu_deterministic,init_allocated_mem,auto_growth_chunk_size_in_mb,enable_sparse_inner_gather,gpugraph_parallel_copyer_split_maxsize,selected_gpus,gpugraph_sparse_table_storage_mode,add_dependency_for_communication_op,check_kernel_launch,use_virtual_memory_auto_growth,graph_neighbor_size_percent,low_precision_op_list,gpugraph_enable_segment_merge_grads,fraction_of_cpu_memory_to_use,gpugraph_hbm_table_load_factor,conv2d_disable_cudnn,multi_node_sample_use_gpu_table,free_when_no_cache_hit,cudnn_deterministic,enable_auto_rdma_trans,enable_auto_detect_gpu_topo,use_shm_cache,gpugraph_debug_gpu_memory,sync_nccl_allreduce,pir_subgraph_saving_dir,enable_async_trace,enable_gpu_memory_usage_log,conv_workspace_size_limit,gpugraph_offload_param_extends,use_system_allocator,enable_exit_when_partial_worker,set_to_1d,reallocate_gpu_memory_in_mb,nccl_blocking_wait,tensor_operants_mode,query_dest_rank_by_multi_node,fuse_parameter_groups_size,cudnn_batchnorm_spatial_persistent,gpugraph_load_node_list_into_hbm,new_executor_serial_run,print_ir,max_inplace_grad_add,enable_unused_var_check,check_nan_inf_level,tracer_mkldnn_ops_off,gpugraph_enable_print_op_debug,pir_apply_inplace_pass,dynamic_static_unified_comm,new_executor_use_local_scope,enable_dependency_builder_debug_info,gpugraph_dedup_pull_push_mode,enable_cublas_tensor_op_math,local_exe_sub_scope_limit,gpugraph_enable_hbm_table_collision_stat,use_stride_kernel,new_executor_sequential_run,fast_eager_deletion_mode,graph_edges_split_mode,cudnn_exhaustive_search_times,enable_opt_get_features,use_stream_safe_cuda_allocator,jit_engine_type,host_trace_level,graph_edges_debug_node_num,free_idle_chunk,enable_adjust_op_order,use_pinned_memory,new_executor_use_inplace,fuse_parameter_memory_size,communicator_max_merge_var_num,cache_inference_while_scope,graph_metapath_split_opt,rocksdb_path,use_auto_growth_pinned_allocator,multiple_of_cupti_buffer_size,communicator_send_queue_size,trt_ibuilder_cache,tracer_profile_fname,enable_neighbor_list_use_uva,use_autotune,allocator_strategy,communicator_is_sgd_optimizer,enable_graph_multi_node_sampling,fraction_of_gpu_memory_to_use,inner_op_parallelism,benchmark_nccl,gemm_use_half_precision_compute_type I1107 11:51:31.442945 564860 init.cc:105] After Parse: argc is 2 ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='2', default_value='')

I1107 11:51:31.831405 564862 tcp_utils.cc:107] Retry to connect to 127.0.0.1:36617 while the server is not yet listening. ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')

I1107 11:51:31.870724 564864 tcp_utils.cc:107] Retry to connect to 127.0.0.1:36617 while the server is not yet listening. ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')

I1107 11:51:31.882398 564858 tcp_utils.cc:181] The server starts to listen on IP_ANY:36617 I1107 11:51:31.882613 564858 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617 ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')

I1107 11:51:31.902621 564860 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617 I1107 11:51:31.923488 564860 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000 I1107 11:51:32.051061 564860 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x61d2ea900340 for it. I1107 11:51:32.052687 564860 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144 I1107 11:51:32.052911 564860 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:32.052918 564860 dygraph_functions.cc:70107] { Input: []} W1107 11:51:32.054075 564860 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0 I1107 11:51:32.054250 564860 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path. W1107 11:51:32.054569 564860 gpu_resources.cc:164] device: 1, cuDNN Version: 9.5. I1107 11:51:32.064901 564860 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path. I1107 11:51:32.065248 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c6abae00000), and remaining 0 I1107 11:51:32.066383 564860 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x61d2eb21ca30 for it. I1107 11:51:32.066469 564860 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:32.066495 564860 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:32.066550 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c6abae00200), and remaining 0 I1107 11:51:32.066569 564860 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:32.067461 564860 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x61d2eb3a15c0 for it. I1107 11:51:32.067524 564860 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:32.067529 564860 dygraph_functions.cc:70107] { Input: []} I1107 11:51:32.067556 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c6abae00400), and remaining 0 I1107 11:51:32.067656 564860 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x61d2eb3a2430 for it. I1107 11:51:32.067679 564860 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:32.067689 564860 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:32.067704 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c6abae00600), and remaining 0 I1107 11:51:32.067713 564860 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:32.068111 564860 process_group_nccl.cc:702] init nccl rank_in_group: 1, nranks: 4, gid: 0, place key: Place(gpu:1), store_key: nccl_ids/0/0 I1107 11:51:32.068403 564860 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path. I1107 11:51:34.831647 564862 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617 I1107 11:51:34.871026 564864 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617 I1107 11:51:34.908504 564862 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000 I1107 11:51:34.908720 564864 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000 I1107 11:51:35.001463 564858 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000 I1107 11:51:35.197360 564862 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x643fc02c4130 for it. I1107 11:51:35.198612 564862 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144 I1107 11:51:35.198832 564862 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:35.198839 564862 dygraph_functions.cc:70107] { Input: []} W1107 11:51:35.199951 564862 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0 I1107 11:51:35.200111 564862 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path. W1107 11:51:35.200461 564862 gpu_resources.cc:164] device: 2, cuDNN Version: 9.5. I1107 11:51:35.206266 564864 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x643843ba1ce0 for it. I1107 11:51:35.207821 564864 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144 I1107 11:51:35.208086 564864 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:35.208091 564864 dygraph_functions.cc:70107] { Input: []} W1107 11:51:35.209548 564864 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0 I1107 11:51:35.209707 564864 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path. W1107 11:51:35.210011 564864 gpu_resources.cc:164] device: 3, cuDNN Version: 9.5. I1107 11:51:35.210270 564862 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path. I1107 11:51:35.212386 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x77f85ae00000), and remaining 0 I1107 11:51:35.215036 564862 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x643fc0b51090 for it. I1107 11:51:35.215116 564862 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:35.215137 564862 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.215183 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x77f85ae00200), and remaining 0 I1107 11:51:35.215205 564862 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.216069 564862 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x643fc0cd5ee0 for it. I1107 11:51:35.216130 564862 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:35.216135 564862 dygraph_functions.cc:70107] { Input: []} I1107 11:51:35.216156 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x77f85ae00400), and remaining 0 I1107 11:51:35.216266 564862 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x643fc0cd6d40 for it. I1107 11:51:35.216290 564862 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:35.216302 564862 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.216316 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x77f85ae00600), and remaining 0 I1107 11:51:35.216324 564862 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.216734 564862 process_group_nccl.cc:702] init nccl rank_in_group: 2, nranks: 4, gid: 0, place key: Place(gpu:2), store_key: nccl_ids/0/0 I1107 11:51:35.217023 564862 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path. I1107 11:51:35.218042 564864 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path. I1107 11:51:35.218358 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c8832e00000), and remaining 0 I1107 11:51:35.219518 564864 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x64384431c5d0 for it. I1107 11:51:35.219609 564864 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:35.219630 564864 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.219679 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c8832e00200), and remaining 0 I1107 11:51:35.219695 564864 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.220618 564864 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x6438444a1460 for it. I1107 11:51:35.220692 564864 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:35.220697 564864 dygraph_functions.cc:70107] { Input: []} I1107 11:51:35.220723 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c8832e00400), and remaining 0 I1107 11:51:35.220826 564864 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x6438444a22f0 for it. I1107 11:51:35.220849 564864 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:35.220860 564864 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.220875 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c8832e00600), and remaining 0 I1107 11:51:35.220885 564864 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.221338 564864 process_group_nccl.cc:702] init nccl rank_in_group: 3, nranks: 4, gid: 0, place key: Place(gpu:3), store_key: nccl_ids/0/0 I1107 11:51:35.221652 564864 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path. I1107 11:51:35.224038 564858 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x5abef1012670 for it. I1107 11:51:35.225282 564858 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144 I1107 11:51:35.225502 564858 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:35.225508 564858 dygraph_functions.cc:70107] { Input: []} W1107 11:51:35.226593 564858 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0 I1107 11:51:35.226743 564858 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path. W1107 11:51:35.227048 564858 gpu_resources.cc:164] device: 0, cuDNN Version: 9.5. I1107 11:51:35.234910 564858 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path. I1107 11:51:35.235195 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e00000), and remaining 0 I1107 11:51:35.236325 564858 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x5abef1836d60 for it. I1107 11:51:35.236399 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:35.236418 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.236459 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e00200), and remaining 0 I1107 11:51:35.236474 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.237351 564858 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x5abef19bbba0 for it. I1107 11:51:35.237411 564858 dygraph_functions.cc:70087] Running AD API: uniform I1107 11:51:35.237416 564858 dygraph_functions.cc:70107] { Input: []} I1107 11:51:35.237437 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e00400), and remaining 0 I1107 11:51:35.237535 564858 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x5abef19bca30 for it. I1107 11:51:35.237557 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:35.237568 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.237582 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e00600), and remaining 0 I1107 11:51:35.237591 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.237993 564858 process_group_nccl.cc:702] init nccl rank_in_group: 0, nranks: 4, gid: 0, place key: Place(gpu:0), store_key: nccl_ids/0/0 I1107 11:51:35.238263 564858 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path. I1107 11:51:35.238903 564858 comm_context_manager.cc:90] init NCCLCommContext rank: 0, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 I1107 11:51:35.279462 564860 comm_context_manager.cc:90] init NCCLCommContext rank: 1, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 I1107 11:51:35.279487 564862 comm_context_manager.cc:90] init NCCLCommContext rank: 2, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 I1107 11:51:35.320335 564864 comm_context_manager.cc:90] init NCCLCommContext rank: 3, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 I1107 11:51:35.562363 564862 process_group_nccl.cc:725] Get nccl comm: 0x643fc0cf1ee0 for place_key: Place(gpu:2) on rank_in_group: 2 nranks: 4 gid: 0 I1107 11:51:35.562394 564864 process_group_nccl.cc:725] Get nccl comm: 0x6438444bd670 for place_key: Place(gpu:3) on rank_in_group: 3 nranks: 4 gid: 0 I1107 11:51:35.562404 564858 process_group_nccl.cc:725] Get nccl comm: 0x5abef19d7f50 for place_key: Place(gpu:0) on rank_in_group: 0 nranks: 4 gid: 0 I1107 11:51:35.562404 564860 process_group_nccl.cc:725] Get nccl comm: 0x61d2eb3bd5b0 for place_key: Place(gpu:1) on rank_in_group: 1 nranks: 4 gid: 0 I1107 11:51:35.562464 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00000, recvbuff: 0x77f85ae00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.562492 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00000, recvbuff: 0x736062e00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.562491 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00000, recvbuff: 0x7c8832e00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.562521 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00000, recvbuff: 0x7c6abae00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.601517 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00200, recvbuff: 0x7c6abae00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.601598 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00400, recvbuff: 0x7c6abae00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.601631 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00600, recvbuff: 0x7c6abae00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.601874 564860 reducer.cc:103] var[linear_0.w_0] 's type is float32 I1107 11:51:35.601887 564860 reducer.cc:103] var[linear_0.b_0] 's type is float32 I1107 11:51:35.601894 564860 reducer.cc:103] var[linear_1.w_0] 's type is float32 I1107 11:51:35.601899 564860 reducer.cc:103] var[linear_1.b_0] 's type is float32 I1107 11:51:35.601940 564860 reducer.cc:486] Start construct the Reducer ... I1107 11:51:35.601948 564860 reducer.cc:534] Start initialize groups .. I1107 11:51:35.601953 564860 reducer.cc:583] InitializeDenseGroups. I1107 11:51:35.601974 564860 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4 [0 1 2 3] I1107 11:51:35.602404 564860 dygraph_functions.cc:60776] Running AD API: gaussian I1107 11:51:35.602411 564860 dygraph_functions.cc:60796] { Input: []} I1107 11:51:35.602491 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c6abae00800), and remaining 0 I1107 11:51:35.602643 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00200, recvbuff: 0x736062e00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.602715 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00400, recvbuff: 0x736062e00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.602743 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00600, recvbuff: 0x736062e00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.602914 564858 reducer.cc:103] var[linear_0.w_0] 's type is float32 I1107 11:51:35.602923 564858 reducer.cc:103] var[linear_0.b_0] 's type is float32 I1107 11:51:35.602927 564858 reducer.cc:103] var[linear_1.w_0] 's type is float32 I1107 11:51:35.602931 564858 reducer.cc:103] var[linear_1.b_0] 's type is float32 I1107 11:51:35.602962 564858 reducer.cc:486] Start construct the Reducer ... I1107 11:51:35.602967 564858 reducer.cc:534] Start initialize groups .. I1107 11:51:35.602954 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00200, recvbuff: 0x77f85ae00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.602970 564858 reducer.cc:583] InitializeDenseGroups. I1107 11:51:35.602990 564858 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4 [0 1 2 3] I1107 11:51:35.603018 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00400, recvbuff: 0x77f85ae00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.603044 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00600, recvbuff: 0x77f85ae00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.603212 564862 reducer.cc:103] var[linear_0.w_0] 's type is float32 I1107 11:51:35.603220 564862 reducer.cc:103] var[linear_0.b_0] 's type is float32 I1107 11:51:35.603205 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00200, recvbuff: 0x7c8832e00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.603225 564862 reducer.cc:103] var[linear_1.w_0] 's type is float32 I1107 11:51:35.603232 564862 reducer.cc:103] var[linear_1.b_0] 's type is float32 I1107 11:51:35.603260 564862 reducer.cc:486] Start construct the Reducer ... I1107 11:51:35.603266 564862 reducer.cc:534] Start initialize groups .. I1107 11:51:35.603267 564862 reducer.cc:583] InitializeDenseGroups. I1107 11:51:35.603283 564862 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4 [0 1 2 3] I1107 11:51:35.603283 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00400, recvbuff: 0x7c8832e00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.603319 564858 dygraph_functions.cc:60776] Running AD API: gaussian I1107 11:51:35.603319 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00600, recvbuff: 0x7c8832e00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL I1107 11:51:35.603325 564858 dygraph_functions.cc:60796] { Input: []} I1107 11:51:35.603379 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e00800), and remaining 0 I1107 11:51:35.603543 564864 reducer.cc:103] var[linear_0.w_0] 's type is float32 I1107 11:51:35.603554 564864 reducer.cc:103] var[linear_0.b_0] 's type is float32 I1107 11:51:35.603559 564864 reducer.cc:103] var[linear_1.w_0] 's type is float32 I1107 11:51:35.603564 564864 reducer.cc:103] var[linear_1.b_0] 's type is float32 I1107 11:51:35.603603 564864 reducer.cc:486] Start construct the Reducer ... I1107 11:51:35.603606 564862 dygraph_functions.cc:60776] Running AD API: gaussian I1107 11:51:35.603608 564864 reducer.cc:534] Start initialize groups .. I1107 11:51:35.603610 564862 dygraph_functions.cc:60796] { Input: []} I1107 11:51:35.603612 564864 reducer.cc:583] InitializeDenseGroups. I1107 11:51:35.603636 564864 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4 [0 1 2 3] I1107 11:51:35.603662 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x77f85ae00800), and remaining 0 I1107 11:51:35.604096 564864 dygraph_functions.cc:60776] Running AD API: gaussian I1107 11:51:35.604103 564864 dygraph_functions.cc:60796] { Input: []} I1107 11:51:35.604163 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c8832e00800), and remaining 0 I1107 11:51:35.800139 564858 dygraph_functions.cc:62568] Running AD API: matmul I1107 11:51:35.800189 564858 dygraph_functions.cc:62630] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.800351 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e00a00), and remaining 0 I1107 11:51:35.800366 564858 matmul_kernel_impl.h:374] MatMul's case 8 I1107 11:51:35.840879 564858 dynamic_loader.cc:227] Try to find library: libcublas.so from default system path. I1107 11:51:35.920360 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from MatmulGradNode (addr: 0x5abef1b7fd20) to GradNodeAccumulation (addr: 0x5abef1012670) I1107 11:51:35.920406 564858 dygraph_functions.cc:52623] Running AD API: add I1107 11:51:35.920445 564858 dygraph_functions.cc:52695] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.920547 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e61800), and remaining 0 I1107 11:51:35.920583 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=100, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.967942 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from AddGradNode (addr: 0x5abef694e240) to MatmulGradNode (addr: 0x5abef1b7fd20) I1107 11:51:35.967962 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from AddGradNode (addr: 0x5abef694e240) to GradNodeAccumulation (addr: 0x5abef1836d60) I1107 11:51:35.968066 564858 dygraph_functions.cc:62568] Running AD API: matmul I1107 11:51:35.968092 564858 dygraph_functions.cc:62630] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.968137 564858 matmul_kernel_impl.h:374] MatMul's case 8 I1107 11:51:35.971632 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from MatmulGradNode (addr: 0x5abefb3c03d0) to AddGradNode (addr: 0x5abef694e240) I1107 11:51:35.971647 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from MatmulGradNode (addr: 0x5abefb3c03d0) to GradNodeAccumulation (addr: 0x5abef19bbba0) I1107 11:51:35.971655 564858 dygraph_functions.cc:52623] Running AD API: add I1107 11:51:35.971670 564858 dygraph_functions.cc:52695] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.971712 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.971729 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from AddGradNode (addr: 0x5abefbbb45b0) to MatmulGradNode (addr: 0x5abefb3c03d0) I1107 11:51:35.971733 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from AddGradNode (addr: 0x5abefbbb45b0) to GradNodeAccumulation (addr: 0x5abef19bca30) I1107 11:51:35.971820 564858 reducer.cc:680] after forward, then reset count for backward. I1107 11:51:35.971935 564858 dygraph_functions.cc:60776] Running AD API: gaussian I1107 11:51:35.971940 564858 dygraph_functions.cc:60796] { Input: []} I1107 11:51:35.971997 564858 dygraph_functions.cc:68140] Running AD API: subtract I1107 11:51:35.972007 564858 dygraph_functions.cc:68212] { Input: [ ( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.972057 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e61a00), and remaining 0 I1107 11:51:35.972070 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.972107 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from SubtractGradNode (addr: 0x5abefbbb6360) to AddGradNode (addr: 0x5abefbbb45b0) I1107 11:51:35.972132 564858 dygraph_functions.cc:45444] Running AD API: square I1107 11:51:35.972141 564858 dygraph_functions.cc:45500] { Input: [ ( x , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.972183 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e61c00), and remaining 0 I1107 11:51:35.972196 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.982776 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from SquareGradNode (addr: 0x5abef00c9f20) to SubtractGradNode (addr: 0x5abefbbb6360) I1107 11:51:35.982831 564858 dygraph_functions.cc:63236] Running AD API: mean I1107 11:51:35.982846 564858 dygraph_functions.cc:63292] { Input: [ ( x , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.982913 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e61e00), and remaining 0 I1107 11:51:35.983458 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62000), and remaining 0 I1107 11:51:35.984371 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from MeanGradNode (addr: 0x5abef00bce50) to SquareGradNode (addr: 0x5abef00c9f20) I1107 11:51:35.984504 564858 backward.cc:431] Run in Backward I1107 11:51:35.984510 564858 backward.cc:113] Start Backward I1107 11:51:35.984520 564858 backward.cc:196] Fill grad input tensor 0 with 1.0 I1107 11:51:35.984555 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.984586 564858 backward.cc:254] Preparing GradNode:MeanGradNode addr:0x5abef00bce50 I1107 11:51:35.984599 564858 nodes.cc:36296] Running AD API GRAD: mean_grad I1107 11:51:35.984637 564858 nodes.cc:36346] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.984675 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:35.998857 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:35.998872 564858 backward.cc:323] Node: MeanGradNode addr:0x5abef00bce50, Found pending node: SquareGradNode addr: 0x5abef00c9f20 I1107 11:51:35.998881 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:35.998895 564858 backward.cc:254] Preparing GradNode:SquareGradNode addr:0x5abef00c9f20 I1107 11:51:35.998908 564858 nodes.cc:26375] Running AD API GRAD: square_grad I1107 11:51:35.998932 564858 nodes.cc:26442] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]), ]} I1107 11:51:35.998967 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.015367 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.015383 564858 backward.cc:323] Node: SquareGradNode addr:0x5abef00c9f20, Found pending node: SubtractGradNode addr: 0x5abefbbb6360 I1107 11:51:36.015389 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.015395 564858 backward.cc:254] Preparing GradNode:SubtractGradNode addr:0x5abefbbb6360 I1107 11:51:36.015400 564858 nodes.cc:39588] Running AD API GRAD: subtract_grad I1107 11:51:36.015420 564858 nodes.cc:39664] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.015456 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.015462 564858 backward.cc:323] Node: SubtractGradNode addr:0x5abefbbb6360, Found pending node: AddGradNode addr: 0x5abefbbb45b0 I1107 11:51:36.015467 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.015473 564858 backward.cc:254] Preparing GradNode:AddGradNode addr:0x5abefbbb45b0 I1107 11:51:36.015482 564858 nodes.cc:31050] Running AD API GRAD: add_grad I1107 11:51:36.015494 564858 nodes.cc:31126] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.016105 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.016113 564858 backward.cc:323] Node: AddGradNode addr:0x5abefbbb45b0, Found pending node: MatmulGradNode addr: 0x5abefb3c03d0 I1107 11:51:36.016119 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.016124 564858 backward.cc:323] Node: AddGradNode addr:0x5abefbbb45b0, Found pending node: GradNodeAccumulation addr: 0x5abef19bca30 I1107 11:51:36.016129 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.016132 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef19bca30 I1107 11:51:36.016139 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation I1107 11:51:36.016141 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abefbbb6ba0 I1107 11:51:36.016148 564858 reducer.cc:768] Tensor[3] [linear_1.b_0@Grad] arrived and triggered disthook I1107 11:51:36.016155 564858 reducer.cc:784] Tensor[3][linear_1.b_0] is marked ready. I1107 11:51:36.016161 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation I1107 11:51:36.016166 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.016170 564858 backward.cc:254] Preparing GradNode:MatmulGradNode addr:0x5abefb3c03d0 I1107 11:51:36.016178 564858 nodes.cc:35691] Running AD API GRAD: matmul_grad I1107 11:51:36.016192 564858 nodes.cc:35748] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.016247 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e62200), and remaining 0 I1107 11:51:36.018721 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.018735 564858 backward.cc:323] Node: MatmulGradNode addr:0x5abefb3c03d0, Found pending node: AddGradNode addr: 0x5abef694e240 I1107 11:51:36.018741 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.018747 564858 backward.cc:323] Node: MatmulGradNode addr:0x5abefb3c03d0, Found pending node: GradNodeAccumulation addr: 0x5abef19bbba0 I1107 11:51:36.018752 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.018756 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef19bbba0 I1107 11:51:36.018760 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation I1107 11:51:36.018764 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abefbbb4e20 I1107 11:51:36.018766 564858 reducer.cc:768] Tensor[2] [linear_1.w_0@Grad] arrived and triggered disthook I1107 11:51:36.018771 564858 reducer.cc:784] Tensor[2][linear_1.w_0] is marked ready. I1107 11:51:36.018776 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation I1107 11:51:36.018780 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.018783 564858 backward.cc:254] Preparing GradNode:AddGradNode addr:0x5abef694e240 I1107 11:51:36.018787 564858 nodes.cc:31050] Running AD API GRAD: add_grad I1107 11:51:36.018801 564858 nodes.cc:31126] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.018858 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.018864 564858 backward.cc:323] Node: AddGradNode addr:0x5abef694e240, Found pending node: MatmulGradNode addr: 0x5abef1b7fd20 I1107 11:51:36.018867 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.018872 564858 backward.cc:323] Node: AddGradNode addr:0x5abef694e240, Found pending node: GradNodeAccumulation addr: 0x5abef1836d60 I1107 11:51:36.018877 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.018880 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef1836d60 I1107 11:51:36.018885 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation I1107 11:51:36.018889 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abeee981b90 I1107 11:51:36.018893 564858 reducer.cc:768] Tensor[1] [linear_0.b_0@Grad] arrived and triggered disthook I1107 11:51:36.018898 564858 reducer.cc:784] Tensor[1][linear_0.b_0] is marked ready. I1107 11:51:36.018903 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation I1107 11:51:36.018908 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.018911 564858 backward.cc:254] Preparing GradNode:MatmulGradNode addr:0x5abef1b7fd20 I1107 11:51:36.018915 564858 nodes.cc:35691] Running AD API GRAD: matmul_grad I1107 11:51:36.018926 564858 nodes.cc:35748] { Input: [ ( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.018994 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.019001 564858 backward.cc:323] Node: MatmulGradNode addr:0x5abef1b7fd20, Found pending node: GradNodeAccumulation addr: 0x5abef1012670 I1107 11:51:36.019003 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0 I1107 11:51:36.019008 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef1012670 I1107 11:51:36.019012 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation I1107 11:51:36.019016 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abefdd6c450 I1107 11:51:36.019021 564858 reducer.cc:768] Tensor[0] [linear_0.w_0@Grad] arrived and triggered disthook I1107 11:51:36.019024 564858 reducer.cc:784] Tensor[0][linear_0.w_0] is marked ready. I1107 11:51:36.019032 564858 reducer.cc:933] Group[0] is ready I1107 11:51:36.019035 564858 reducer.cc:1073] group [0] start fused_allreduce. I1107 11:51:36.059813 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=121, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.060494 564858 process_group_nccl.cc:238] [ncclAllReduce] sendbuff: 0x736062e62200, recvbuff: 0x736062e62200, count: 121, datatype: float32, redop: SUM, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 0, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL I1107 11:51:36.060643 564858 reducer.cc:429] Free densecontents 121 I1107 11:51:36.060663 564858 reducer.cc:1064] In the batch, Reducer is finished. I1107 11:51:36.060668 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation I1107 11:51:36.060672 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes. I1107 11:51:36.061123 564858 eager.cc:119] Tensor(learning_rate_0) have not GradNode, add GradNodeAccumulation0x5abf09987670 for it. I1107 11:51:36.061206 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061219 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061254 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62400), and remaining 0 I1107 11:51:36.061264 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.061373 564858 eager.cc:119] Tensor(linear_0.w_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf09988610 for it. I1107 11:51:36.061429 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061437 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061457 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e62600), and remaining 0 I1107 11:51:36.061465 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=100, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.061496 564858 eager.cc:119] Tensor(linear_0.w_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf099898a0 for it. I1107 11:51:36.061529 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061538 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061549 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e62800), and remaining 0 I1107 11:51:36.061555 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=100, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.061584 564858 eager.cc:119] Tensor(linear_0.w_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf0998aad0 for it. I1107 11:51:36.061631 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061638 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061722 564858 eager.cc:119] Tensor(linear_0.w_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf0998c030 for it. I1107 11:51:36.061735 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061740 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061779 564858 eager.cc:119] Tensor(linear_0.b_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf0998caa0 for it. I1107 11:51:36.061818 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061825 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061838 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62a00), and remaining 0 I1107 11:51:36.061844 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.061872 564858 eager.cc:119] Tensor(linear_0.b_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf0998dea0 for it. I1107 11:51:36.061904 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061910 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061920 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62c00), and remaining 0 I1107 11:51:36.061926 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.061950 564858 eager.cc:119] Tensor(linear_0.b_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf0998f3a0 for it. I1107 11:51:36.061964 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.061969 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.061996 564858 eager.cc:119] Tensor(linear_0.b_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf099902e0 for it. I1107 11:51:36.062009 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062014 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062047 564858 eager.cc:119] Tensor(linear_1.w_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf09991030 for it. I1107 11:51:36.062083 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062088 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062103 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62e00), and remaining 0 I1107 11:51:36.062110 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.062135 564858 eager.cc:119] Tensor(linear_1.w_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf099923e0 for it. I1107 11:51:36.062165 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062172 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062186 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e63000), and remaining 0 I1107 11:51:36.062192 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.062227 564858 eager.cc:119] Tensor(linear_1.w_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf09993d50 for it. I1107 11:51:36.062239 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062245 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062264 564858 eager.cc:119] Tensor(linear_1.w_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf09994a60 for it. I1107 11:51:36.062276 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062280 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062314 564858 eager.cc:119] Tensor(linear_1.b_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf099958e0 for it. I1107 11:51:36.062346 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062352 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062364 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e63200), and remaining 0 I1107 11:51:36.062369 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.062392 564858 eager.cc:119] Tensor(linear_1.b_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf09996de0 for it. I1107 11:51:36.062422 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062427 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062439 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e63400), and remaining 0 I1107 11:51:36.062445 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512 I1107 11:51:36.062469 564858 eager.cc:119] Tensor(linear_1.b_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf09998310 for it. I1107 11:51:36.062479 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062485 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062510 564858 eager.cc:119] Tensor(linear_1.b_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf099992e0 for it. I1107 11:51:36.062520 564858 dygraphfunctions.cc:59543] Running AD API: full I1107 11:51:36.062526 564858 dygraph_functions.cc:59583] { Input: [ ( output , [[ Not specified tensor log level ]]), ]} I1107 11:51:36.062577 564858 dygraphfunctions.cc:2692] Running AD API: adam I1107 11:51:36.062600 564858 dygraph_functions.cc:2777] { Input: [ ( param , [[ Not specified tensor log level ]]),
( grad , [[ Not specified tensor log level ]]),
( learning_rate , [[ Not specified tensor log level ]]),
( moment1 , [[ Not specified tensor log level ]]),
( moment2 , [[ Not specified tensor log level ]]),
( beta1_pow , [[ Not specified tensor log level ]]),
( beta2_pow , [[ Not specified tensor log level ]]),
( master_param , [{ UnDefinedTensor }]),
( skip_update , [{ UnDefinedTensor }]), ]} I1107 11:51:36.062630 564858 multiary.cc:184] dims of Beta1Pow : [1] I1107 11:51:36.062635 564858 multiary.cc:192] dims of Beta2Pow : [1] I1107 11:51:36.062642 564858 adam_kernel.cu:187] beta1_pow.numel() : 1beta2_pow.numel() : 1 I1107 11:51:36.062647 564858 adam_kernel.cu:189] param.numel(): 100

PaddlePaddle / Paddle

paddle.utils.run_check()卡住 #69092

请提出你的问题 Please ask your question

I1031 08:39:29.439553 129 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52872 while the server is not yet listening. ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')

FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')

I1107 11:51:31.831405 564862 tcp_utils.cc:107] Retry to connect to 127.0.0.1:36617 while the server is not yet listening. ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')

I1107 11:51:31.870724 564864 tcp_utils.cc:107] Retry to connect to 127.0.0.1:36617 while the server is not yet listening. ======================= Modified FLAGS detected ======================= FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')