Please ask your question

I'm using the official image registry.baidubce.com/paddlepaddle/paddle:latest-dev-cuda12.0-cudnn8.9-trt8.6-gcc12.2 with paddlepaddle-gpu==2.6.1.post120 installed. When I test 4-card distributed training, paddle.utils.run_check() hangs.

Environment: Python 3.9.18 (main, Aug 25 2023, 13:20:04) [GCC 9.4.0] on linux
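The failing check, as a minimal sketch (run_check verifies single-GPU and then multi-GPU operation in turn; per the report above, it is the multi-GPU stage that stalls on this machine):

import paddle
# Verifies the install on 1 GPU, then on all visible GPUs;
# the 4-card stage is where the hang reported above occurs.
paddle.utils.run_check()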
What GPU is your machine using? Try setting CUDA_VISIBLE_DEVICES=0 and then CUDA_VISIBLE_DEVICES=0,1 to see whether single-card and two-card runs work.
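(The suggested check boils down to the following sketch, matching the snippets the reporter posts below; note that CUDA_VISIBLE_DEVICES must be set before paddle is imported:)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # then "0,1" for the two-card test
import paddle
paddle.utils.run_check()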
Four L40S cards.
A single card works:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import paddle
paddle.utils.run_check()

Running verify PaddlePaddle program ...
I1101 09:44:44.325242 1021 interpretercore.cc:237] New Executor is Running.
W1101 09:44:44.325762 1021 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.4, Runtime API Version: 12.0
W1101 09:44:44.326946 1021 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
I1101 09:44:44.397505 1021 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
Two cards also work:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
import paddle
paddle.utils.run_check()

Running verify PaddlePaddle program ...
I1101 09:45:50.375473 1097 interpretercore.cc:237] New Executor is Running.
W1101 09:45:50.376031 1097 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.4, Runtime API Version: 12.0
W1101 09:45:50.377082 1097 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
I1101 09:45:50.476362 1097 interpreter_util.cc:518] Standalone Executor is Used.
PaddlePaddle works well on 1 GPU.
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
I1101 09:45:51.987974 1175 tcp_utils.cc:181] The server starts to listen on IP_ANY:44670
I1101 09:45:51.988540 1175 tcp_utils.cc:130] Successfully connected to 127.0.0.1:44670
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='1', default_value='')
I1101 09:45:51.994066 1177 tcp_utils.cc:130] Successfully connected to 127.0.0.1:44670
W1101 09:45:52.305198 1175 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.4, Runtime API Version: 12.0
W1101 09:45:52.306423 1175 gpu_resources.cc:149] device: 0, cuDNN Version: 8.8.
W1101 09:45:52.384938 1177 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.4, Runtime API Version: 12.0
W1101 09:45:52.387213 1177 gpu_resources.cc:149] device: 1, cuDNN Version: 8.8.
I1101 09:45:53.163136 1201 tcp_store.cc:273] receive shutdown event and so quit from MasterDaemon run loop
PaddlePaddle works well on 2 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
When I run PaddleOCR with python -m paddle.distributed.launch and specify two cards, the same thing happens: GPU memory usage stays tiny, but utilization sits at 100% and training hangs.
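(For reference, such a run is typically launched along these lines; the training config path here is illustrative, not copied from this setup:)

python -m paddle.distributed.launch --gpus "0,1" tools/train.py -c configs/det/det_mv3_db.yml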
We don't seem to have validated this GPU model yet. Normally, if two cards work, four cards shouldn't cause much trouble either. Could you try 3.0.0b1?
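(For completeness: the beta would be installed with something along the lines of python -m pip install paddlepaddle-gpu==3.0.0b1, using the index URL for the matching CUDA build from the official install guide.)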
I'm running into some odd behavior: with python -m paddle.distributed.launch, specifying "1,2" or "0,4" runs fine, but "0,1" or "0,1,2,3" hangs.
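(A guess at a next diagnostic step, not something tried in this thread: hangs that depend on which pair of GPUs is selected often implicate the PCIe peer-to-peer path between those specific cards, which nvidia-smi topo -m will show. NCCL's standard environment switches can confirm or rule that out:)

import os
# NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables.
# Disabling P2P makes NCCL stage traffic through host memory instead of the
# direct peer path between the two GPUs; DEBUG=INFO prints the transport
# each rank actually selects.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # one of the pairs that hangs
import paddle
paddle.utils.run_check()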
With 3.0.0b1, four-card training hangs just the same. In the verbose log below, only rank 0 (pid 564858) makes progress past the initial parameter broadcasts, running through its first forward pass, backward pass, fused allreduce, and Adam update, while output from the other three ranks stops shortly after the broadcasts and reducer setup:
I1107 11:51:31.902621 564860 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617
I1107 11:51:31.923488 564860 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000
I1107 11:51:32.051061 564860 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x61d2ea900340 for it.
I1107 11:51:32.052687 564860 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144
I1107 11:51:32.052911 564860 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:32.052918 564860 dygraph_functions.cc:70107] { Input: []}
W1107 11:51:32.054075 564860 gpu_resources.cc:119] Please NOTE: device: 1, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0
I1107 11:51:32.054250 564860 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path.
W1107 11:51:32.054569 564860 gpu_resources.cc:164] device: 1, cuDNN Version: 9.5.
I1107 11:51:32.064901 564860 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path.
I1107 11:51:32.065248 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c6abae00000), and remaining 0
I1107 11:51:32.066383 564860 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x61d2eb21ca30 for it.
I1107 11:51:32.066469 564860 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:32.066495 564860 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:32.066550 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c6abae00200), and remaining 0
I1107 11:51:32.066569 564860 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:32.067461 564860 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x61d2eb3a15c0 for it.
I1107 11:51:32.067524 564860 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:32.067529 564860 dygraph_functions.cc:70107] { Input: []}
I1107 11:51:32.067556 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c6abae00400), and remaining 0
I1107 11:51:32.067656 564860 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x61d2eb3a2430 for it.
I1107 11:51:32.067679 564860 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:32.067689 564860 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:32.067704 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c6abae00600), and remaining 0
I1107 11:51:32.067713 564860 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:32.068111 564860 process_group_nccl.cc:702] init nccl rank_in_group: 1, nranks: 4, gid: 0, place key: Place(gpu:1), store_key: nccl_ids/0/0
I1107 11:51:32.068403 564860 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path.
I1107 11:51:34.831647 564862 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617
I1107 11:51:34.871026 564864 tcp_utils.cc:130] Successfully connected to 127.0.0.1:36617
I1107 11:51:34.908504 564862 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000
I1107 11:51:34.908720 564864 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000
I1107 11:51:35.001463 564858 process_group_nccl.cc:129] ProcessGroupNCCL pgtimeout 1800000
I1107 11:51:35.197360 564862 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x643fc02c4130 for it.
I1107 11:51:35.198612 564862 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144
I1107 11:51:35.198832 564862 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:35.198839 564862 dygraph_functions.cc:70107] { Input: []}
W1107 11:51:35.199951 564862 gpu_resources.cc:119] Please NOTE: device: 2, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0
I1107 11:51:35.200111 564862 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path.
W1107 11:51:35.200461 564862 gpu_resources.cc:164] device: 2, cuDNN Version: 9.5.
I1107 11:51:35.206266 564864 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x643843ba1ce0 for it.
I1107 11:51:35.207821 564864 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144
I1107 11:51:35.208086 564864 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:35.208091 564864 dygraph_functions.cc:70107] { Input: []}
W1107 11:51:35.209548 564864 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0
I1107 11:51:35.209707 564864 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path.
W1107 11:51:35.210011 564864 gpu_resources.cc:164] device: 3, cuDNN Version: 9.5.
I1107 11:51:35.210270 564862 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path.
I1107 11:51:35.212386 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x77f85ae00000), and remaining 0
I1107 11:51:35.215036 564862 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x643fc0b51090 for it.
I1107 11:51:35.215116 564862 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:35.215137 564862 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.215183 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x77f85ae00200), and remaining 0
I1107 11:51:35.215205 564862 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.216069 564862 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x643fc0cd5ee0 for it.
I1107 11:51:35.216130 564862 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:35.216135 564862 dygraph_functions.cc:70107] { Input: []}
I1107 11:51:35.216156 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x77f85ae00400), and remaining 0
I1107 11:51:35.216266 564862 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x643fc0cd6d40 for it.
I1107 11:51:35.216290 564862 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:35.216302 564862 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.216316 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x77f85ae00600), and remaining 0
I1107 11:51:35.216324 564862 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.216734 564862 process_group_nccl.cc:702] init nccl rank_in_group: 2, nranks: 4, gid: 0, place key: Place(gpu:2), store_key: nccl_ids/0/0
I1107 11:51:35.217023 564862 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path.
I1107 11:51:35.218042 564864 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path.
I1107 11:51:35.218358 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c8832e00000), and remaining 0
I1107 11:51:35.219518 564864 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x64384431c5d0 for it.
I1107 11:51:35.219609 564864 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:35.219630 564864 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.219679 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c8832e00200), and remaining 0
I1107 11:51:35.219695 564864 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.220618 564864 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x6438444a1460 for it.
I1107 11:51:35.220692 564864 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:35.220697 564864 dygraph_functions.cc:70107] { Input: []}
I1107 11:51:35.220723 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c8832e00400), and remaining 0
I1107 11:51:35.220826 564864 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x6438444a22f0 for it.
I1107 11:51:35.220849 564864 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:35.220860 564864 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.220875 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x7c8832e00600), and remaining 0
I1107 11:51:35.220885 564864 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.221338 564864 process_group_nccl.cc:702] init nccl rank_in_group: 3, nranks: 4, gid: 0, place key: Place(gpu:3), store_key: nccl_ids/0/0
I1107 11:51:35.221652 564864 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path.
I1107 11:51:35.224038 564858 eager.cc:119] Tensor(linear_0.w_0) have not GradNode, add GradNodeAccumulation0x5abef1012670 for it.
I1107 11:51:35.225282 564858 layout_autotune.cc:84] The number of layout agnostic OPs: 626, heavily layout sensitive OPs: 37, lightly layout sensitive OPs: 144
I1107 11:51:35.225502 564858 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:35.225508 564858 dygraph_functions.cc:70107] { Input: []}
W1107 11:51:35.226593 564858 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.9, Driver API Version: 12.7, Runtime API Version: 12.0
I1107 11:51:35.226743 564858 dynamic_loader.cc:227] Try to find library: libcudnn.so from default system path.
W1107 11:51:35.227048 564858 gpu_resources.cc:164] device: 0, cuDNN Version: 9.5.
I1107 11:51:35.234910 564858 dynamic_loader.cc:227] Try to find library: libcuda.so from default system path.
I1107 11:51:35.235195 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e00000), and remaining 0
I1107 11:51:35.236325 564858 eager.cc:119] Tensor(linear_0.b_0) have not GradNode, add GradNodeAccumulation0x5abef1836d60 for it.
I1107 11:51:35.236399 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:35.236418 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.236459 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e00200), and remaining 0
I1107 11:51:35.236474 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.237351 564858 eager.cc:119] Tensor(linear_1.w_0) have not GradNode, add GradNodeAccumulation0x5abef19bbba0 for it.
I1107 11:51:35.237411 564858 dygraph_functions.cc:70087] Running AD API: uniform
I1107 11:51:35.237416 564858 dygraph_functions.cc:70107] { Input: []}
I1107 11:51:35.237437 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e00400), and remaining 0
I1107 11:51:35.237535 564858 eager.cc:119] Tensor(linear_1.b_0) have not GradNode, add GradNodeAccumulation0x5abef19bca30 for it.
I1107 11:51:35.237557 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:35.237568 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.237582 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e00600), and remaining 0
I1107 11:51:35.237591 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.237993 564858 process_group_nccl.cc:702] init nccl rank_in_group: 0, nranks: 4, gid: 0, place key: Place(gpu:0), store_key: nccl_ids/0/0
I1107 11:51:35.238263 564858 dynamic_loader.cc:227] Try to find library: libnccl.so from default system path.
I1107 11:51:35.238903 564858 comm_context_manager.cc:90] init NCCLCommContext rank: 0, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
I1107 11:51:35.279462 564860 comm_context_manager.cc:90] init NCCLCommContext rank: 1, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
I1107 11:51:35.279487 564862 comm_context_manager.cc:90] init NCCLCommContext rank: 2, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
I1107 11:51:35.320335 564864 comm_context_manager.cc:90] init NCCLCommContext rank: 3, size: 4, unique_comm_key: nccl_ids/0/0, unique_key: NCCLCommContext/nccl_ids/0/0, nccl_id: ac2e8022578c07320828dc0a88280000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
I1107 11:51:35.562363 564862 process_group_nccl.cc:725] Get nccl comm: 0x643fc0cf1ee0 for place_key: Place(gpu:2) on rank_in_group: 2 nranks: 4 gid: 0
I1107 11:51:35.562394 564864 process_group_nccl.cc:725] Get nccl comm: 0x6438444bd670 for place_key: Place(gpu:3) on rank_in_group: 3 nranks: 4 gid: 0
I1107 11:51:35.562404 564858 process_group_nccl.cc:725] Get nccl comm: 0x5abef19d7f50 for place_key: Place(gpu:0) on rank_in_group: 0 nranks: 4 gid: 0
I1107 11:51:35.562404 564860 process_group_nccl.cc:725] Get nccl comm: 0x61d2eb3bd5b0 for place_key: Place(gpu:1) on rank_in_group: 1 nranks: 4 gid: 0
I1107 11:51:35.562464 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00000, recvbuff: 0x77f85ae00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.562492 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00000, recvbuff: 0x736062e00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.562491 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00000, recvbuff: 0x7c8832e00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.562521 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00000, recvbuff: 0x7c6abae00000, count: 100, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.601517 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00200, recvbuff: 0x7c6abae00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.601598 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00400, recvbuff: 0x7c6abae00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.601631 564860 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c6abae00600, recvbuff: 0x7c6abae00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x61d2eb3bd5b0, stream: 0x61d2ea47a8e0, rank_in_group: 1, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 1, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.601874 564860 reducer.cc:103] var[linear_0.w_0] 's type is float32
I1107 11:51:35.601887 564860 reducer.cc:103] var[linear_0.b_0] 's type is float32
I1107 11:51:35.601894 564860 reducer.cc:103] var[linear_1.w_0] 's type is float32
I1107 11:51:35.601899 564860 reducer.cc:103] var[linear_1.b_0] 's type is float32
I1107 11:51:35.601940 564860 reducer.cc:486] Start construct the Reducer ...
I1107 11:51:35.601948 564860 reducer.cc:534] Start initialize groups ..
I1107 11:51:35.601953 564860 reducer.cc:583] InitializeDenseGroups.
I1107 11:51:35.601974 564860 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4
[0 1 2 3]
I1107 11:51:35.602404 564860 dygraph_functions.cc:60776] Running AD API: gaussian
I1107 11:51:35.602411 564860 dygraph_functions.cc:60796] { Input: []}
I1107 11:51:35.602491 564860 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c6abae00800), and remaining 0
I1107 11:51:35.602643 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00200, recvbuff: 0x736062e00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.602715 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00400, recvbuff: 0x736062e00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.602743 564858 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x736062e00600, recvbuff: 0x736062e00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.602914 564858 reducer.cc:103] var[linear_0.w_0] 's type is float32
I1107 11:51:35.602923 564858 reducer.cc:103] var[linear_0.b_0] 's type is float32
I1107 11:51:35.602927 564858 reducer.cc:103] var[linear_1.w_0] 's type is float32
I1107 11:51:35.602931 564858 reducer.cc:103] var[linear_1.b_0] 's type is float32
I1107 11:51:35.602962 564858 reducer.cc:486] Start construct the Reducer ...
I1107 11:51:35.602967 564858 reducer.cc:534] Start initialize groups ..
I1107 11:51:35.602954 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00200, recvbuff: 0x77f85ae00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.602970 564858 reducer.cc:583] InitializeDenseGroups.
I1107 11:51:35.602990 564858 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4
[0 1 2 3]
I1107 11:51:35.603018 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00400, recvbuff: 0x77f85ae00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.603044 564862 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x77f85ae00600, recvbuff: 0x77f85ae00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x643fc0cf1ee0, stream: 0x643fbfdaee30, rank_in_group: 2, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 2, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.603212 564862 reducer.cc:103] var[linear_0.w_0] 's type is float32
I1107 11:51:35.603220 564862 reducer.cc:103] var[linear_0.b_0] 's type is float32
I1107 11:51:35.603205 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00200, recvbuff: 0x7c8832e00200, count: 10, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.603225 564862 reducer.cc:103] var[linear_1.w_0] 's type is float32
I1107 11:51:35.603232 564862 reducer.cc:103] var[linear_1.b_0] 's type is float32
I1107 11:51:35.603260 564862 reducer.cc:486] Start construct the Reducer ...
I1107 11:51:35.603266 564862 reducer.cc:534] Start initialize groups ..
I1107 11:51:35.603267 564862 reducer.cc:583] InitializeDenseGroups.
I1107 11:51:35.603283 564862 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4
[0 1 2 3]
I1107 11:51:35.603283 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00400, recvbuff: 0x7c8832e00400, count: 10, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.603319 564858 dygraph_functions.cc:60776] Running AD API: gaussian
I1107 11:51:35.603319 564864 process_group_nccl.cc:373] [ncclBroadcast] sendbuff: 0x7c8832e00600, recvbuff: 0x7c8832e00600, count: 1, datatype: float32, root: 0, ncclcomm: 0x6438444bd670, stream: 0x64384356dd10, rank_in_group: 3, nranks: 4, sync_op: 1, use_calc_stream: 0rank_in_group: 3, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:35.603325 564858 dygraph_functions.cc:60796] { Input: []}
I1107 11:51:35.603379 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e00800), and remaining 0
I1107 11:51:35.603543 564864 reducer.cc:103] var[linear_0.w_0] 's type is float32
I1107 11:51:35.603554 564864 reducer.cc:103] var[linear_0.b_0] 's type is float32
I1107 11:51:35.603559 564864 reducer.cc:103] var[linear_1.w_0] 's type is float32
I1107 11:51:35.603564 564864 reducer.cc:103] var[linear_1.b_0] 's type is float32
I1107 11:51:35.603603 564864 reducer.cc:486] Start construct the Reducer ...
I1107 11:51:35.603606 564862 dygraph_functions.cc:60776] Running AD API: gaussian
I1107 11:51:35.603608 564864 reducer.cc:534] Start initialize groups ..
I1107 11:51:35.603610 564862 dygraph_functions.cc:60796] { Input: []}
I1107 11:51:35.603612 564864 reducer.cc:583] InitializeDenseGroups.
I1107 11:51:35.603636 564864 reducer.cc:577] The Group[0]:numel: 121 ;var number: 4
[0 1 2 3]
I1107 11:51:35.603662 564862 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x77f85ae00800), and remaining 0
I1107 11:51:35.604096 564864 dygraph_functions.cc:60776] Running AD API: gaussian
I1107 11:51:35.604103 564864 dygraph_functions.cc:60796] { Input: []}
I1107 11:51:35.604163 564864 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x7c8832e00800), and remaining 0
I1107 11:51:35.800139 564858 dygraph_functions.cc:62568] Running AD API: matmul
I1107 11:51:35.800189 564858 dygraph_functions.cc:62630] { Input: [
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.800351 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e00a00), and remaining 0
I1107 11:51:35.800366 564858 matmul_kernel_impl.h:374] MatMul's case 8
I1107 11:51:35.840879 564858 dynamic_loader.cc:227] Try to find library: libcublas.so from default system path.
I1107 11:51:35.920360 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from MatmulGradNode (addr: 0x5abef1b7fd20) to GradNodeAccumulation (addr: 0x5abef1012670)
I1107 11:51:35.920406 564858 dygraph_functions.cc:52623] Running AD API: add
I1107 11:51:35.920445 564858 dygraph_functions.cc:52695] { Input: [
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.920547 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e61800), and remaining 0
I1107 11:51:35.920583 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=100, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.967942 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from AddGradNode (addr: 0x5abef694e240) to MatmulGradNode (addr: 0x5abef1b7fd20)
I1107 11:51:35.967962 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from AddGradNode (addr: 0x5abef694e240) to GradNodeAccumulation (addr: 0x5abef1836d60)
I1107 11:51:35.968066 564858 dygraph_functions.cc:62568] Running AD API: matmul
I1107 11:51:35.968092 564858 dygraph_functions.cc:62630] { Input: [
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.968137 564858 matmul_kernel_impl.h:374] MatMul's case 8
I1107 11:51:35.971632 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from MatmulGradNode (addr: 0x5abefb3c03d0) to AddGradNode (addr: 0x5abef694e240)
I1107 11:51:35.971647 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from MatmulGradNode (addr: 0x5abefb3c03d0) to GradNodeAccumulation (addr: 0x5abef19bbba0)
I1107 11:51:35.971655 564858 dygraph_functions.cc:52623] Running AD API: add
I1107 11:51:35.971670 564858 dygraph_functions.cc:52695] { Input: [
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.971712 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.971729 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from AddGradNode (addr: 0x5abefbbb45b0) to MatmulGradNode (addr: 0x5abefb3c03d0)
I1107 11:51:35.971733 564858 grad_node_info.cc:293] Add Edges for slot: 1, the Edge is from AddGradNode (addr: 0x5abefbbb45b0) to GradNodeAccumulation (addr: 0x5abef19bca30)
I1107 11:51:35.971820 564858 reducer.cc:680] after forward, then reset count for backward.
I1107 11:51:35.971935 564858 dygraph_functions.cc:60776] Running AD API: gaussian
I1107 11:51:35.971940 564858 dygraph_functions.cc:60796] { Input: []}
I1107 11:51:35.971997 564858 dygraph_functions.cc:68140] Running AD API: subtract
I1107 11:51:35.972007 564858 dygraph_functions.cc:68212] { Input: [
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.972057 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e61a00), and remaining 0
I1107 11:51:35.972070 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.972107 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from SubtractGradNode (addr: 0x5abefbbb6360) to AddGradNode (addr: 0x5abefbbb45b0)
I1107 11:51:35.972132 564858 dygraph_functions.cc:45444] Running AD API: square
I1107 11:51:35.972141 564858 dygraph_functions.cc:45500] { Input: [
( x , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.972183 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e61c00), and remaining 0
I1107 11:51:35.972196 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.982776 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from SquareGradNode (addr: 0x5abef00c9f20) to SubtractGradNode (addr: 0x5abefbbb6360)
I1107 11:51:35.982831 564858 dygraph_functions.cc:63236] Running AD API: mean
I1107 11:51:35.982846 564858 dygraph_functions.cc:63292] { Input: [
( x , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.982913 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e61e00), and remaining 0
I1107 11:51:35.983458 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62000), and remaining 0
I1107 11:51:35.984371 564858 grad_node_info.cc:293] Add Edges for slot: 0, the Edge is from MeanGradNode (addr: 0x5abef00bce50) to SquareGradNode (addr: 0x5abef00c9f20)
I1107 11:51:35.984504 564858 backward.cc:431] Run in Backward
I1107 11:51:35.984510 564858 backward.cc:113] Start Backward
I1107 11:51:35.984520 564858 backward.cc:196] Fill grad input tensor 0 with 1.0
I1107 11:51:35.984555 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.984586 564858 backward.cc:254] Preparing GradNode:MeanGradNode addr:0x5abef00bce50
I1107 11:51:35.984599 564858 nodes.cc:36296] Running AD API GRAD: mean_grad
I1107 11:51:35.984637 564858 nodes.cc:36346] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.984675 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:35.998857 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:35.998872 564858 backward.cc:323] Node: MeanGradNode addr:0x5abef00bce50, Found pending node: SquareGradNode addr: 0x5abef00c9f20
I1107 11:51:35.998881 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:35.998895 564858 backward.cc:254] Preparing GradNode:SquareGradNode addr:0x5abef00c9f20
I1107 11:51:35.998908 564858 nodes.cc:26375] Running AD API GRAD: square_grad
I1107 11:51:35.998932 564858 nodes.cc:26442] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]), ]}
I1107 11:51:35.998967 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.015367 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.015383 564858 backward.cc:323] Node: SquareGradNode addr:0x5abef00c9f20, Found pending node: SubtractGradNode addr: 0x5abefbbb6360
I1107 11:51:36.015389 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.015395 564858 backward.cc:254] Preparing GradNode:SubtractGradNode addr:0x5abefbbb6360
I1107 11:51:36.015400 564858 nodes.cc:39588] Running AD API GRAD: subtract_grad
I1107 11:51:36.015420 564858 nodes.cc:39664] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.015456 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.015462 564858 backward.cc:323] Node: SubtractGradNode addr:0x5abefbbb6360, Found pending node: AddGradNode addr: 0x5abefbbb45b0
I1107 11:51:36.015467 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.015473 564858 backward.cc:254] Preparing GradNode:AddGradNode addr:0x5abefbbb45b0
I1107 11:51:36.015482 564858 nodes.cc:31050] Running AD API GRAD: add_grad
I1107 11:51:36.015494 564858 nodes.cc:31126] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.016105 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.016113 564858 backward.cc:323] Node: AddGradNode addr:0x5abefbbb45b0, Found pending node: MatmulGradNode addr: 0x5abefb3c03d0
I1107 11:51:36.016119 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.016124 564858 backward.cc:323] Node: AddGradNode addr:0x5abefbbb45b0, Found pending node: GradNodeAccumulation addr: 0x5abef19bca30
I1107 11:51:36.016129 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.016132 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef19bca30
I1107 11:51:36.016139 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation
I1107 11:51:36.016141 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abefbbb6ba0
I1107 11:51:36.016148 564858 reducer.cc:768] Tensor[3] [linear_1.b_0@Grad] arrived and triggered disthook
I1107 11:51:36.016155 564858 reducer.cc:784] Tensor[3][linear_1.b_0] is marked ready.
I1107 11:51:36.016161 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation
I1107 11:51:36.016166 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.016170 564858 backward.cc:254] Preparing GradNode:MatmulGradNode addr:0x5abefb3c03d0
I1107 11:51:36.016178 564858 nodes.cc:35691] Running AD API GRAD: matmul_grad
I1107 11:51:36.016192 564858 nodes.cc:35748] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.016247 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e62200), and remaining 0
I1107 11:51:36.018721 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.018735 564858 backward.cc:323] Node: MatmulGradNode addr:0x5abefb3c03d0, Found pending node: AddGradNode addr: 0x5abef694e240
I1107 11:51:36.018741 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.018747 564858 backward.cc:323] Node: MatmulGradNode addr:0x5abefb3c03d0, Found pending node: GradNodeAccumulation addr: 0x5abef19bbba0
I1107 11:51:36.018752 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.018756 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef19bbba0
I1107 11:51:36.018760 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation
I1107 11:51:36.018764 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abefbbb4e20
I1107 11:51:36.018766 564858 reducer.cc:768] Tensor[2] [linear_1.w_0@Grad] arrived and triggered disthook
I1107 11:51:36.018771 564858 reducer.cc:784] Tensor[2][linear_1.w_0] is marked ready.
I1107 11:51:36.018776 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation
I1107 11:51:36.018780 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.018783 564858 backward.cc:254] Preparing GradNode:AddGradNode addr:0x5abef694e240
I1107 11:51:36.018787 564858 nodes.cc:31050] Running AD API GRAD: add_grad
I1107 11:51:36.018801 564858 nodes.cc:31126] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.018858 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.018864 564858 backward.cc:323] Node: AddGradNode addr:0x5abef694e240, Found pending node: MatmulGradNode addr: 0x5abef1b7fd20
I1107 11:51:36.018867 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.018872 564858 backward.cc:323] Node: AddGradNode addr:0x5abef694e240, Found pending node: GradNodeAccumulation addr: 0x5abef1836d60
I1107 11:51:36.018877 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.018880 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef1836d60
I1107 11:51:36.018885 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation
I1107 11:51:36.018889 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abeee981b90
I1107 11:51:36.018893 564858 reducer.cc:768] Tensor[1] [linear_0.b_0@Grad] arrived and triggered disthook
I1107 11:51:36.018898 564858 reducer.cc:784] Tensor[1][linear_0.b_0] is marked ready.
I1107 11:51:36.018903 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation
I1107 11:51:36.018908 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.018911 564858 backward.cc:254] Preparing GradNode:MatmulGradNode addr:0x5abef1b7fd20
I1107 11:51:36.018915 564858 nodes.cc:35691] Running AD API GRAD: matmul_grad
I1107 11:51:36.018926 564858 nodes.cc:35748] { Input: [
( grad_out , [[ Not specified tensor log level ]]),
( x , [[ Not specified tensor log level ]]),
( y , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.018994 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.019001 564858 backward.cc:323] Node: MatmulGradNode addr:0x5abef1b7fd20, Found pending node: GradNodeAccumulation addr: 0x5abef1012670
I1107 11:51:36.019003 564858 backward.cc:364] Sum or Move grad inputs for edge slot: 0, rank: 0
I1107 11:51:36.019008 564858 backward.cc:254] Preparing GradNode:GradNodeAccumulation addr:0x5abef1012670
I1107 11:51:36.019012 564858 accumulation_node.cc:157] Running AD API Grad: GradNodeAccumulation
I1107 11:51:36.019016 564858 accumulation_node.cc:40] Move Tensor ptr: 0x5abefdd6c450
I1107 11:51:36.019021 564858 reducer.cc:768] Tensor[0] [linear_0.w_0@Grad] arrived and triggered disthook
I1107 11:51:36.019024 564858 reducer.cc:784] Tensor[0][linear_0.w_0] is marked ready.
I1107 11:51:36.019032 564858 reducer.cc:933] Group[0] is ready
I1107 11:51:36.019035 564858 reducer.cc:1073] group [0] start fused_allreduce.
I1107 11:51:36.059813 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=121, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.060494 564858 process_group_nccl.cc:238] [ncclAllReduce] sendbuff: 0x736062e62200, recvbuff: 0x736062e62200, count: 121, datatype: float32, redop: SUM, ncclcomm: 0x5abef19d7f50, stream: 0x5abeedaf5960, rank_in_group: 0, nranks: 4, sync_op: 0, use_calc_stream: 0rank_in_group: 0, nranks: 4, gid: 0, backend: NCCL
I1107 11:51:36.060643 564858 reducer.cc:429] Free densecontents 121
I1107 11:51:36.060663 564858 reducer.cc:1064] In the batch, Reducer is finished.
I1107 11:51:36.060668 564858 accumulation_node.cc:193] Finish AD API Grad: GradNodeAccumulation
I1107 11:51:36.060672 564858 backward.cc:294] retain_graph is false, need to clear the TensorWrapper of nodes.
I1107 11:51:36.061123 564858 eager.cc:119] Tensor(learning_rate_0) have not GradNode, add GradNodeAccumulation0x5abf09987670 for it.
I1107 11:51:36.061206 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061219 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061254 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62400), and remaining 0
I1107 11:51:36.061264 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.061373 564858 eager.cc:119] Tensor(linear_0.w_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf09988610 for it.
I1107 11:51:36.061429 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061437 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061457 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e62600), and remaining 0
I1107 11:51:36.061465 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=100, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.061496 564858 eager.cc:119] Tensor(linear_0.w_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf099898a0 for it.
I1107 11:51:36.061529 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061538 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061549 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 512(0x736062e62800), and remaining 0
I1107 11:51:36.061555 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=100, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.061584 564858 eager.cc:119] Tensor(linear_0.w_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf0998aad0 for it.
I1107 11:51:36.061631 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061638 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061722 564858 eager.cc:119] Tensor(linear_0.w_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf0998c030 for it.
I1107 11:51:36.061735 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061740 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061779 564858 eager.cc:119] Tensor(linear_0.b_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf0998caa0 for it.
I1107 11:51:36.061818 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061825 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061838 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62a00), and remaining 0
I1107 11:51:36.061844 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.061872 564858 eager.cc:119] Tensor(linear_0.b_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf0998dea0 for it.
I1107 11:51:36.061904 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061910 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061920 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62c00), and remaining 0
I1107 11:51:36.061926 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.061950 564858 eager.cc:119] Tensor(linear_0.b_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf0998f3a0 for it.
I1107 11:51:36.061964 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.061969 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.061996 564858 eager.cc:119] Tensor(linear_0.b_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf099902e0 for it.
I1107 11:51:36.062009 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062014 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062047 564858 eager.cc:119] Tensor(linear_1.w_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf09991030 for it.
I1107 11:51:36.062083 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062088 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062103 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e62e00), and remaining 0
I1107 11:51:36.062110 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.062135 564858 eager.cc:119] Tensor(linear_1.w_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf099923e0 for it.
I1107 11:51:36.062165 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062172 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062186 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e63000), and remaining 0
I1107 11:51:36.062192 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=10, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.062227 564858 eager.cc:119] Tensor(linear_1.w_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf09993d50 for it.
I1107 11:51:36.062239 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062245 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062264 564858 eager.cc:119] Tensor(linear_1.w_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf09994a60 for it.
I1107 11:51:36.062276 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062280 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062314 564858 eager.cc:119] Tensor(linear_1.b_0_moment1_0) have not GradNode, add GradNodeAccumulation0x5abf099958e0 for it.
I1107 11:51:36.062346 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062352 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062364 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e63200), and remaining 0
I1107 11:51:36.062369 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.062392 564858 eager.cc:119] Tensor(linear_1.b_0_moment2_0) have not GradNode, add GradNodeAccumulation0x5abf09996de0 for it.
I1107 11:51:36.062422 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062427 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062439 564858 auto_growth_best_fit_allocator.cc:122] Not found and reallocate 256(0x736062e63400), and remaining 0
I1107 11:51:36.062445 564858 gpu_launch_config.h:156] Get 1-D launch config: numel=1, vec_size=4, block_size=64, grid_size=1, limit_blocks=2147483647, limit_threads=512
I1107 11:51:36.062469 564858 eager.cc:119] Tensor(linear_1.b_0_beta1_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf09998310 for it.
I1107 11:51:36.062479 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062485 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062510 564858 eager.cc:119] Tensor(linear_1.b_0_beta2_pow_acc_0) have not GradNode, add GradNodeAccumulation0x5abf099992e0 for it.
I1107 11:51:36.062520 564858 dygraphfunctions.cc:59543] Running AD API: full
I1107 11:51:36.062526 564858 dygraph_functions.cc:59583] { Input: [
( output , [[ Not specified tensor log level ]]), ]}
I1107 11:51:36.062577 564858 dygraphfunctions.cc:2692] Running AD API: adam
I1107 11:51:36.062600 564858 dygraph_functions.cc:2777] { Input: [
( param , [[ Not specified tensor log level ]]),
( grad , [[ Not specified tensor log level ]]),
( learning_rate , [[ Not specified tensor log level ]]),
( moment1 , [[ Not specified tensor log level ]]),
( moment2 , [[ Not specified tensor log level ]]),
( beta1_pow , [[ Not specified tensor log level ]]),
( beta2_pow , [[ Not specified tensor log level ]]),
( master_param , [{ UnDefinedTensor }]),
( skip_update , [{ UnDefinedTensor }]), ]}
I1107 11:51:36.062630 564858 multiary.cc:184] dims of Beta1Pow : [1]
I1107 11:51:36.062635 564858 multiary.cc:192] dims of Beta2Pow : [1]
I1107 11:51:36.062642 564858 adam_kernel.cu:187] beta1_pow.numel() : 1beta2_pow.numel() : 1
I1107 11:51:36.062647 564858 adam_kernel.cu:189] param.numel(): 100
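(Not from the original thread, just a suggestion for pinning down the hang: a stack dump of each stuck trainer process would show whether the ranks are blocked inside an NCCL collective, e.g. with py-spy, using the trainer pids from the log above:)

py-spy dump --pid 564858   # repeat for 564860, 564862, 564864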