Closed: jidlin closed this issue 2 years ago.
Does each machine have two GPUs? Could you send the logs?
> Does each machine have two GPUs? Could you send the logs?
INFO 2021-09-18 15:41:49,245 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
W0918 15:41:50.323863 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 3 seconds.
W0918 15:41:53.324177 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 6 seconds.
W0918 15:41:59.324524 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 9 seconds.
W0918 15:42:08.324882 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 12 seconds.
I0918 15:42:20.325516 21635 nccl_context.cc:189] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
W0918 15:42:21.382360 21635 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0918 15:42:21.384966 21635 device_context.cc:372] device: 0, cuDNN Version: 8.0.
Training is start.
> Does each machine have two GPUs? Could you send the logs?
There isn't much information in the logs. After both machines run the script, the NCCL communication appears to be established, but training does not start.
Could you also send the logs of the other 3 cards? log/workerlog.* from both machines.
> Could you also send the logs of the other 3 cards? log/workerlog.* from both machines.
The earlier logs were deleted, and I don't have GPUs at the moment. I ran another test with 2 machines and 2 cards, which also fails. Configuration: distributed_args="--ips 10.130.22.204,10.130.17.157 --selected_gpus 0"
{
"is_distributed": true,
"save_path": "./output",
"train_file": "./data/example/train_filelist",
"valid_file": "./data/example/valid_filelist",
"start_step": 0,
"num_epochs": 5,
"log_steps": 1,
"validation_steps": 1000,
"save_steps": 5000,
"eval_metric": "-loss",
"save_checkpoint": true,
"Model": {
"model": "UnifiedTransformer",
"config_path": "./projects/PLATO-2/12L.json",
"init_checkpoint": "",
"init_pretraining_params": "",
"optimizer": "AdamW",
"learning_rate": 0.001,
"warmup_steps": 4000,
"lr_scheduler": "noam",
"max_training_steps": 2000,
"min_learning_rate": 0,
"weight_decay": 0.01,
"max_grad_norm": 0.1,
"use_recompute": false,
"use_amp": true,
"amp_loss_scaling": 32768.0,
"weight_sharing": true,
"mem_efficient": false,
"use_role": false,
"pre_encoder_cmd": "d",
"preprocess_cmd": "n",
"postprocess_cmd": "da",
"post_cls_cmd": "n",
"cls_bias": true,
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"latent_type_size": 20,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"role_type_size": 32,
"vocab_size": 30001
},
"Generator": {
"min_dec_len": 1,
"max_dec_len": 64,
"decoding_strategy": "topk_sampling",
"temperature": 1.0,
"ignore_unk": true,
"num_samples": null,
"topk": 10,
"topp": 0.9,
"beam_size": 10,
"length_average": true,
"length_penalty": 0.0
},
"Task": {
"task": "DialogGeneration",
"do_generation": true,
"is_cn": false,
"filter_cross_repetition": true,
"nsp_inference_model_path": null,
"ranking_score": "decode_score"
},
"Reader": {
"max_src_len": 128,
"max_tgt_len": 128,
"max_seq_len": 256,
"max_knowledge_len": 0,
"knowledge_position": "post_src",
"knowledge_style": "original",
"truncate_first_turn": false,
"file_format": "filelist",
"data_format": "numerical",
"in_tokens": true,
"batch_size": 16000,
"position_style": "continuous",
"random_seed": 11,
"shuffle_pool_size": 65536,
"sort_pool_size": 0
},
"Tokenizer": {
"tokenizer": "SentencePieceTokenizer",
"vocab_path": "./package/dialog_cn/vocab.txt",
"specials_path": "",
"do_lower_case": false,
"spm_model_file": "./package/dialog_cn/spm.model"
}
}
+==============================================================================+
| |
| DistributedStrategy Overview |
| |
+==============================================================================+
| amp=True <-> amp_configs |
+------------------------------------------------------------------------------+
| init_loss_scaling 32768.0 |
| incr_every_n_steps 1000 |
| decr_every_n_nan_or_inf 2 |
| incr_ratio 2.0 |
| decr_ratio 0.800000011920929 |
| use_dynamic_loss_scaling True |
| use_pure_fp16 False |
| use_fp16_guard True |
+==============================================================================+
| a_sync=True <-> a_sync_configs |
+------------------------------------------------------------------------------+
| k_steps -1 |
| max_merge_var_num 1 |
| send_queue_size 16 |
| independent_recv_thread False |
| min_send_grad_num_before_recv 1 |
| thread_pool_size 1 |
| send_wait_times 1 |
| runtime_split_send_recv False |
| launch_barrier True |
| heter_worker_device_guard cpu |
| lr_decay_steps 10 |
+==============================================================================+
| Environment Flags, Communication Flags |
+------------------------------------------------------------------------------+
| mode 1 |
| elastic False |
| auto False |
| sync_nccl_allreduce True |
| nccl_comm_num 1 |
| use_hierarchical_allreduce False |
| hierarchical_allreduce_inter_nranks 1 |
| sync_batch_norm False |
| fuse_all_reduce_ops True |
| fuse_grad_size_in_MB 32 |
| fuse_grad_size_in_TFLOPS 50.0 |
| cudnn_exhaustive_search False |
| conv_workspace_size_limit 512 |
| cudnn_batchnorm_spatial_persistent False |
| fp16_allreduce False |
| last_comm_group_size_MB 1.0 |
+==============================================================================+
| Build Strategy |
+------------------------------------------------------------------------------+
| enable_sequential_execution False |
| fuse_elewise_add_act_ops False |
| fuse_bn_act_ops False |
| fuse_relu_depthwise_conv False |
| fuse_broadcast_ops False |
| fuse_all_optimizer_ops False |
| enable_inplace False |
| enable_backward_optimizer_op_deps True |
| cache_runtime_context False |
| fuse_bn_add_act_ops True |
| enable_auto_fusion False |
| enable_addto False |
+==============================================================================+
| Execution Strategy |
+------------------------------------------------------------------------------+
| num_threads 4 |
| num_iteration_per_drop_scope 1 |
| num_iteration_per_run 1 |
| use_thread_barrier False |
+==============================================================================+
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/models/unified_transformer.py:143
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:113
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:214
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
"It is recommended to use DistributedStrategy "
W0923 16:54:32.660574 3336 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0923 16:54:32.665084 3336 device_context.cc:372] device: 0, cuDNN Version: 8.0.
I0923 16:54:35.972146 3336 gen_nccl_id_op_helper.cc:176] Server listening on: 10.130.17.157:6070 successful.
Training is start.
=====================
(The second machine prints an identical configuration, DistributedStrategy overview, and warnings as above; only the trailing log lines differ.)
W0923 16:55:08.566850 2449 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0923 16:55:08.570302 2449 device_context.cc:372] device: 0, cuDNN Version: 8.0.
Training is start.
====================
endpoints.log:
PADDLE_TRAINER_ENDPOINTS:
10.130.22.204:6070
10.130.17.157:6070
Thu Sep 23 17:03:06 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00 Driver Version: 418.116.00 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 41C P0 72W / 300W | 2744MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The GPU shows some usage.
> Does each machine have two GPUs? Could you send the logs?
Strange: I received your reply asking about the Paddle version by email, but it doesn't show up on GitHub, so I'll answer here: paddlepaddle-gpu==2.0.1
https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/scripts/train.py#L137 Could you try adding a debug here: print trainer_id and trainers_num? I'll test this over the weekend; as I understand it, this setup should work. Also, could you send the CUDA, cuDNN, and NCCL versions, so we can check whether there is a problem on that side?
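A minimal sketch of that debug, assuming it is added right after the Fleet initialization in knover/scripts/train.py (the variable names here are illustrative, not necessarily Knover's):

```python
from paddle.distributed import fleet

# Print this worker's rank and the world size: for 3 machines x 3 GPUs,
# a correctly wired rank-0 worker should print "0 9".
trainer_id = fleet.worker_index()
trainers_num = fleet.worker_num()
print(trainer_id, trainers_num)
```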
> Could you try adding a debug here: print trainer_id and trainers_num? I'll test this over the weekend; as I understand it, this setup should work. Also, could you send the CUDA, cuDNN, and NCCL versions, so we can check whether there is a problem on that side?
I started a 3-machine, 9-card experiment; the debug output looks normal, but it still doesn't run:
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.22.204:6071', '10.130.22.204:6072', '10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.22.204:6071', '10.130.22.204:6072', '10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.22.204:6071', '10.130.22.204:6072', '10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
W0926 11:04:27.270151 3984 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0926 11:04:27.273236 3984 device_context.cc:372] device: 0, cuDNN Version: 8.0.
Training is start.
0 9
============ CUDA Version: 11.0, cuDNN Version: 8.0. As for NCCL, I don't see any libnccl* files under /usr/local/cuda-11.0/lib64 on the server; does that mean NCCL isn't installed...
The NCCL version checked the following way is 2708:
import torch
torch.cuda.nccl.version()  # 2708
Could you try the CUDA 11 build of Paddle?
python -m pip install paddlepaddle-gpu==2.1.3.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html
Also, you can look for the NCCL .so files under /usr/local/lib/libnccl*.
On my side, Paddle 2.0.1 runs fine in a CUDA 10.1 environment.
> Could you try the CUDA 11 build of Paddle?
> python -m pip install paddlepaddle-gpu==2.1.3.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
> Also, you can look for the NCCL .so files under /usr/local/lib/libnccl*.
The CUDA 11 build of Paddle shows the same behavior; it gets stuck at https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/scripts/train.py#L139 while the data is read normally. The workerlog.0 log is as follows:
+ [[ 1 == 1 ]]
+ job_conf=./projects/PLATO-2/pretrain/12L_train_stage-1.conf
+ source ./projects/PLATO-2/pretrain/12L_train_stage-1.conf
++ job_script=./scripts/distributed/train.sh
++ model=UnifiedTransformer
++ task=DialogGeneration
++ vocab_path=./package/dialog_cn/vocab.txt
++ spm_model_file=./package/dialog_cn/spm.model
++ train_file=./data/example/train_filelist
++ valid_file=./data/example/valid_filelist
++ data_format=numerical
++ file_format=filelist
++ config_path=./projects/PLATO-2/12L.json
++ is_cn=true
++ in_tokens=true
++ batch_size=16000
++ lr=1e-3
++ warmup_steps=4000
++ weight_decay=0.01
++ num_epochs=5
++ distributed_args='--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1,2'
++ log_steps=1
++ validation_steps=1000
++ save_steps=5000
++ log_dir=./log
++ save_path=./output
+ export FLAGS_sync_nccl_allreduce=1
+ FLAGS_sync_nccl_allreduce=1
+ export FLAGS_fuse_parameter_memory_size=64
+ FLAGS_fuse_parameter_memory_size=64
+ mkdir -p ./output
+ [[ ./log != '' ]]
+ mkdir -p ./log
+ distributed_args='--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1,2 --log_dir ./log'
+ fleetrun --ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1,2 --log_dir ./log ./knover/scripts/train.py --is_distributed true --model UnifiedTransformer --task DialogGeneration --vocab_path ./package/dialog_cn/vocab.txt --do_lower_case false --spm_model_file ./package/dialog_cn/spm.model --init_pretraining_params '' --init_checkpoint '' --train_file ./data/example/train_filelist --valid_file ./data/example/valid_filelist --data_format numerical --file_format filelist --config_path ./projects/PLATO-2/12L.json --in_tokens true --batch_size 16000 --learning_rate 1e-3 --warmup_steps 4000 --weight_decay 0.01 --use_amp true --use_recompute false --num_epochs 5 --log_steps 1 --validation_steps 1000 --save_steps 5000 --save_path ./output --random_seed 11
----------- Configuration Arguments -----------
gpus: 0,1,2
heter_worker_num: None
heter_workers:
http_port: None
ips: 10.130.19.203,10.130.17.157
log_dir: ./log
nproc_per_node: None
run_mode: None
server_num: None
servers:
training_script: ./knover/scripts/train.py
training_script_args: ['--is_distributed', 'true', '--model', 'UnifiedTransformer', '--task', 'DialogGeneration', '--vocab_path', './package/dialog_cn/vocab.txt', '--do_lower_case', 'false', '--spm_model_file', './package/dialog_cn/spm.model', '--init_pretraining_params', '', '--init_checkpoint', '', '--train_file', './data/example/train_filelist', '--valid_file', './data/example/valid_filelist', '--data_format', 'numerical', '--file_format', 'filelist', '--config_path', './projects/PLATO-2/12L.json', '--in_tokens', 'true', '--batch_size', '16000', '--learning_rate', '1e-3', '--warmup_steps', '4000', '--weight_decay', '0.01', '--use_amp', 'true', '--use_recompute', 'false', '--num_epochs', '5', '--log_steps', '1', '--validation_steps', '1000', '--save_steps', '5000', '--save_path', './output', '--random_seed', '11']
worker_num: None
workers:
------------------------------------------------
INFO 2021-09-26 21:17:09,515 launch.py:348] Run collective mode. gpu arguments:['--ips'], cuda count:3
launch train in GPU mode!
INFO 2021-09-26 21:17:09,517 launch_utils.py:510] Local start 3 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 10.130.19.203:6070 |
| PADDLE_TRAINERS_NUM 6 |
| PADDLE_TRAINER_ENDPOINTS ... 070,10.130.17.157:6071,10.130.17.157:6072|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,2,0,1,2 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+
INFO 2021-09-26 21:17:09,517 launch_utils.py:514] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./log/endpoints.log, and detail running logs maybe found in ./log/workerlog.0
launch proc_id:2314 idx:0
launch proc_id:2319 idx:1
launch proc_id:2324 idx:2
(identical configuration output as above)
+==============================================================================+
| |
| DistributedStrategy Overview |
| |
+==============================================================================+
| amp=True <-> amp_configs |
+------------------------------------------------------------------------------+
| init_loss_scaling 32768.0 |
| incr_every_n_steps 1000 |
| decr_every_n_nan_or_inf 2 |
| incr_ratio 2.0 |
| decr_ratio 0.800000011920929 |
| use_dynamic_loss_scaling True |
| use_pure_fp16 False |
| use_fp16_guard True |
+==============================================================================+
| a_sync=True <-> a_sync_configs |
+------------------------------------------------------------------------------+
| k_steps -1 |
| max_merge_var_num 1 |
| send_queue_size 16 |
| independent_recv_thread False |
| min_send_grad_num_before_recv 1 |
| thread_pool_size 1 |
| send_wait_times 1 |
| runtime_split_send_recv False |
| launch_barrier True |
| heter_worker_device_guard cpu |
| lr_decay_steps 10 |
| use_ps_gpu 0 |
+==============================================================================+
| Environment Flags, Communication Flags |
+------------------------------------------------------------------------------+
| mode 1 |
| elastic False |
| auto False |
| sync_nccl_allreduce True |
| nccl_comm_num 1 |
| use_hierarchical_allreduce False |
| hierarchical_allreduce_inter_nranks 1 |
| sync_batch_norm False |
| fuse_all_reduce_ops True |
| fuse_grad_size_in_MB 32 |
| fuse_grad_size_in_TFLOPS 50.0 |
| cudnn_exhaustive_search False |
| conv_workspace_size_limit 512 |
| cudnn_batchnorm_spatial_persistent False |
| fp16_allreduce False |
| last_comm_group_size_MB 1.0 |
| find_unused_parameters False |
| without_graph_optimization False |
+==============================================================================+
| Build Strategy |
+------------------------------------------------------------------------------+
| enable_sequential_execution False |
| fuse_elewise_add_act_ops False |
| fuse_bn_act_ops False |
| fuse_relu_depthwise_conv False |
| fuse_broadcast_ops False |
| fuse_all_optimizer_ops False |
| enable_inplace False |
| enable_backward_optimizer_op_deps True |
| cache_runtime_context False |
| fuse_bn_add_act_ops True |
| enable_auto_fusion False |
| enable_addto False |
+==============================================================================+
| Execution Strategy |
+------------------------------------------------------------------------------+
| num_threads 4 |
| num_iteration_per_drop_scope 1 |
| num_iteration_per_run 1 |
| use_thread_barrier False |
+==============================================================================+
/home/tiger/.local/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:322: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/models/unified_transformer.py:143
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/home/tiger/.local/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:322: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:113
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/home/tiger/.local/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:322: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:214
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/home/tiger/.local/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:707: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
"It is recommended to use DistributedStrategy "
W0926 21:17:13.738911 2314 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 11.0
W0926 21:17:13.741997 2314 device_context.cc:422] device: 0, cuDNN Version: 8.0.
W0926 21:17:21.021950 2314 gen_comm_id_helper.cc:120] connect addr=10.130.19.203:6072 failed 1 times with reason: Connection refused retry after 0.5 seconds
W0926 21:17:21.522388 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 1 times with reason: Connection refused retry after 0.5 seconds
W0926 21:17:22.022749 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 2 times with reason: Connection refused retry after 1 seconds
W0926 21:17:23.023144 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 3 times with reason: Connection refused retry after 1.5 seconds
W0926 21:17:24.523504 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 4 times with reason: Connection refused retry after 2 seconds
W0926 21:17:26.523833 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 5 times with reason: Connection refused retry after 2.5 seconds
W0926 21:17:29.024152 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 6 times with reason: Connection refused retry after 3 seconds
W0926 21:17:32.024435 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 7 times with reason: Connection refused retry after 3 seconds
W0926 21:17:35.024766 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 8 times with reason: Connection refused retry after 3 seconds
W0926 21:17:38.025146 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 9 times with reason: Connection refused retry after 3 seconds
W0926 21:17:41.025440 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 10 times with reason: Connection refused retry after 3 seconds
W0926 21:17:44.025799 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 11 times with reason: Connection refused retry after 3 seconds
W0926 21:17:47.026120 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 12 times with reason: Connection refused retry after 3 seconds
W0926 21:17:50.026476 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 13 times with reason: Connection refused retry after 3 seconds
W0926 21:17:53.026827 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 14 times with reason: Connection refused retry after 3 seconds
W0926 21:17:56.027174 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 15 times with reason: Connection refused retry after 3 seconds
W0926 21:17:59.027515 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 16 times with reason: Connection refused retry after 3 seconds
W0926 21:18:02.027846 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 17 times with reason: Connection refused retry after 3 seconds
W0926 21:18:05.028201 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 18 times with reason: Connection refused retry after 3 seconds
W0926 21:18:08.028540 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 19 times with reason: Connection refused retry after 3 seconds
W0926 21:18:11.028873 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 20 times with reason: Connection refused retry after 3 seconds
W0926 21:18:14.029444 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 21 times with reason: Connection refused retry after 3 seconds
W0926 21:18:17.029763 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 22 times with reason: Connection refused retry after 3 seconds
W0926 21:18:20.030058 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 23 times with reason: Connection refused retry after 3 seconds
W0926 21:18:23.030386 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 24 times with reason: Connection refused retry after 3 seconds
W0926 21:18:26.030699 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 25 times with reason: Connection refused retry after 3 seconds
W0926 21:18:29.031008 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 26 times with reason: Connection refused retry after 3 seconds
W0926 21:18:32.031347 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 27 times with reason: Connection refused retry after 3 seconds
W0926 21:18:35.031663 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 28 times with reason: Connection refused retry after 3 seconds
W0926 21:18:38.032002 2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 29 times with reason: Connection refused retry after 3 seconds
Training is start.
0 6
[{'tgt_label': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7068>, 'generation_mask': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7340>, 'pos_ids': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7500>, 'token_ids': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7378>, 'tgt_idx': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7148>, 'type_ids': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7570>}]
In the end it was indeed an NCCL environment problem; I had to set the environment variables for RDMA communication. Thanks for helping to track this down, I'm closing the issue~
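The exact variables aren't listed above; a hypothetical sketch of common NCCL settings for RDMA, applied before NCCL is initialized (the interface and HCA names are placeholders for your cluster):

```python
import os

# Illustrative placeholders only: "eth0" and "mlx5_0" must match your hardware;
# set these before paddle/fleet initializes NCCL communication.
os.environ["NCCL_DEBUG"] = "INFO"          # log NCCL's transport selection
os.environ["NCCL_IB_DISABLE"] = "0"        # allow the InfiniBand/RoCE (RDMA) transport
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # NIC used for bootstrap sockets
os.environ["NCCL_IB_HCA"] = "mlx5_0"       # RDMA device NCCL should use
```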
Multi-machine multi-GPU training is working now. A separate question: when training with 3 machines, does every machine need to hold the full dataset, or do I have to manually split the dataset into 3 parts?
Also, does Paddle support reading data from HDFS?
> In the end it was indeed an NCCL environment problem; I had to set the environment variables for RDMA communication.
You can set an environment variable to get NCCL debug information: export NCCL_DEBUG=INFO
> When training with 3 machines, does every machine need to hold the full dataset, or do I have to manually split the dataset into 3 parts?
Put the full dataset on every machine; Knover will shard the data automatically.
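A minimal sketch of the usual sharding pattern behind this (illustrative; Knover's actual implementation may differ in detail):

```python
def shard_examples(examples, trainer_id, trainers_num):
    """Yield this worker's disjoint slice of the full example stream.

    Every machine reads the same full dataset; worker k keeps every
    trainers_num-th example offset by k, so no manual split is needed.
    """
    for idx, example in enumerate(examples):
        if idx % trainers_num == trainer_id:
            yield example
```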
> Also, does Paddle support reading data from HDFS?
I'm not sure whether you want to access HDFS data directly through a Python library (configuring hdfs_name + hdfs_ugi + hdfs_path)? That part isn't supported yet.
Right, TF supports reading datasets from HDFS. With a large dataset, putting the full data on every machine feels cumbersome.
I use an internal tool that mounts the data to the local filesystem, so HDFS support was never added. If you need it, a simple first step is to change the file-reading code in the function below: https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/data/dialog_reader.py#L346 For a Python library, see: https://hdfscli.readthedocs.io/en/latest/quickstart.html#python-bindings
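A rough sketch of such a change, assuming the hdfs PyPI package (hdfscli) and a WebHDFS endpoint; the namenode address and file path are placeholders:

```python
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint; substitute your cluster's namenode.
client = InsecureClient("http://namenode:9870")

# Stream a remote text file line by line instead of opening a local file.
with client.read("/data/train/part-00000", encoding="utf-8",
                 delimiter="\n") as reader:
    for line in reader:
        pass  # feed each line into the reader's existing parsing logic
```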
Following the Paddle distributed-training tutorial, the config was set to distributed_args="--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1". The two machines can establish communication but training does not start, and each GPU card shows about 2 GB of memory in use. The following configuration trains normally: distributed_args="--ips 10.130.19.203 --selected_gpus 0,1"