PaddlePaddle / Knover

Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Are there any special settings needed for distributed training? Single-machine multi-GPU runs fine, but multi-machine multi-GPU establishes communication and then neither reports errors nor starts training #82

Closed: jidlin closed this issue 2 years ago

jidlin commented 2 years ago

Following the Paddle distributed training tutorial, the config is set to distributed_args="--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1". The two machines can establish communication, but training never starts, and each GPU card shows about 2 GB of memory usage. With the following config, training works normally: distributed_args="--ips 10.130.19.203 --selected_gpus 0,1"

sserdoubleh commented 2 years ago

Does each machine have two GPUs? Could you share the log?

jidlin commented 2 years ago

Does each machine have two GPUs? Could you share the log?

INFO 2021-09-18 15:41:49,245 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
W0918 15:41:50.323863 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 3 seconds.
W0918 15:41:53.324177 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 6 seconds.
W0918 15:41:59.324524 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 9 seconds.
W0918 15:42:08.324882 21635 nccl_context.cc:142] Socket connect worker 10.130.17.157:6070 failed, try again after 12 seconds.
I0918 15:42:20.325516 21635 nccl_context.cc:189] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
W0918 15:42:21.382360 21635 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0918 15:42:21.384966 21635 device_context.cc:372] device: 0, cuDNN Version: 8.0.
Training is start.

jidlin commented 2 years ago

Does each machine have two GPUs? Could you share the log?

The logs don't contain much information. After running the script on both machines, the NCCL communication seems to be established, but training never starts.

sserdoubleh commented 2 years ago

Could you also send the logs of the other 3 GPUs? The log/workerlog.* files from both machines.

jidlin commented 2 years ago

Could you also send the logs of the other 3 GPUs? The log/workerlog.* files from both machines.

The earlier logs have been deleted and I don't have the GPUs right now. I also tested with 2 machines and 2 GPUs (one per machine) and it still fails. Config: distributed_args="--ips 10.130.22.204,10.130.17.157 --selected_gpus 0"

{
  "is_distributed": true,
  "save_path": "./output",
  "train_file": "./data/example/train_filelist",
  "valid_file": "./data/example/valid_filelist",
  "start_step": 0,
  "num_epochs": 5,
  "log_steps": 1,
  "validation_steps": 1000,
  "save_steps": 5000,
  "eval_metric": "-loss",
  "save_checkpoint": true,
  "Model": {
    "model": "UnifiedTransformer",
    "config_path": "./projects/PLATO-2/12L.json",
    "init_checkpoint": "",
    "init_pretraining_params": "",
    "optimizer": "AdamW",
    "learning_rate": 0.001,
    "warmup_steps": 4000,
    "lr_scheduler": "noam",
    "max_training_steps": 2000,
    "min_learning_rate": 0,
    "weight_decay": 0.01,
    "max_grad_norm": 0.1,
    "use_recompute": false,
    "use_amp": true,
    "amp_loss_scaling": 32768.0,
    "weight_sharing": true,
    "mem_efficient": false,
    "use_role": false,
    "pre_encoder_cmd": "d",
    "preprocess_cmd": "n",
    "postprocess_cmd": "da",
    "post_cls_cmd": "n",
    "cls_bias": true,
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "latent_type_size": 20,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "role_type_size": 32,
    "vocab_size": 30001
  },
  "Generator": {
    "min_dec_len": 1,
    "max_dec_len": 64,
    "decoding_strategy": "topk_sampling",
    "temperature": 1.0,
    "ignore_unk": true,
    "num_samples": null,
    "topk": 10,
    "topp": 0.9,
    "beam_size": 10,
    "length_average": true,
    "length_penalty": 0.0
  },
  "Task": {
    "task": "DialogGeneration",
    "do_generation": true,
    "is_cn": false,
    "filter_cross_repetition": true,
    "nsp_inference_model_path": null,
    "ranking_score": "decode_score"
  },
  "Reader": {
    "max_src_len": 128,
    "max_tgt_len": 128,
    "max_seq_len": 256,
    "max_knowledge_len": 0,
    "knowledge_position": "post_src",
    "knowledge_style": "original",
    "truncate_first_turn": false,
    "file_format": "filelist",
    "data_format": "numerical",
    "in_tokens": true,
    "batch_size": 16000,
    "position_style": "continuous",
    "random_seed": 11,
    "shuffle_pool_size": 65536,
    "sort_pool_size": 0
  },
  "Tokenizer": {
    "tokenizer": "SentencePieceTokenizer",
    "vocab_path": "./package/dialog_cn/vocab.txt",
    "specials_path": "",
    "do_lower_case": false,
    "spm_model_file": "./package/dialog_cn/spm.model"
  }
}
    +==============================================================================+
    |                                                                              |
    |                         DistributedStrategy Overview                         |
    |                                                                              |
    +==============================================================================+
    |                           amp=True <-> amp_configs                           |
    +------------------------------------------------------------------------------+
    |                     init_loss_scaling                 32768.0                |
    |                    incr_every_n_steps                   1000                 |
    |               decr_every_n_nan_or_inf                    2                   |
    |                            incr_ratio                   2.0                  |
    |                            decr_ratio            0.800000011920929           |
    |              use_dynamic_loss_scaling                   True                 |
    |                         use_pure_fp16                  False                 |
    |                        use_fp16_guard                   True                 |
    +==============================================================================+
    |                        a_sync=True <-> a_sync_configs                        |
    +------------------------------------------------------------------------------+
    |                               k_steps                    -1                  |
    |                     max_merge_var_num                    1                   |
    |                       send_queue_size                    16                  |
    |               independent_recv_thread                  False                 |
    |         min_send_grad_num_before_recv                    1                   |
    |                      thread_pool_size                    1                   |
    |                       send_wait_times                    1                   |
    |               runtime_split_send_recv                  False                 |
    |                        launch_barrier                   True                 |
    |             heter_worker_device_guard                   cpu                  |
    |                        lr_decay_steps                    10                  |
    +==============================================================================+
    |                    Environment Flags, Communication Flags                    |
    +------------------------------------------------------------------------------+
    |                                  mode                    1                   |
    |                               elastic                  False                 |
    |                                  auto                  False                 |
    |                   sync_nccl_allreduce                   True                 |
    |                         nccl_comm_num                    1                   |
    |            use_hierarchical_allreduce                  False                 |
    |   hierarchical_allreduce_inter_nranks                    1                   |
    |                       sync_batch_norm                  False                 |
    |                   fuse_all_reduce_ops                   True                 |
    |                  fuse_grad_size_in_MB                    32                  |
    |              fuse_grad_size_in_TFLOPS                   50.0                 |
    |               cudnn_exhaustive_search                  False                 |
    |             conv_workspace_size_limit                   512                  |
    |    cudnn_batchnorm_spatial_persistent                  False                 |
    |                        fp16_allreduce                  False                 |
    |               last_comm_group_size_MB                   1.0                  |
    +==============================================================================+
    |                                Build Strategy                                |
    +------------------------------------------------------------------------------+
    |           enable_sequential_execution                  False                 |
    |              fuse_elewise_add_act_ops                  False                 |
    |                       fuse_bn_act_ops                  False                 |
    |              fuse_relu_depthwise_conv                  False                 |
    |                    fuse_broadcast_ops                  False                 |
    |                fuse_all_optimizer_ops                  False                 |
    |                        enable_inplace                  False                 |
    |     enable_backward_optimizer_op_deps                   True                 |
    |                 cache_runtime_context                  False                 |
    |                   fuse_bn_add_act_ops                   True                 |
    |                    enable_auto_fusion                  False                 |
    |                          enable_addto                  False                 |
    +==============================================================================+
    |                              Execution Strategy                              |
    +------------------------------------------------------------------------------+
    |                           num_threads                    4                   |
    |          num_iteration_per_drop_scope                    1                   |
    |                 num_iteration_per_run                    1                   |
    |                    use_thread_barrier                  False                 |
    +==============================================================================+

/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/models/unified_transformer.py:143
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:113
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:214
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
W0923 16:54:32.660574  3336 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0923 16:54:32.665084  3336 device_context.cc:372] device: 0, cuDNN Version: 8.0.
I0923 16:54:35.972146  3336 gen_nccl_id_op_helper.cc:176] Server listening on: 10.130.17.157:6070 successful.
Training is start.

=====================

{
  "is_distributed": true,
  "save_path": "./output",
  "train_file": "./data/example/train_filelist",
  "valid_file": "./data/example/valid_filelist",
  "start_step": 0,
  "num_epochs": 5,
  "log_steps": 1,
  "validation_steps": 1000,
  "save_steps": 5000,
  "eval_metric": "-loss",
  "save_checkpoint": true,
  "Model": {
    "model": "UnifiedTransformer",
    "config_path": "./projects/PLATO-2/12L.json",
    "init_checkpoint": "",
    "init_pretraining_params": "",
    "optimizer": "AdamW",
    "learning_rate": 0.001,
    "warmup_steps": 4000,
    "lr_scheduler": "noam",
    "max_training_steps": 2000,
    "min_learning_rate": 0,
    "weight_decay": 0.01,
    "max_grad_norm": 0.1,
    "use_recompute": false,
    "use_amp": true,
    "amp_loss_scaling": 32768.0,
    "weight_sharing": true,
    "mem_efficient": false,
    "use_role": false,
    "pre_encoder_cmd": "d",
    "preprocess_cmd": "n",
    "postprocess_cmd": "da",
    "post_cls_cmd": "n",
    "cls_bias": true,
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "latent_type_size": 20,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "role_type_size": 32,
    "vocab_size": 30001
  },
  "Generator": {
    "min_dec_len": 1,
    "max_dec_len": 64,
    "decoding_strategy": "topk_sampling",
    "temperature": 1.0,
    "ignore_unk": true,
    "num_samples": null,
    "topk": 10,
    "topp": 0.9,
    "beam_size": 10,
    "length_average": true,
    "length_penalty": 0.0
  },
  "Task": {
    "task": "DialogGeneration",
    "do_generation": true,
    "is_cn": false,
    "filter_cross_repetition": true,
    "nsp_inference_model_path": null,
    "ranking_score": "decode_score"
  },
  "Reader": {
    "max_src_len": 128,
    "max_tgt_len": 128,
    "max_seq_len": 256,
    "max_knowledge_len": 0,
    "knowledge_position": "post_src",
    "knowledge_style": "original",
    "truncate_first_turn": false,
    "file_format": "filelist",
    "data_format": "numerical",
    "in_tokens": true,
    "batch_size": 16000,
    "position_style": "continuous",
    "random_seed": 11,
    "shuffle_pool_size": 65536,
    "sort_pool_size": 0
  },
  "Tokenizer": {
    "tokenizer": "SentencePieceTokenizer",
    "vocab_path": "./package/dialog_cn/vocab.txt",
    "specials_path": "",
    "do_lower_case": false,
    "spm_model_file": "./package/dialog_cn/spm.model"
  }
}
    +==============================================================================+
    |                                                                              |
    |                         DistributedStrategy Overview                         |
    |                                                                              |
    +==============================================================================+
    |                           amp=True <-> amp_configs                           |
    +------------------------------------------------------------------------------+
    |                     init_loss_scaling                 32768.0                |
    |                    incr_every_n_steps                   1000                 |
    |               decr_every_n_nan_or_inf                    2                   |
    |                            incr_ratio                   2.0                  |
    |                            decr_ratio            0.800000011920929           |
    |              use_dynamic_loss_scaling                   True                 |
    |                         use_pure_fp16                  False                 |
    |                        use_fp16_guard                   True                 |
    +==============================================================================+
    |                        a_sync=True <-> a_sync_configs                        |
    +------------------------------------------------------------------------------+
    |                               k_steps                    -1                  |
    |                     max_merge_var_num                    1                   |
    |                       send_queue_size                    16                  |
    |               independent_recv_thread                  False                 |
    |         min_send_grad_num_before_recv                    1                   |
    |                      thread_pool_size                    1                   |
    |                       send_wait_times                    1                   |
    |               runtime_split_send_recv                  False                 |
    |                        launch_barrier                   True                 |
    |             heter_worker_device_guard                   cpu                  |
    |                        lr_decay_steps                    10                  |
    +==============================================================================+
    |                    Environment Flags, Communication Flags                    |
    +------------------------------------------------------------------------------+
    |                                  mode                    1                   |
    |                               elastic                  False                 |
    |                                  auto                  False                 |
    |                   sync_nccl_allreduce                   True                 |
    |                         nccl_comm_num                    1                   |
    |            use_hierarchical_allreduce                  False                 |
    |   hierarchical_allreduce_inter_nranks                    1                   |
    |                       sync_batch_norm                  False                 |
    |                   fuse_all_reduce_ops                   True                 |
    |                  fuse_grad_size_in_MB                    32                  |
    |              fuse_grad_size_in_TFLOPS                   50.0                 |
    |               cudnn_exhaustive_search                  False                 |
    |             conv_workspace_size_limit                   512                  |
    |    cudnn_batchnorm_spatial_persistent                  False                 |
    |                        fp16_allreduce                  False                 |
    |               last_comm_group_size_MB                   1.0                  |
    +==============================================================================+
    |                                Build Strategy                                |
    +------------------------------------------------------------------------------+
    |           enable_sequential_execution                  False                 |
    |              fuse_elewise_add_act_ops                  False                 |
    |                       fuse_bn_act_ops                  False                 |
    |              fuse_relu_depthwise_conv                  False                 |
    |                    fuse_broadcast_ops                  False                 |
    |                fuse_all_optimizer_ops                  False                 |
    |                        enable_inplace                  False                 |
    |     enable_backward_optimizer_op_deps                   True                 |
    |                 cache_runtime_context                  False                 |
    |                   fuse_bn_add_act_ops                   True                 |
    |                    enable_auto_fusion                  False                 |
    |                          enable_addto                  False                 |
    +==============================================================================+
    |                              Execution Strategy                              |
    +------------------------------------------------------------------------------+
    |                           num_threads                    4                   |
    |          num_iteration_per_drop_scope                    1                   |
    |                 num_iteration_per_run                    1                   |
    |                    use_thread_barrier                  False                 |
    +==============================================================================+

/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/models/unified_transformer.py:143
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:113
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:214
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
W0923 16:55:08.566850  2449 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0923 16:55:08.570302  2449 device_context.cc:372] device: 0, cuDNN Version: 8.0.
Training is start.

====================
endpoints.log:
PADDLE_TRAINER_ENDPOINTS: 
10.130.22.204:6070
10.130.17.157:6070
jidlin commented 2 years ago
Thu Sep 23 17:03:06 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   41C    P0    72W / 300W |   2744MiB / 32480MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The GPU does show some memory usage and utilization.

jidlin commented 2 years ago

Does each machine have two GPUs? Could you share the log?

Strange, I received your reply asking about the Paddle version by email, but it doesn't show up on GitHub, so I'll answer here: paddlepaddle-gpu==2.0.1

sserdoubleh commented 2 years ago

https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/scripts/train.py#L137 Could you try adding a debug print of trainer_id and trainers_num here? I'll test this over the weekend; as far as I understand, this setup should work. Also, could you share the CUDA, cuDNN, and NCCL versions, so we can check whether the problem is there?
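For reference, a minimal version of that debug print, assuming fleet.init(is_collective=True) has already been called (as the fleet warnings in the logs above suggest), could look like this; the exact placement around train.py#L137 is up to the reader:

```python
# Minimal debug sketch: print this worker's rank and the total number of
# trainer processes as seen by the paddle.distributed.fleet runtime.
from paddle.distributed import fleet

trainer_id = fleet.worker_index()    # rank of this trainer process
trainers_num = fleet.worker_num()    # total number of trainer processes
print(trainer_id, trainers_num)
```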

jidlin commented 2 years ago

https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/scripts/train.py#L137

Could you try adding a debug print of trainer_id and trainers_num here? I'll test this over the weekend; as far as I understand, this setup should work. Also, could you share the CUDA, cuDNN, and NCCL versions, so we can check whether the problem is there?

I ran a 3-machine, 9-GPU experiment. The debug output looks normal, but it still doesn't train.

server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.22.204:6071', '10.130.22.204:6072', '10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.22.204:6071', '10.130.22.204:6072', '10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.22.204:6071', '10.130.22.204:6072', '10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.19.203:6070', '10.130.19.203:6071', '10.130.19.203:6072', '10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.130.24.72:6070', '10.130.24.72:6071', '10.130.24.72:6072']
W0926 11:04:27.270151  3984 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.2
W0926 11:04:27.273236  3984 device_context.cc:372] device: 0, cuDNN Version: 8.0.
Training is start.
0 9

============ CUDA version: 11.0, cuDNN version: 8.0. As for NCCL, I don't see any libnccl* files under /usr/local/cuda-11.0/lib64 on the server; does that mean NCCL isn't installed...
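As a side note (not from this thread), a quick way to see which CUDA and cuDNN versions the installed Paddle wheel was built against, and whether Paddle can use the GPUs at all, is Paddle's own version and run_check utilities:

```python
# Quick environment sanity check with standard Paddle 2.x utilities.
import paddle

print(paddle.__version__)       # e.g. 2.0.1
print(paddle.version.cuda())    # CUDA version the wheel was compiled against
print(paddle.version.cudnn())   # cuDNN version the wheel was compiled against
paddle.utils.run_check()        # verifies the install and that the GPU(s) are usable
```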

jidlin commented 2 years ago

https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/scripts/train.py#L137

Could you try adding a debug print of trainer_id and trainers_num here? I'll test this over the weekend; as far as I understand, this setup should work. Also, could you share the CUDA, cuDNN, and NCCL versions, so we can check whether the problem is there?


============ CUDA version: 11.0, cuDNN version: 8.0. As for NCCL, I don't see any libnccl* files under /usr/local/cuda-11.0/lib64 on the server, so it may not be installed there... The NCCL version reported by the check below is 2708:

>>> import torch
>>> torch.cuda.nccl.version()
2708

sserdoubleh commented 2 years ago

Could you try the CUDA 11 build of Paddle?

python -m pip install paddlepaddle-gpu==2.1.3.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html

Also, you can try looking for the NCCL .so files under /usr/local/lib/libnccl*.

In my tests, 2.0.1 runs fine in a CUDA 10.1 environment.

jidlin commented 2 years ago

Could you try the CUDA 11 build of Paddle?

python -m pip install paddlepaddle-gpu==2.1.3.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html

Also, you can try looking for the NCCL .so files under /usr/local/lib/libnccl*.

The CUDA 11 build of Paddle behaves the same; it gets stuck at https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/scripts/train.py#L139 while the data is read normally. workerlog.0 is as follows:

+ [[ 1 == 1 ]]
+ job_conf=./projects/PLATO-2/pretrain/12L_train_stage-1.conf
+ source ./projects/PLATO-2/pretrain/12L_train_stage-1.conf
++ job_script=./scripts/distributed/train.sh
++ model=UnifiedTransformer
++ task=DialogGeneration
++ vocab_path=./package/dialog_cn/vocab.txt
++ spm_model_file=./package/dialog_cn/spm.model
++ train_file=./data/example/train_filelist
++ valid_file=./data/example/valid_filelist
++ data_format=numerical
++ file_format=filelist
++ config_path=./projects/PLATO-2/12L.json
++ is_cn=true
++ in_tokens=true
++ batch_size=16000
++ lr=1e-3
++ warmup_steps=4000
++ weight_decay=0.01
++ num_epochs=5
++ distributed_args='--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1,2'
++ log_steps=1
++ validation_steps=1000
++ save_steps=5000
++ log_dir=./log
++ save_path=./output
+ export FLAGS_sync_nccl_allreduce=1
+ FLAGS_sync_nccl_allreduce=1
+ export FLAGS_fuse_parameter_memory_size=64
+ FLAGS_fuse_parameter_memory_size=64
+ mkdir -p ./output
+ [[ ./log != '' ]]
+ mkdir -p ./log
+ distributed_args='--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1,2 --log_dir ./log'
+ fleetrun --ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1,2 --log_dir ./log ./knover/scripts/train.py --is_distributed true --model UnifiedTransformer --task DialogGeneration --vocab_path ./package/dialog_cn/vocab.txt --do_lower_case false --spm_model_file ./package/dialog_cn/spm.model --init_pretraining_params '' --init_checkpoint '' --train_file ./data/example/train_filelist --valid_file ./data/example/valid_filelist --data_format numerical --file_format filelist --config_path ./projects/PLATO-2/12L.json --in_tokens true --batch_size 16000 --learning_rate 1e-3 --warmup_steps 4000 --weight_decay 0.01 --use_amp true --use_recompute false --num_epochs 5 --log_steps 1 --validation_steps 1000 --save_steps 5000 --save_path ./output --random_seed 11
-----------  Configuration Arguments -----------
gpus: 0,1,2
heter_worker_num: None
heter_workers: 
http_port: None
ips: 10.130.19.203,10.130.17.157
log_dir: ./log
nproc_per_node: None
run_mode: None
server_num: None
servers: 
training_script: ./knover/scripts/train.py
training_script_args: ['--is_distributed', 'true', '--model', 'UnifiedTransformer', '--task', 'DialogGeneration', '--vocab_path', './package/dialog_cn/vocab.txt', '--do_lower_case', 'false', '--spm_model_file', './package/dialog_cn/spm.model', '--init_pretraining_params', '', '--init_checkpoint', '', '--train_file', './data/example/train_filelist', '--valid_file', './data/example/valid_filelist', '--data_format', 'numerical', '--file_format', 'filelist', '--config_path', './projects/PLATO-2/12L.json', '--in_tokens', 'true', '--batch_size', '16000', '--learning_rate', '1e-3', '--warmup_steps', '4000', '--weight_decay', '0.01', '--use_amp', 'true', '--use_recompute', 'false', '--num_epochs', '5', '--log_steps', '1', '--validation_steps', '1000', '--save_steps', '5000', '--save_path', './output', '--random_seed', '11']
worker_num: None
workers: 
------------------------------------------------
INFO 2021-09-26 21:17:09,515 launch.py:348] Run collective mode. gpu arguments:['--ips'], cuda count:3
launch train in GPU mode!
INFO 2021-09-26 21:17:09,517 launch_utils.py:510] Local start 3 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT               10.130.19.203:6070              |
    |                     PADDLE_TRAINERS_NUM                        6                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 070,10.130.17.157:6071,10.130.17.157:6072|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                   0,1,2,0,1,2                 |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+

INFO 2021-09-26 21:17:09,517 launch_utils.py:514] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./log/endpoints.log, and detail running logs maybe found in ./log/workerlog.0
launch proc_id:2314 idx:0
launch proc_id:2319 idx:1
launch proc_id:2324 idx:2
{
  "is_distributed": true,
  "save_path": "./output",
  "train_file": "./data/example/train_filelist",
  "valid_file": "./data/example/valid_filelist",
  "start_step": 0,
  "num_epochs": 5,
  "log_steps": 1,
  "validation_steps": 1000,
  "save_steps": 5000,
  "eval_metric": "-loss",
  "save_checkpoint": true,
  "Model": {
    "model": "UnifiedTransformer",
    "config_path": "./projects/PLATO-2/12L.json",
    "init_checkpoint": "",
    "init_pretraining_params": "",
    "optimizer": "AdamW",
    "learning_rate": 0.001,
    "warmup_steps": 4000,
    "lr_scheduler": "noam",
    "max_training_steps": 2000,
    "min_learning_rate": 0,
    "weight_decay": 0.01,
    "max_grad_norm": 0.1,
    "use_recompute": false,
    "use_amp": true,
    "amp_loss_scaling": 32768.0,
    "weight_sharing": true,
    "mem_efficient": false,
    "use_role": false,
    "pre_encoder_cmd": "d",
    "preprocess_cmd": "n",
    "postprocess_cmd": "da",
    "post_cls_cmd": "n",
    "cls_bias": true,
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 512,
    "latent_type_size": 20,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "role_type_size": 32,
    "vocab_size": 30001
  },
  "Generator": {
    "min_dec_len": 1,
    "max_dec_len": 64,
    "decoding_strategy": "topk_sampling",
    "temperature": 1.0,
    "ignore_unk": true,
    "num_samples": null,
    "topk": 10,
    "topp": 0.9,
    "beam_size": 10,
    "length_average": true,
    "length_penalty": 0.0
  },
  "Task": {
    "task": "DialogGeneration",
    "do_generation": true,
    "is_cn": false,
    "filter_cross_repetition": true,
    "nsp_inference_model_path": null,
    "ranking_score": "decode_score"
  },
  "Reader": {
    "max_src_len": 128,
    "max_tgt_len": 128,
    "max_seq_len": 256,
    "max_knowledge_len": 0,
    "knowledge_position": "post_src",
    "knowledge_style": "original",
    "truncate_first_turn": false,
    "file_format": "filelist",
    "data_format": "numerical",
    "in_tokens": true,
    "batch_size": 16000,
    "position_style": "continuous",
    "random_seed": 11,
    "shuffle_pool_size": 65536,
    "sort_pool_size": 0
  },
  "Tokenizer": {
    "tokenizer": "SentencePieceTokenizer",
    "vocab_path": "./package/dialog_cn/vocab.txt",
    "specials_path": "",
    "do_lower_case": false,
    "spm_model_file": "./package/dialog_cn/spm.model"
  }
}
    +==============================================================================+
    |                                                                              |
    |                         DistributedStrategy Overview                         |
    |                                                                              |
    +==============================================================================+
    |                           amp=True <-> amp_configs                           |
    +------------------------------------------------------------------------------+
    |                     init_loss_scaling                 32768.0                |
    |                    incr_every_n_steps                   1000                 |
    |               decr_every_n_nan_or_inf                    2                   |
    |                            incr_ratio                   2.0                  |
    |                            decr_ratio            0.800000011920929           |
    |              use_dynamic_loss_scaling                   True                 |
    |                         use_pure_fp16                  False                 |
    |                        use_fp16_guard                   True                 |
    +==============================================================================+
    |                        a_sync=True <-> a_sync_configs                        |
    +------------------------------------------------------------------------------+
    |                               k_steps                    -1                  |
    |                     max_merge_var_num                    1                   |
    |                       send_queue_size                    16                  |
    |               independent_recv_thread                  False                 |
    |         min_send_grad_num_before_recv                    1                   |
    |                      thread_pool_size                    1                   |
    |                       send_wait_times                    1                   |
    |               runtime_split_send_recv                  False                 |
    |                        launch_barrier                   True                 |
    |             heter_worker_device_guard                   cpu                  |
    |                        lr_decay_steps                    10                  |
    |                            use_ps_gpu                    0                   |
    +==============================================================================+
    |                    Environment Flags, Communication Flags                    |
    +------------------------------------------------------------------------------+
    |                                  mode                    1                   |
    |                               elastic                  False                 |
    |                                  auto                  False                 |
    |                   sync_nccl_allreduce                   True                 |
    |                         nccl_comm_num                    1                   |
    |            use_hierarchical_allreduce                  False                 |
    |   hierarchical_allreduce_inter_nranks                    1                   |
    |                       sync_batch_norm                  False                 |
    |                   fuse_all_reduce_ops                   True                 |
    |                  fuse_grad_size_in_MB                    32                  |
    |              fuse_grad_size_in_TFLOPS                   50.0                 |
    |               cudnn_exhaustive_search                  False                 |
    |             conv_workspace_size_limit                   512                  |
    |    cudnn_batchnorm_spatial_persistent                  False                 |
    |                        fp16_allreduce                  False                 |
    |               last_comm_group_size_MB                   1.0                  |
    |                find_unused_parameters                  False                 |
    |            without_graph_optimization                  False                 |
    +==============================================================================+
    |                                Build Strategy                                |
    +------------------------------------------------------------------------------+
    |           enable_sequential_execution                  False                 |
    |              fuse_elewise_add_act_ops                  False                 |
    |                       fuse_bn_act_ops                  False                 |
    |              fuse_relu_depthwise_conv                  False                 |
    |                    fuse_broadcast_ops                  False                 |
    |                fuse_all_optimizer_ops                  False                 |
    |                        enable_inplace                  False                 |
    |     enable_backward_optimizer_op_deps                   True                 |
    |                 cache_runtime_context                  False                 |
    |                   fuse_bn_add_act_ops                   True                 |
    |                    enable_auto_fusion                  False                 |
    |                          enable_addto                  False                 |
    +==============================================================================+
    |                              Execution Strategy                              |
    +------------------------------------------------------------------------------+
    |                           num_threads                    4                   |
    |          num_iteration_per_drop_scope                    1                   |
    |                 num_iteration_per_run                    1                   |
    |                    use_thread_barrier                  False                 |
    +==============================================================================+

/home/tiger/.local/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:322: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/models/unified_transformer.py:143
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/home/tiger/.local/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:322: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:113
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/home/tiger/.local/lib/python3.7/site-packages/paddle/fluid/layers/math_op_patch.py:322: UserWarning: /opt/tiger/mlx_workspace/Knover-develop/knover/modules/transformer_block.py:214
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
  op_type, op_type, EXPRESSION_MAP[method_name]))
/home/tiger/.local/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:707: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
W0926 21:17:13.738911  2314 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 11.0
W0926 21:17:13.741997  2314 device_context.cc:422] device: 0, cuDNN Version: 8.0.
W0926 21:17:21.021950  2314 gen_comm_id_helper.cc:120] connect addr=10.130.19.203:6072 failed 1 times with reason: Connection refused retry after 0.5 seconds
W0926 21:17:21.522388  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 1 times with reason: Connection refused retry after 0.5 seconds
W0926 21:17:22.022749  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 2 times with reason: Connection refused retry after 1 seconds
W0926 21:17:23.023144  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 3 times with reason: Connection refused retry after 1.5 seconds
W0926 21:17:24.523504  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 4 times with reason: Connection refused retry after 2 seconds
W0926 21:17:26.523833  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 5 times with reason: Connection refused retry after 2.5 seconds
W0926 21:17:29.024152  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 6 times with reason: Connection refused retry after 3 seconds
W0926 21:17:32.024435  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 7 times with reason: Connection refused retry after 3 seconds
W0926 21:17:35.024766  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 8 times with reason: Connection refused retry after 3 seconds
W0926 21:17:38.025146  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 9 times with reason: Connection refused retry after 3 seconds
W0926 21:17:41.025440  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 10 times with reason: Connection refused retry after 3 seconds
W0926 21:17:44.025799  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 11 times with reason: Connection refused retry after 3 seconds
W0926 21:17:47.026120  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 12 times with reason: Connection refused retry after 3 seconds
W0926 21:17:50.026476  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 13 times with reason: Connection refused retry after 3 seconds
W0926 21:17:53.026827  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 14 times with reason: Connection refused retry after 3 seconds
W0926 21:17:56.027174  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 15 times with reason: Connection refused retry after 3 seconds
W0926 21:17:59.027515  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 16 times with reason: Connection refused retry after 3 seconds
W0926 21:18:02.027846  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 17 times with reason: Connection refused retry after 3 seconds
W0926 21:18:05.028201  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 18 times with reason: Connection refused retry after 3 seconds
W0926 21:18:08.028540  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 19 times with reason: Connection refused retry after 3 seconds
W0926 21:18:11.028873  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 20 times with reason: Connection refused retry after 3 seconds
W0926 21:18:14.029444  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 21 times with reason: Connection refused retry after 3 seconds
W0926 21:18:17.029763  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 22 times with reason: Connection refused retry after 3 seconds
W0926 21:18:20.030058  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 23 times with reason: Connection refused retry after 3 seconds
W0926 21:18:23.030386  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 24 times with reason: Connection refused retry after 3 seconds
W0926 21:18:26.030699  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 25 times with reason: Connection refused retry after 3 seconds
W0926 21:18:29.031008  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 26 times with reason: Connection refused retry after 3 seconds
W0926 21:18:32.031347  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 27 times with reason: Connection refused retry after 3 seconds
W0926 21:18:35.031663  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 28 times with reason: Connection refused retry after 3 seconds
W0926 21:18:38.032002  2314 gen_comm_id_helper.cc:120] connect addr=10.130.17.157:6070 failed 29 times with reason: Connection refused retry after 3 seconds
Training is start.
0 6
[{'tgt_label': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7068>, 'generation_mask': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7340>, 'pos_ids': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7500>, 'token_ids': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7378>, 'tgt_idx': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7148>, 'type_ids': <paddle.fluid.core_avx.LoDTensor object at 0x7f2ba1df7570>}]
jidlin commented 2 years ago

It turned out to be an NCCL environment problem after all: I needed to set the environment variables for RDMA communication. Thanks for helping track this down; I'm closing the issue.
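The thread doesn't record which variables were set in the end, but for reference, NCCL exposes standard environment variables for debugging and for choosing the RDMA/socket transport. The names below are NCCL's own; the values are placeholders that must match the actual NICs/HCAs on the machines, and they can equally be exported in the shell that runs fleetrun:

```python
# Illustrative only: common NCCL environment variables, set before NCCL is
# initialized (e.g. at the very top of the training script, or exported in
# the launch shell). Values are examples, not the ones used in this issue.
import os

os.environ["NCCL_DEBUG"] = "INFO"           # print NCCL transport / ring setup details
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"   # NIC used for bootstrap/socket traffic (example)
os.environ["NCCL_IB_DISABLE"] = "0"         # 0 = allow InfiniBand/RDMA, 1 = force TCP sockets
os.environ["NCCL_IB_HCA"] = "mlx5_0"        # RDMA HCA to use (example device name)
```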

jidlin commented 2 years ago

Could you try the CUDA 11 build of Paddle?

python -m pip install paddlepaddle-gpu==2.1.3.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html

Also, you can try looking for the NCCL .so files under /usr/local/lib/libnccl*.

In my tests, 2.0.1 runs fine in a CUDA 10.1 environment.

Multi-machine multi-GPU training works now. A separate question: when training with 3 machines, does every machine need to hold the full dataset, or do I need to manually split the dataset into 3 parts?

jidlin commented 2 years ago

Could you try the CUDA 11 build of Paddle?

python -m pip install paddlepaddle-gpu==2.1.3.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html Also, you can try looking for the NCCL .so files under /usr/local/lib/libnccl*. In my tests, 2.0.1 runs fine in a CUDA 10.1 environment.

Multi-machine multi-GPU training works now. A separate question: when training with 3 machines, does every machine need to hold the full dataset, or do I need to manually split the dataset into 3 parts?

Also, does Paddle support reading data from HDFS?

sserdoubleh commented 2 years ago

It turned out to be an NCCL environment problem after all: I needed to set the environment variables for RDMA communication. Thanks for helping track this down; I'm closing the issue.

For NCCL, you can set an environment variable to see debug information: export NCCL_DEBUG=INFO

Multi-machine multi-GPU training works now. A separate question: when training with 3 machines, does every machine need to hold the full dataset, or do I need to manually split the dataset into 3 parts?

Put the full dataset on every machine; Knover shards the data automatically (see the sharding sketch after this comment).

Also, does Paddle support reading data from HDFS?

Do you mean accessing HDFS data directly through a Python library (configuring hdfs_name + hdfs_ugi + hdfs_path)? That part isn't supported yet.
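To illustrate the answer about automatic sharding above: distributed readers commonly give each trainer an interleaved slice of the shared file list based on its rank. This is a generic sketch of that pattern, not Knover's exact implementation:

```python
# Generic rank-based sharding sketch (not Knover's exact code): every machine
# holds the same full file list, and each trainer keeps only its own slice.
def shard_filelist(files, trainer_id, trainers_num):
    """Return the subset of `files` that this trainer should read."""
    return files[trainer_id::trainers_num]

# Example with 3 trainers: trainer 0 reads parts 0, 3, 6, ...; trainer 1 reads 1, 4, 7, ...
all_files = [f"part-{i:05d}" for i in range(9)]
print(shard_filelist(all_files, trainer_id=1, trainers_num=3))
```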

jidlin commented 2 years ago

It turned out to be an NCCL environment problem after all: I needed to set the environment variables for RDMA communication. Thanks for helping track this down; I'm closing the issue.

For NCCL, you can set an environment variable to see debug information: export NCCL_DEBUG=INFO

Multi-machine multi-GPU training works now. A separate question: when training with 3 machines, does every machine need to hold the full dataset, or do I need to manually split the dataset into 3 parts?

Put the full dataset on every machine; Knover shards the data automatically.

Also, does Paddle support reading data from HDFS?

Do you mean accessing HDFS data directly through a Python library (configuring hdfs_name + hdfs_ugi + hdfs_path)? That part isn't supported yet.

Yes, TensorFlow supports reading datasets from HDFS. With large amounts of data, keeping a full local copy on every machine feels a bit cumbersome.

sserdoubleh commented 2 years ago

I use an internal tool that mounts the storage directly to the local filesystem, so HDFS support wasn't added. If you need it, a simple first step is to modify the file-reading code in this function: https://github.com/PaddlePaddle/Knover/blob/ac58d760973cacb163b5dc5e1be0b7c54ca75140/knover/data/dialog_reader.py#L346 Python library for reference: https://hdfscli.readthedocs.io/en/latest/quickstart.html#python-bindings
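For anyone trying that route, a minimal sketch of streaming a text file from HDFS with the hdfscli library linked above could look like the following; the WebHDFS endpoint, user name, and file path are placeholders, and hooking it into dialog_reader.py is left to the reader:

```python
# Minimal sketch using the `hdfs` (hdfscli) package from the link above.
# The WebHDFS endpoint, user, and path below are placeholders, not values
# from this issue.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:50070", user="hadoop_user")

# Stream a text file line by line, similar to reading a local filelist entry.
with client.read("/path/to/train/part-00000", encoding="utf-8", delimiter="\n") as reader:
    for line in reader:
        print(line)  # placeholder for the reader's record-parsing logic
```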