[Open] zzt941006 opened this issue 3 years ago
Judging from the log you posted, throughput_train : 487.252 seq/s looks like the speed of a single GPU, which suggests the 2-machine 16-GPU run did not actually start; when it starts correctly the speed is 5000+ seq/s. In addition, from the output:
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
it appears that when Horovod invoked MPI, some of the processes failed to start. We suggest interconnecting the docker containers over SSH on port 10000 to ensure connectivity, and making sure port 10000 is not occupied during communication.
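A minimal sketch of that setup inside each container (this assumes openssh-server is installed in the image; paths and service startup may differ per image):

# make sshd listen on port 10000 instead of 22
sed -i 's/^#\?Port .*/Port 10000/' /etc/ssh/sshd_config
/usr/sbin/sshd

# confirm nothing else occupies the port and sshd is listening on it
ss -ltnp | grep ':10000'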
I checked: the earlier error was indeed a failed process (the GPUs had been taken by someone else). But now, on node1 and node2:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     82490      C   python3                                    15283MiB |
|    1     82491      C   python3                                    15307MiB |
|    2     82492      C   python3                                    15283MiB |
|    3     82493      C   python3                                    15307MiB |
|    4     82494      C   python3                                    15307MiB |
|    5     82495      C   python3                                    15307MiB |
|    6     82496      C   python3                                    15307MiB |
|    7     82499      C   python3                                    15307MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5993      C   python3                                    15101MiB |
|    1      5994      C   python3                                    15101MiB |
|    2      5995      C   python3                                    15101MiB |
|    3      5996      C   python3                                    15101MiB |
|    4      5997      C   python3                                    15101MiB |
|    5      5998      C   python3                                    15101MiB |
|    6      5999      C   python3                                    15101MiB |
|    7      6000      C   python3                                    15101MiB |
+-----------------------------------------------------------------------------+

All of these processes were started by me, so the 2-machine 16-GPU job should have launched. Also, when I run, for example, mixed precision with BS=32:

python3 /workspace/bert/run_pretraining.py \
    --input_files_dir=/workspace/xxx/BERT/dataset/training/tfrecord \
    --eval_files_dir=/workspace/xxx/BERT/dataset/training/tfrecord \
    --output_dir=/results/tf_bert_pretraining_adam_base_fp16_gbs256_210201013148 \
    --bert_config_file=/workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json \
    --do_train=True --do_eval=False \
    --train_batch_size=32 --eval_batch_size=32 \
    --max_seq_length=128 --max_predictions_per_seq=20 \
    --num_train_steps=120 --num_warmup_steps=10000 \
    --num_accumulation_steps=1 --save_checkpoints_steps=10000 \
    --learning_rate=1e-4 --optimizer_type=adam \
    --horovod --use_fp16 --use_xla \
    --allreduce_post_accumulation=False

it also runs; the performance gap is just huge. So this does not feel like a node-connectivity problem, nor a case of the extra GPUs failing to start. Is there something else wrong with my launch command?
The key content of a log from one normal training run is as follows:
[1,6]<stderr>:INFO:tensorflow:Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,6]<stderr>: visible_device_list: "6"
[1,6]<stderr>:}
[1,6]<stderr>:graph_options {
[1,6]<stderr>: optimizer_options {
[1,6]<stderr>: global_jit_level: ON_1
[1,6]<stderr>: }
[1,6]<stderr>: rewrite_options {
[1,6]<stderr>: memory_optimization: NO_MEM_OPT
[1,6]<stderr>: }
[1,6]<stderr>:}
[1,6]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb4e14cd588>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,6]<stderr>:I0202 01:06:15.481197 140421715228480 estimator.py:212] Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,6]<stderr>: visible_device_list: "6"
[1,6]<stderr>:}
[1,6]<stderr>:graph_options {
[1,6]<stderr>: optimizer_options {
[1,6]<stderr>: global_jit_level: ON_1
[1,6]<stderr>: }
[1,6]<stderr>: rewrite_options {
[1,6]<stderr>: memory_optimization: NO_MEM_OPT
[1,6]<stderr>: }
[1,6]<stderr>:}
[1,6]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb4e14cd588>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,6]<stderr>:WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fb4e14c9e18>) includes params argument, but params are not passed to Estimator.
[1,6]<stderr>:W0202 01:06:15.481938 140421715228480 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fb4e14c9e18>) includes params argument, but params are not passed to Estimator.
[1,6]<stderr>:INFO:tensorflow:***** Running training *****
[1,6]<stderr>:I0202 01:06:15.482410 140421715228480 run_pretraining.py:622] ***** Running training *****
[1,6]<stderr>:INFO:tensorflow: Batch size = 64
[1,6]<stderr>:I0202 01:06:15.482483 140421715228480 run_pretraining.py:623] Batch size = 64
[1,0]<stderr>:INFO:tensorflow:***** Configuaration *****
[1,0]<stderr>:I0202 01:06:15.489460 140293405398848 run_pretraining.py:578] ***** Configuaration *****
[1,0]<stderr>:INFO:tensorflow: logtostderr: False
[1,0]<stderr>:I0202 01:06:15.489638 140293405398848 run_pretraining.py:580] logtostderr: False
[1,0]<stderr>:INFO:tensorflow: alsologtostderr: False
[1,0]<stderr>:I0202 01:06:15.489709 140293405398848 run_pretraining.py:580] alsologtostderr: False
[1,0]<stderr>:INFO:tensorflow: log_dir:
[1,0]<stderr>:I0202 01:06:15.489771 140293405398848 run_pretraining.py:580] log_dir:
[1,0]<stderr>:INFO:tensorflow: v: 0
[1,0]<stderr>:I0202 01:06:15.489861 140293405398848 run_pretraining.py:580] v: 0
[1,0]<stderr>:INFO:tensorflow: verbosity: 0
[1,0]<stderr>:I0202 01:06:15.489924 140293405398848 run_pretraining.py:580] verbosity: 0
[1,0]<stderr>:INFO:tensorflow: logger_levels: {}
[1,0]<stderr>:I0202 01:06:15.489985 140293405398848 run_pretraining.py:580] logger_levels: {}
[1,0]<stderr>:INFO:tensorflow: stderrthreshold: fatal
[1,0]<stderr>:I0202 01:06:15.490040 140293405398848 run_pretraining.py:580] stderrthreshold: fatal
[1,0]<stderr>:INFO:tensorflow: showprefixforinfo: True
[1,0]<stderr>:I0202 01:06:15.490094 140293405398848 run_pretraining.py:580] showprefixforinfo: True
[1,0]<stderr>:INFO:tensorflow: run_with_pdb: False
[1,0]<stderr>:I0202 01:06:15.490149 140293405398848 run_pretraining.py:580] run_with_pdb: False
[1,0]<stderr>:INFO:tensorflow: pdb_post_mortem: False
[1,0]<stderr>:I0202 01:06:15.490203 140293405398848 run_pretraining.py:580] pdb_post_mortem: False
[1,0]<stderr>:INFO:tensorflow: pdb: False
[1,0]<stderr>:I0202 01:06:15.490258 140293405398848 run_pretraining.py:580] pdb: False
[1,0]<stderr>:INFO:tensorflow: run_with_profiling: False
[1,0]<stderr>:I0202 01:06:15.490312 140293405398848 run_pretraining.py:580] run_with_profiling: False
[1,0]<stderr>:INFO:tensorflow: profile_file: None
[1,0]<stderr>:I0202 01:06:15.490366 140293405398848 run_pretraining.py:580] profile_file: None
[1,0]<stderr>:INFO:tensorflow: use_cprofile_for_profiling: True
[1,0]<stderr>:I0202 01:06:15.490419 140293405398848 run_pretraining.py:580] use_cprofile_for_profiling: True
[1,1]<stderr>:INFO:tensorflow:Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,1]<stderr>: visible_device_list: "1"
[1,1]<stderr>:}
[1,1]<stderr>:graph_options {
[1,1]<stderr>: optimizer_options {
[1,1]<stderr>: global_jit_level: ON_1
[1,1]<stderr>: }
[1,1]<stderr>: rewrite_options {
[1,1]<stderr>: memory_optimization: NO_MEM_OPT
[1,1]<stderr>: }
[1,1]<stderr>:}
[1,1]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efd0c56c400>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:INFO:tensorflow: only_check_args: False
[1,0]<stderr>:I0202 01:06:15.490473 140293405398848 run_pretraining.py:580] only_check_args: False
[1,1]<stderr>:I0202 01:06:15.489926 139632162826048 estimator.py:212] Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,1]<stderr>: visible_device_list: "1"
[1,1]<stderr>:}
[1,1]<stderr>:graph_options {
[1,1]<stderr>: optimizer_options {
[1,1]<stderr>: global_jit_level: ON_1
[1,1]<stderr>: }
[1,1]<stderr>: rewrite_options {
[1,1]<stderr>: memory_optimization: NO_MEM_OPT
[1,1]<stderr>: }
[1,1]<stderr>:}
[1,1]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efd0c56c400>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:INFO:tensorflow: op_conversion_fallback_to_while_loop: False
[1,0]<stderr>:I0202 01:06:15.490526 140293405398848 run_pretraining.py:580] op_conversion_fallback_to_while_loop: False
[1,0]<stderr>:INFO:tensorflow: test_random_seed: 301
[1,0]<stderr>:I0202 01:06:15.490580 140293405398848 run_pretraining.py:580] test_random_seed: 301
[1,0]<stderr>:INFO:tensorflow: test_srcdir:
[1,3]<stderr>:INFO:tensorflow:Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,3]<stderr>: visible_device_list: "3"
[1,3]<stderr>:}
[1,3]<stderr>:graph_options {
[1,3]<stderr>: optimizer_options {
[1,3]<stderr>: global_jit_level: ON_1
[1,3]<stderr>: }
[1,3]<stderr>: rewrite_options {
[1,3]<stderr>: memory_optimization: NO_MEM_OPT
[1,3]<stderr>: }
[1,3]<stderr>:}
[1,3]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f98f49914a8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:I0202 01:06:15.490633 140293405398848 run_pretraining.py:580] test_srcdir:
[1,0]<stderr>:INFO:tensorflow: test_tmpdir: /tmp/absl_testing
[1,0]<stderr>:I0202 01:06:15.490686 140293405398848 run_pretraining.py:580] test_tmpdir: /tmp/absl_testing
[1,3]<stderr>:I0202 01:06:15.490062 140301779154752 estimator.py:212] Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,3]<stderr>: visible_device_list: "3"
[1,3]<stderr>:}
[1,3]<stderr>:graph_options {
[1,3]<stderr>: optimizer_options {
[1,3]<stderr>: global_jit_level: ON_1
[1,3]<stderr>: }
[1,3]<stderr>: rewrite_options {
[1,3]<stderr>: memory_optimization: NO_MEM_OPT
[1,3]<stderr>: }
[1,3]<stderr>:}
[1,3]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f98f49914a8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:INFO:tensorflow: test_randomize_ordering_seed:
[1,0]<stderr>:I0202 01:06:15.490740 140293405398848 run_pretraining.py:580] test_randomize_ordering_seed:
[1,0]<stderr>:INFO:tensorflow: xml_output_file:
[1,0]<stderr>:I0202 01:06:15.490793 140293405398848 run_pretraining.py:580] xml_output_file:
[1,0]<stderr>:INFO:tensorflow: bert_config_file: /workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json
[1,0]<stderr>:I0202 01:06:15.490847 140293405398848 run_pretraining.py:580] bert_config_file: /workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json
[1,0]<stderr>:INFO:tensorflow: input_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,0]<stderr>:I0202 01:06:15.490902 140293405398848 run_pretraining.py:580] input_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,0]<stderr>:INFO:tensorflow: eval_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,1]<stderr>:WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7efd0c565e18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:I0202 01:06:15.490956 140293405398848 run_pretraining.py:580] eval_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,1]<stderr>:W0202 01:06:15.490641 139632162826048 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7efd0c565e18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:INFO:tensorflow: output_dir: /results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609
[1,0]<stderr>:I0202 01:06:15.491009 140293405398848 run_pretraining.py:580] output_dir: /results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609
[1,0]<stderr>:INFO:tensorflow: dllog_path: /results/bert_dllog.json
[1,0]<stderr>:I0202 01:06:15.491062 140293405398848 run_pretraining.py:580] dllog_path: /results/bert_dllog.json
[1,1]<stderr>:INFO:tensorflow:***** Running training *****
[1,0]<stderr>:INFO:tensorflow: init_checkpoint: None
[1,1]<stderr>:I0202 01:06:15.491082 139632162826048 run_pretraining.py:622] ***** Running training *****
[1,0]<stderr>:I0202 01:06:15.491116 140293405398848 run_pretraining.py:580] init_checkpoint: None
[1,0]<stderr>:INFO:tensorflow: optimizer_type: adam
[1,3]<stderr>:WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f98f498de18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:I0202 01:06:15.491169 140293405398848 run_pretraining.py:580] optimizer_type: adam
[1,3]<stderr>:W0202 01:06:15.490842 140301779154752 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f98f498de18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:INFO:tensorflow: max_seq_length: 128
[1,0]<stderr>:I0202 01:06:15.491223 140293405398848 run_pretraining.py:580] max_seq_length: 128
[1,0]<stderr>:INFO:tensorflow: max_predictions_per_seq: 20
[1,0]<stderr>:I0202 01:06:15.491277 140293405398848 run_pretraining.py:580] max_predictions_per_seq: 20
[1,3]<stderr>:INFO:tensorflow:***** Running training *****
[1,3]<stderr>:I0202 01:06:15.491304 140301779154752 run_pretraining.py:622] ***** Running training *****
[1,0]<stderr>:INFO:tensorflow: do_train: True
[1,0]<stderr>:I0202 01:06:15.491331 140293405398848 run_pretraining.py:580] do_train: True
[1,0]<stderr>:INFO:tensorflow: do_eval: False
[1,0]<stderr>:I0202 01:06:15.491385 140293405398848 run_pretraining.py:580] do_eval: False
[1,0]<stderr>:INFO:tensorflow: train_batch_size: 64
[1,1]<stderr>:INFO:tensorflow: Batch size = 64
[1,0]<stderr>:I0202 01:06:15.491438 140293405398848 run_pretraining.py:580] train_batch_size: 64
[1,1]<stderr>:I0202 01:06:15.491151 139632162826048 run_pretraining.py:623] Batch size = 64
[1,0]<stderr>:INFO:tensorflow: eval_batch_size: 32
[1,0]<stderr>:I0202 01:06:15.491491 140293405398848 run_pretraining.py:580] eval_batch_size: 32
[1,0]<stderr>:INFO:tensorflow: learning_rate: 0.0001
[1,0]<stderr>:I0202 01:06:15.491549 140293405398848 run_pretraining.py:580] learning_rate: 0.0001
[1,0]<stderr>:INFO:tensorflow: num_train_steps: 120
[1,0]<stderr>:I0202 01:06:15.491602 140293405398848 run_pretraining.py:580] num_train_steps: 120
[1,0]<stderr>:INFO:tensorflow: num_warmup_steps: 10000
[1,0]<stderr>:I0202 01:06:15.491654 140293405398848 run_pretraining.py:580] num_warmup_steps: 10000
[1,3]<stderr>:INFO:tensorflow: Batch size = 64
[1,3]<stderr>:I0202 01:06:15.491374 140301779154752 run_pretraining.py:623] Batch size = 64
[1,0]<stderr>:INFO:tensorflow: save_checkpoints_steps: 10000
[1,0]<stderr>:I0202 01:06:15.491708 140293405398848 run_pretraining.py:580] save_checkpoints_steps: 10000
[1,0]<stderr>:INFO:tensorflow: display_loss_steps: 10
[1,0]<stderr>:I0202 01:06:15.491761 140293405398848 run_pretraining.py:580] display_loss_steps: 10
[1,0]<stderr>:INFO:tensorflow: iterations_per_loop: 1000
[1,0]<stderr>:I0202 01:06:15.491813 140293405398848 run_pretraining.py:580] iterations_per_loop: 1000
[1,0]<stderr>:INFO:tensorflow: max_eval_steps: 100
[1,0]<stderr>:I0202 01:06:15.491866 140293405398848 run_pretraining.py:580] max_eval_steps: 100
[1,0]<stderr>:INFO:tensorflow: num_accumulation_steps: 1
[1,0]<stderr>:I0202 01:06:15.491919 140293405398848 run_pretraining.py:580] num_accumulation_steps: 1
[1,0]<stderr>:INFO:tensorflow: allreduce_post_accumulation: False
[1,0]<stderr>:I0202 01:06:15.491972 140293405398848 run_pretraining.py:580] allreduce_post_accumulation: False
[1,0]<stderr>:INFO:tensorflow: verbose_logging: False
[1,0]<stderr>:I0202 01:06:15.492025 140293405398848 run_pretraining.py:580] verbose_logging: False
[1,0]<stderr>:INFO:tensorflow: horovod: True
[1,0]<stderr>:I0202 01:06:15.492079 140293405398848 run_pretraining.py:580] horovod: True
[1,0]<stderr>:INFO:tensorflow: report_loss: True
[1,0]<stderr>:I0202 01:06:15.492132 140293405398848 run_pretraining.py:580] report_loss: True
[1,0]<stderr>:INFO:tensorflow: manual_fp16: False
[1,0]<stderr>:I0202 01:06:15.492185 140293405398848 run_pretraining.py:580] manual_fp16: False
[1,0]<stderr>:INFO:tensorflow: use_xla: True
[1,0]<stderr>:I0202 01:06:15.492238 140293405398848 run_pretraining.py:580] use_xla: True
[1,0]<stderr>:INFO:tensorflow: use_fp16: True
[1,0]<stderr>:I0202 01:06:15.492292 140293405398848 run_pretraining.py:580] use_fp16: True
[1,0]<stderr>:INFO:tensorflow: init_loss_scale: 4294967296
[1,0]<stderr>:I0202 01:06:15.492345 140293405398848 run_pretraining.py:580] init_loss_scale: 4294967296
[1,0]<stderr>:INFO:tensorflow: ?: False
[1,0]<stderr>:I0202 01:06:15.492399 140293405398848 run_pretraining.py:580] ?: False
[1,0]<stderr>:INFO:tensorflow: help: False
[1,0]<stderr>:I0202 01:06:15.492483 140293405398848 run_pretraining.py:580] help: False
[1,0]<stderr>:INFO:tensorflow: helpshort: False
[1,0]<stderr>:I0202 01:06:15.492537 140293405398848 run_pretraining.py:580] helpshort: False
[1,0]<stderr>:INFO:tensorflow: helpfull: False
[1,0]<stderr>:I0202 01:06:15.492591 140293405398848 run_pretraining.py:580] helpfull: False
[1,0]<stderr>:INFO:tensorflow: helpxml: False
[1,0]<stderr>:I0202 01:06:15.492644 140293405398848 run_pretraining.py:580] helpxml: False
[1,0]<stderr>:INFO:tensorflow:**************************
[1,0]<stderr>:I0202 01:06:15.492692 140293405398848 run_pretraining.py:581] **************************
[1,0]<stderr>:I0202 01:14:47.355644 140293405398848 run_pretraining.py:641] -----------------------------
[1,0]<stderr>:INFO:tensorflow:Total Training Time = 511.86 for Sentences = 122880
[1,0]<stderr>:I0202 01:14:47.355755 140293405398848 run_pretraining.py:643] Total Training Time = 511.86 for Sentences = 122880
[1,0]<stderr>:INFO:tensorflow:Total Training Time W/O Overhead = 241.08 for Sentences = 99328
[1,0]<stderr>:I0202 01:14:47.355857 140293405398848 run_pretraining.py:645] Total Training Time W/O Overhead = 241.08 for Sentences = 99328
[1,0]<stderr>:INFO:tensorflow:Throughput Average (sentences/sec) with overhead = 240.07
[1,0]<stderr>:I0202 01:14:47.355936 140293405398848 run_pretraining.py:646] Throughput Average (sentences/sec) with overhead = 240.07
[1,0]<stderr>:INFO:tensorflow:Throughput Average (sentences/sec) = 412.01
[1,0]<stderr>:I0202 01:14:47.356030 140293405398848 run_pretraining.py:647] Throughput Average (sentences/sec) = 412.01
[1,0]<stdout>:DLL 2021-02-02 01:14:47.356114 - throughput_train : 412.007 seq/s
[1,0]<stderr>:INFO:tensorflow:-----------------------------
[1,0]<stderr>:I0202 01:14:47.356315 140293405398848 run_pretraining.py:649] -----------------------------
Whenever the GPUs and their memory are free, training runs to completion and produces results; it is just extremely slow. I compared against your benchmark logs and the basic configuration is identical. When starting docker, I also made sure to launch in host network mode:
sudo docker run --gpus all -it --net=host --rm -v /home/xxx/workspace/example_test/cn-wiki-128:/workspace/xxx/BERT/dataset/ 3f91f9db07d4
I believe I executed everything according to the documentation and the known pitfalls; I went through another round of checks yesterday and still made no progress.
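A quick way to confirm the two containers actually reach each other on the suggested port (IP taken from run_two_node.sh later in the thread; a sketch, assuming sshd inside the containers listens on 10000):

# from inside the node1 container; should print node2's hostname with no password prompt
ssh -p 10000 10.113.216.121 hostname
# then repeat in the reverse direction from node2 back to node1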
Hi, a speed of 412 seq/s is similar to our single-machine, single-GPU speed (with fp16 & XLA). Are you sure all the GPUs are actually training? Also, we run BERT-base pretraining; please make sure the network is configured as BERT-base (check whether you accidentally ran BERT-large).
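(For scale, using the numbers quoted in this thread: if all 16 ranks each sustained the ~412 seq/s single-card speed, the aggregate would be roughly 16 × 412 ≈ 6,600 seq/s, consistent with the 5000+ seq/s expected for a healthy 2-machine run. An aggregate of 412-487 seq/s therefore corresponds to only about one GPU's worth of effective throughput.)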
1. Multi-GPU is definitely running:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    207278      C   python3                                    15289MiB |
|    1    207279      C   python3                                    15313MiB |
|    2    207280      C   python3                                    15289MiB |
|    3    207281      C   python3                                    15313MiB |
|    4    207282      C   python3                                    15313MiB |
|    5    207283      C   python3                                    15313MiB |
|    6    207284      C   python3                                    15313MiB |
|    7    207285      C   python3                                    15313MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     87780      C   python3                                    15289MiB |
|    1     87781      C   python3                                    15313MiB |
|    2     87782      C   python3                                    15289MiB |
|    3     87783      C   python3                                    15313MiB |
|    4     87784      C   python3                                    15313MiB |
|    5     87785      C   python3                                    15313MiB |
|    6     87786      C   python3                                    15313MiB |
|    7     87787      C   python3                                    15313MiB |
+-----------------------------------------------------------------------------+

All 8 GPUs on both machines are in use by me.

2. The script I run is the one you provided, run_two_node.sh:
WORKSPACE="/workspace/bert"
DATA_DIR="/workspace/xxx/BERT/dataset/training/tfrecord"
BATCH_SIZE=${1:-32}      # per-GPU batch size
DTYPE=${2:-"fp32"}       # fp32 or fp16
USE_XLA=${3:-"false"}
NUM_TEST=${4:-5}         # number of repeated test cases
GPUS_PER_NODE=8
NODE1=localhost:$GPUS_PER_NODE
NODE2=10.113.216.121:$GPUS_PER_NODE
NODES=${4:-$NODE1,$NODE2}    # note: reuses $4, which also feeds NUM_TEST above
i=1
while [ $i -le $NUM_TEST ]
do
    bash ${WORKSPACE}/scripts/multi_node_run_pretraining_adam.sh ${DATA_DIR} 8 ${BATCH_SIZE} 120 $DTYPE $USE_XLA $NODES $i
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done
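With this argument layout, the invocation used later in the thread, bash run_two_node.sh 64 fp16 true, runs NUM_TEST=5 cases at per-GPU batch size 64 with fp16 and XLA enabled on both nodes.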
Then in the multi-node script I changed the relevant paths:
DATA_DIR=/workspace/xxx/BERT/dataset/training/tfrecord
if [ "$bert_model" = "large" ] ; then
export BERT_CONFIG=/workspace/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_config.json
else
export BERT_CONFIG=/workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json
#export BERT_CONFIG=data/bert_config.json
fi
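A quick way to confirm which model the loaded config actually describes (a sketch using the path from the snippet above and the standard keys of Google's bert_config.json):

python3 -c "import json; c = json.load(open('/workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json')); print(c['num_hidden_layers'], c['hidden_size'], c['num_attention_heads'])"
# BERT-base should print "12 768 12"; BERT-large would print "24 1024 16"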
So it should indeed be running base, because uncased_L-24_H-1024_A-16 does not exist in the docker image. One difference: we run the nvcr.io/nvidia/tensorflow:20.11-tf1-py3 image (TF version 1.15.4 + nv), which is somewhat newer. Still, single-machine runs on one GPU and on multiple GPUs, in full and mixed precision and at different batch sizes, all basically match expectations.

Another observation also suggests that multi-machine multi-GPU really started: if any single GPU on either machine is occupied (even by just 200M), the run fails immediately with OOM or similar. If only a single GPU on the host machine had been launched, a GPU being occupied on the other machine would surely not cause an immediate error, right?

One more note: this image does not ship openmpi, so even with passwordless login the run initially had problems. Following issue75, I installed openmpi manually and it then executed normally. (I feel I have already gone through every known pitfall without progress, which is why I am asking for help.)
If convenient, could you zip up the scripts/code and model config and let us take a look? You can attach it under this issue or send it to zhaoluyang@oneflow.org.
That may be difficult; I would have to ask for approval first. One more question: is there any way to confirm that the 16 GPUs I launched are actually running in parallel? What worries me now is that all 16 GPUs are up but executing serially rather than in parallel. Since I launched them with horovod, though, they should be running in parallel, right?
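One way to check (a sketch; it assumes the container's horovod can run the host list from run_two_node.sh): if every rank reports size 16 and all 16 GPUs show high compute utilization at the same time, the ranks are running in parallel rather than serially.

# every rank should print a distinct rank id and size=16
# (add --ssh-port 10000 if sshd in the containers listens on port 10000 as suggested above)
horovodrun -np 16 -H localhost:8,10.113.216.121:8 \
    python3 -c "import horovod.tensorflow as hvd; hvd.init(); print('rank', hvd.rank(), 'of', hvd.size())"

# while a job is training, watch whether all GPUs compute simultaneously rather than one at a time
nvidia-smi --query-gpu=index,utilization.gpu --format=csv -l 1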
On the two servers I started two containers and installed openmpi and the other communication tools inside them. A quick test with the horovodrun command suggests the connection works.
I then verified a 16-GPU horovodrun as well; that also seemed fine.
After configuring the relevant paths, I executed bash run_two_node.sh 64 fp16 true. In multi_node_run_pretraining_adam.sh I added --start-timeout 300, since otherwise it reported a timeout error. With that, the run completes normally, but the performance is clearly wrong:
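For reference, --start-timeout is a horovodrun flag, so the launch line inside the multi-node script would look roughly like this (a sketch; the real script's remaining arguments are omitted):

horovodrun -np 16 -H localhost:8,10.113.216.121:8 --start-timeout 300 \
    python3 /workspace/bert/run_pretraining.py ...   # remaining flags as in the command earlier in the thread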
At the end of every run it also prints the messages below. Is something wrong here? It feels like the connection was established and results came back, so does that mean there was no communication deadlock?
I then changed the run parameters to batch size 32 with mixed precision, bash run_two_node.sh 32 fp16 true, and this time it errored out immediately:
Very strange. Is there something wrong with how I am running this?