Oneflow-Inc / DLPerf

DeepLearning Framework Performance Profiling Toolkit
Apache License 2.0

tf1.x + NGC: questions about two-node training #119

Open zzt941006 opened 3 years ago

zzt941006 commented 3 years ago

I started two containers on two servers and installed OpenMPI and the other communication tools inside them. A quick test with the horovodrun command seems to show the connection is working:

horovodrun -np 8  -H localhost:8  -p 10000  echo "233"
2021-01-30 03:50:03.454606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
horovodrun -np 8  -H node2:8  -p 10000  echo "233"
2021-01-30 03:51:07.190350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
horovodrun -np 8  -H node1:8,node2:8  -p 10000 --start-timeout 100 echo "233"
2021-01-30 03:53:35.059288: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233

Next I verified a 16-GPU horovodrun, and that also seems to be fine?

horovodrun -np 16  -H node1:8,node2:8  -p 10000 --start-timeout 100 echo "233"
2021-01-30 06:26:32.968745: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,8]<stdout>:233
[1,9]<stdout>:233
[1,10]<stdout>:233
[1,11]<stdout>:233
[1,12]<stdout>:233
[1,13]<stdout>:233
[1,14]<stdout>:233
[1,15]<stdout>:233
[1,0]<stdout>:233
[1,1]<stdout>:233
[1,2]<stdout>:233
[1,3]<stdout>:233
[1,4]<stdout>:233
[1,5]<stdout>:233
[1,6]<stdout>:233
[1,7]<stdout>:233
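
As a stronger check than echo (which only proves the launcher can spawn processes on both hosts, not that GPU tensors can actually be exchanged), something like the following minimal script could be run under the same horovodrun command. This is only a sketch, assuming the Horovod/TensorFlow 1.x stack from the NGC image; the file name hvd_check.py is hypothetical:

# hvd_check.py (hypothetical) -- launch with the same command as training, e.g.
#   horovodrun -np 16 -H node1:8,node2:8 -p 10000 --start-timeout 100 python3 hvd_check.py
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin each rank to its local GPU, the same way run_pretraining.py does.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# A tiny allreduce: sess.run() only returns once every rank has contributed,
# so completing it shows all ranks joined one communicator and that
# cross-node NCCL/MPI traffic actually works.
avg = hvd.allreduce(tf.constant(1.0))

with tf.Session(config=config) as sess:
    value = sess.run(avg)
    print("rank %d of %d (local GPU %d): allreduce returned %.1f"
          % (hvd.rank(), hvd.size(), hvd.local_rank(), value))

If every rank prints "of 16", the launch is a single 16-rank communicator; if, say, the ranks report "of 8", two separate 8-rank jobs were started instead.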

I then configured the relevant paths and ran bash run_two_node.sh 64 fp16 true. Of course, in multi xxx.sh I added --start-timeout 300, otherwise it seemed to time out with an error. After that it runs, but the performance is clearly wrong:

[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:35.188454 - Iteration: 1  throughput_train : 487.252 seq/s mlm_loss : 10.4442  nsp_loss : 0.6882  total_loss : 11.1323  avg_loss_step : 11.1323  learning_rate : 0.0  loss_scaler : 67108864
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:37.309305 - Iteration: 1  throughput_train : 482.925 seq/s mlm_loss : 10.4311  nsp_loss : 0.7286  total_loss : 11.1597  avg_loss_step : 11.1597  learning_rate : 0.0  loss_scaler : 33554432
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:39.419278 - Iteration: 1  throughput_train : 485.466 seq/s mlm_loss : 10.4503  nsp_loss : 0.7116  total_loss : 11.1619  avg_loss_step : 11.1619  learning_rate : 0.0  loss_scaler : 16777216
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:41.516706 - Iteration: 1  throughput_train : 488.409 seq/s mlm_loss : 10.4479  nsp_loss : 0.7024  total_loss : 11.1503  avg_loss_step : 11.1503  learning_rate : 0.0  loss_scaler : 8388608
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:43.757067 - Iteration: 1  throughput_train : 457.224 seq/s mlm_loss : 10.4415  nsp_loss : 0.7040  total_loss : 11.1455  avg_loss_step : 11.1455  learning_rate : 0.0  loss_scaler : 4194304
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:45.888906 - Iteration: 1  throughput_train : 480.491 seq/s mlm_loss : 10.4445  nsp_loss : 0.7108  total_loss : 11.1553  avg_loss_step : 11.1553  learning_rate : 0.0  loss_scaler : 2097152
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:47.978569 - Iteration: 1  throughput_train : 490.213 seq/s mlm_loss : 10.4558  nsp_loss : 0.7030  total_loss : 11.1589  avg_loss_step : 11.1589  learning_rate : 0.0  loss_scaler : 1048576
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:50.481595 - Iteration: 1  throughput_train : 409.234 seq/s mlm_loss : 10.4543  nsp_loss : 0.7094  total_loss : 11.1637  avg_loss_step : 11.1637  learning_rate : 0.0  loss_scaler : 524288
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:53.738915 - Iteration: 1  throughput_train : 314.426 seq/s mlm_loss : 10.4676  nsp_loss : 0.7103  total_loss : 11.1779  avg_loss_step : 11.1779  learning_rate : 0.0  loss_scaler : 262144
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:37:57.473934 - Iteration: 1  throughput_train : 274.211 seq/s mlm_loss : 10.4273  nsp_loss : 0.7190  total_loss : 11.1463  avg_loss_step : 11.1463  learning_rate : 0.0  loss_scaler : 131072
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:38:01.415495 - Iteration: 1  throughput_train : 259.842 seq/s mlm_loss : 10.4480  nsp_loss : 0.7205  total_loss : 11.1685  avg_loss_step : 11.1685  learning_rate : 0.0  loss_scaler : 65536
[1,0]<stdout>:Skipping time record for  0  due to checkpoint-saving/warmup overhead
[1,0]<stdout>:DLL 2021-01-30 03:38:03.598933 - Iteration: 1  throughput_train : 469.080 seq/s mlm_loss : 10.4538  nsp_loss : 0.7007  total_loss : 11.1545  avg_loss_step : 11.1545  learning_rate : 0.0  loss_scaler : 32768
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node1
  Local PID:  197398
  Peer host:  node2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 217958 on node node2 exited on signal 9 (Killed).
--------------------------------------------------------------------------
link ok
Writting log to ../logs//ngc/tensorflow/bert/bz64/2n8g/bert_b64_fp16_2.log

These messages are also printed at the end of every round. What is going wrong here? It seems the nodes did connect and results were produced, so does that mean there was no communication deadlock?

I then changed the run parameters to a batch size of 32 with mixed precision: bash run_two_node.sh 32 fp16 true. This time it errored out directly:

[1,7]<stderr>:   [[node HorovodBroadcast_bert_encoder_layer_9_output_dense_bias_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[1,7]<stderr>:
[1,7]<stderr>:Original stack trace for 'HorovodBroadcast_bert_encoder_layer_9_output_dense_bias_0':
[1,7]<stderr>:  File "/workspace/bert/run_pretraining.py", line 713, in <module>
[1,7]<stderr>:    tf.compat.v1.app.run()
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
[1,7]<stderr>:    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
[1,7]<stderr>:    _run_main(main, args)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
[1,7]<stderr>:    sys.exit(main(argv))
[1,7]<stderr>:  File "/workspace/bert/run_pretraining.py", line 633, in main
[1,7]<stderr>:    estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
[1,7]<stderr>:    loss = self._train_model(input_fn, hooks, saving_listeners)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
[1,7]<stderr>:    return self._train_model_default(input_fn, hooks, saving_listeners)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
[1,7]<stderr>:    saving_listeners)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
[1,7]<stderr>:    log_step_count_steps=log_step_count_steps) as mon_sess:
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
[1,7]<stderr>:    stop_grace_period_secs=stop_grace_period_secs)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
[1,7]<stderr>:    stop_grace_period_secs=stop_grace_period_secs)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
[1,7]<stderr>:    h.begin()
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 240, in begin
[1,7]<stderr>:    self.bcast_op = broadcast_global_variables(self.root_rank)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 191, in broadcast_global_variables
[1,7]<stderr>:    return broadcast_variables(_global_variables(), root_rank)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 56, in broadcast_variables
[1,7]<stderr>:    return broadcast_group(variables, root_rank)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1,7]<stderr>:    for var in variables])
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1,7]<stderr>:    for var in variables])
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 198, in broadcast
[1,7]<stderr>:    ignore_name_scope=ignore_name_scope)
[1,7]<stderr>:  File "<string>", line 320, in horovod_broadcast
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
[1,7]<stderr>:    op_def=op_def)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
[1,7]<stderr>:    return func(*args, **kwargs)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
[1,7]<stderr>:    attrs, op_def, compute_device)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",[1,7]<stderr>: line 3426, in _create_op_internal
[1,7]<stderr>:    op_def=op_def)
[1,7]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
[1,7]<stderr>:    self._traceback = tf_stack.extract_stack()
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
[1,6]<stderr>:    return fn(*args)
[1,6]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
[1,6]<stderr>:    target_list, run_metadata)
[1,6]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
[1,6]<stderr>:    run_metadata)

Very strange. Is there something wrong with how I am running this?

Flowingsun007 commented 3 years ago

Judging from the log above, throughput_train : 487.252 seq/s looks like single-GPU speed, which suggests 2-node 16-GPU training was not actually enabled; when it starts correctly the speed is around 5000+ seq/s. Also, from the output:

An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

it looks like some processes failed to start when Horovod invoked MPI. I suggest verifying that the docker containers can reach each other over SSH on port 10000 to confirm connectivity, and making sure port 10000 is not occupied by anything else during communication.
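
For illustration only, a quick way to test TCP reachability of the sshd port that horovodrun uses (-p 10000) from inside each container; this is just a sketch, and the file name port_check.py is hypothetical:

# port_check.py (hypothetical) -- from the node1 container run
#   python3 port_check.py node2 10000
# and the reverse from node2, to confirm each container can reach the
# other's SSH daemon on the port passed to horovodrun via -p.
import socket
import sys

host = sys.argv[1]
port = int(sys.argv[2]) if len(sys.argv) > 2 else 10000

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect((host, port))
    print("TCP connection to %s:%d succeeded" % (host, port))
except OSError as e:
    print("TCP connection to %s:%d failed: %s" % (host, port, e))
finally:
    s.close()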

zzt941006 commented 3 years ago

Judging from the log above, throughput_train : 487.252 seq/s looks like single-GPU speed, which suggests 2-node 16-GPU training was not actually enabled; when it starts correctly the speed is around 5000+ seq/s. Also, from the output:

An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

it looks like some processes failed to start when Horovod invoked MPI. I suggest verifying that the docker containers can reach each other over SSH on port 10000 to confirm connectivity, and making sure port 10000 is not occupied by anything else during communication.

I looked into it; the earlier error was apparently a process failure (the GPUs had been taken by someone else). But on node1 and node2:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     82490      C   python3                                    15283MiB |
|    1     82491      C   python3                                    15307MiB |
|    2     82492      C   python3                                    15283MiB |
|    3     82493      C   python3                                    15307MiB |
|    4     82494      C   python3                                    15307MiB |
|    5     82495      C   python3                                    15307MiB |
|    6     82496      C   python3                                    15307MiB |
|    7     82499      C   python3                                    15307MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5993      C   python3                                    15101MiB |
|    1      5994      C   python3                                    15101MiB |
|    2      5995      C   python3                                    15101MiB |
|    3      5996      C   python3                                    15101MiB |
|    4      5997      C   python3                                    15101MiB |
|    5      5998      C   python3                                    15101MiB |
|    6      5999      C   python3                                    15101MiB |
|    7      6000      C   python3                                    15101MiB |
+-----------------------------------------------------------------------------+

all of these processes were started by me, so 2 nodes x 8 GPUs should indeed have been launched. Also, when I run BS=32 mixed precision, for example:

python3 /workspace/bert/run_pretraining.py --input_files_dir=/workspace/xxx/BERT/dataset/training/tfrecord --eval_files_dir=/workspace/xxx/BERT/dataset/training/tfrecord --output_dir=/results/tf_bert_pretraining_adam_base_fp16_gbs256_210201013148 --bert_config_file=/workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json --do_train=True --do_eval=False --train_batch_size=32 --eval_batch_size=32 --max_seq_length=128 --max_predictions_per_seq=20 --num_train_steps=120 --num_warmup_steps=10000 --num_accumulation_steps=1 --save_checkpoints_steps=10000 --learning_rate=1e-4 --optimizer_type=adam --horovod --use_fp16 --use_xla --allreduce_post_accumulation=False

it also runs; it is just that the performance gap is huge. So it does not feel like a problem of the nodes not being connected, nor of the extra GPUs not starting. Is there still something wrong with the command I am running?

zzt941006 commented 3 years ago

Below is the main content of the log from one run that trained normally:

[1,6]<stderr>:INFO:tensorflow:Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,6]<stderr>:  visible_device_list: "6"
[1,6]<stderr>:}
[1,6]<stderr>:graph_options {
[1,6]<stderr>:  optimizer_options {
[1,6]<stderr>:    global_jit_level: ON_1
[1,6]<stderr>:  }
[1,6]<stderr>:  rewrite_options {
[1,6]<stderr>:    memory_optimization: NO_MEM_OPT
[1,6]<stderr>:  }
[1,6]<stderr>:}
[1,6]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb4e14cd588>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,6]<stderr>:I0202 01:06:15.481197 140421715228480 estimator.py:212] Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,6]<stderr>:  visible_device_list: "6"
[1,6]<stderr>:}
[1,6]<stderr>:graph_options {
[1,6]<stderr>:  optimizer_options {
[1,6]<stderr>:    global_jit_level: ON_1
[1,6]<stderr>:  }
[1,6]<stderr>:  rewrite_options {
[1,6]<stderr>:    memory_optimization: NO_MEM_OPT
[1,6]<stderr>:  }
[1,6]<stderr>:}
[1,6]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb4e14cd588>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,6]<stderr>:WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fb4e14c9e18>) includes params argument, but params are not passed to Estimator.
[1,6]<stderr>:W0202 01:06:15.481938 140421715228480 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fb4e14c9e18>) includes params argument, but params are not passed to Estimator.
[1,6]<stderr>:INFO:tensorflow:***** Running training *****
[1,6]<stderr>:I0202 01:06:15.482410 140421715228480 run_pretraining.py:622] ***** Running training *****
[1,6]<stderr>:INFO:tensorflow:  Batch size = 64
[1,6]<stderr>:I0202 01:06:15.482483 140421715228480 run_pretraining.py:623]   Batch size = 64
[1,0]<stderr>:INFO:tensorflow:***** Configuaration *****
[1,0]<stderr>:I0202 01:06:15.489460 140293405398848 run_pretraining.py:578] ***** Configuaration *****
[1,0]<stderr>:INFO:tensorflow:  logtostderr: False
[1,0]<stderr>:I0202 01:06:15.489638 140293405398848 run_pretraining.py:580]   logtostderr: False
[1,0]<stderr>:INFO:tensorflow:  alsologtostderr: False
[1,0]<stderr>:I0202 01:06:15.489709 140293405398848 run_pretraining.py:580]   alsologtostderr: False
[1,0]<stderr>:INFO:tensorflow:  log_dir:
[1,0]<stderr>:I0202 01:06:15.489771 140293405398848 run_pretraining.py:580]   log_dir:
[1,0]<stderr>:INFO:tensorflow:  v: 0
[1,0]<stderr>:I0202 01:06:15.489861 140293405398848 run_pretraining.py:580]   v: 0
[1,0]<stderr>:INFO:tensorflow:  verbosity: 0
[1,0]<stderr>:I0202 01:06:15.489924 140293405398848 run_pretraining.py:580]   verbosity: 0
[1,0]<stderr>:INFO:tensorflow:  logger_levels: {}
[1,0]<stderr>:I0202 01:06:15.489985 140293405398848 run_pretraining.py:580]   logger_levels: {}
[1,0]<stderr>:INFO:tensorflow:  stderrthreshold: fatal
[1,0]<stderr>:I0202 01:06:15.490040 140293405398848 run_pretraining.py:580]   stderrthreshold: fatal
[1,0]<stderr>:INFO:tensorflow:  showprefixforinfo: True
[1,0]<stderr>:I0202 01:06:15.490094 140293405398848 run_pretraining.py:580]   showprefixforinfo: True
[1,0]<stderr>:INFO:tensorflow:  run_with_pdb: False
[1,0]<stderr>:I0202 01:06:15.490149 140293405398848 run_pretraining.py:580]   run_with_pdb: False
[1,0]<stderr>:INFO:tensorflow:  pdb_post_mortem: False
[1,0]<stderr>:I0202 01:06:15.490203 140293405398848 run_pretraining.py:580]   pdb_post_mortem: False
[1,0]<stderr>:INFO:tensorflow:  pdb: False
[1,0]<stderr>:I0202 01:06:15.490258 140293405398848 run_pretraining.py:580]   pdb: False
[1,0]<stderr>:INFO:tensorflow:  run_with_profiling: False
[1,0]<stderr>:I0202 01:06:15.490312 140293405398848 run_pretraining.py:580]   run_with_profiling: False
[1,0]<stderr>:INFO:tensorflow:  profile_file: None
[1,0]<stderr>:I0202 01:06:15.490366 140293405398848 run_pretraining.py:580]   profile_file: None
[1,0]<stderr>:INFO:tensorflow:  use_cprofile_for_profiling: True
[1,0]<stderr>:I0202 01:06:15.490419 140293405398848 run_pretraining.py:580]   use_cprofile_for_profiling: True
[1,1]<stderr>:INFO:tensorflow:Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,1]<stderr>:  visible_device_list: "1"
[1,1]<stderr>:}
[1,1]<stderr>:graph_options {
[1,1]<stderr>:  optimizer_options {
[1,1]<stderr>:    global_jit_level: ON_1
[1,1]<stderr>:  }
[1,1]<stderr>:  rewrite_options {
[1,1]<stderr>:    memory_optimization: NO_MEM_OPT
[1,1]<stderr>:  }
[1,1]<stderr>:}
[1,1]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efd0c56c400>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:INFO:tensorflow:  only_check_args: False
[1,0]<stderr>:I0202 01:06:15.490473 140293405398848 run_pretraining.py:580]   only_check_args: False
[1,1]<stderr>:I0202 01:06:15.489926 139632162826048 estimator.py:212] Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,1]<stderr>:  visible_device_list: "1"
[1,1]<stderr>:}
[1,1]<stderr>:graph_options {
[1,1]<stderr>:  optimizer_options {
[1,1]<stderr>:    global_jit_level: ON_1
[1,1]<stderr>:  }
[1,1]<stderr>:  rewrite_options {
[1,1]<stderr>:    memory_optimization: NO_MEM_OPT
[1,1]<stderr>:  }
[1,1]<stderr>:}
[1,1]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7efd0c56c400>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:INFO:tensorflow:  op_conversion_fallback_to_while_loop: False
[1,0]<stderr>:I0202 01:06:15.490526 140293405398848 run_pretraining.py:580]   op_conversion_fallback_to_while_loop: False
[1,0]<stderr>:INFO:tensorflow:  test_random_seed: 301
[1,0]<stderr>:I0202 01:06:15.490580 140293405398848 run_pretraining.py:580]   test_random_seed: 301
[1,0]<stderr>:INFO:tensorflow:  test_srcdir:
[1,3]<stderr>:INFO:tensorflow:Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,3]<stderr>:  visible_device_list: "3"
[1,3]<stderr>:}
[1,3]<stderr>:graph_options {
[1,3]<stderr>:  optimizer_options {
[1,3]<stderr>:    global_jit_level: ON_1
[1,3]<stderr>:  }
[1,3]<stderr>:  rewrite_options {
[1,3]<stderr>:    memory_optimization: NO_MEM_OPT
[1,3]<stderr>:  }
[1,3]<stderr>:}
[1,3]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f98f49914a8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:I0202 01:06:15.490633 140293405398848 run_pretraining.py:580]   test_srcdir:
[1,0]<stderr>:INFO:tensorflow:  test_tmpdir: /tmp/absl_testing
[1,0]<stderr>:I0202 01:06:15.490686 140293405398848 run_pretraining.py:580]   test_tmpdir: /tmp/absl_testing
[1,3]<stderr>:I0202 01:06:15.490062 140301779154752 estimator.py:212] Using config: {'_model_dir': '/results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609', '_tf_random_seed': None, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
[1,3]<stderr>:  visible_device_list: "3"
[1,3]<stderr>:}
[1,3]<stderr>:graph_options {
[1,3]<stderr>:  optimizer_options {
[1,3]<stderr>:    global_jit_level: ON_1
[1,3]<stderr>:  }
[1,3]<stderr>:  rewrite_options {
[1,3]<stderr>:    memory_optimization: NO_MEM_OPT
[1,3]<stderr>:  }
[1,3]<stderr>:}
[1,3]<stderr>:, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 10000, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f98f49914a8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[1,0]<stderr>:INFO:tensorflow:  test_randomize_ordering_seed:
[1,0]<stderr>:I0202 01:06:15.490740 140293405398848 run_pretraining.py:580]   test_randomize_ordering_seed:
[1,0]<stderr>:INFO:tensorflow:  xml_output_file:
[1,0]<stderr>:I0202 01:06:15.490793 140293405398848 run_pretraining.py:580]   xml_output_file:
[1,0]<stderr>:INFO:tensorflow:  bert_config_file: /workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json
[1,0]<stderr>:I0202 01:06:15.490847 140293405398848 run_pretraining.py:580]   bert_config_file: /workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json
[1,0]<stderr>:INFO:tensorflow:  input_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,0]<stderr>:I0202 01:06:15.490902 140293405398848 run_pretraining.py:580]   input_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,0]<stderr>:INFO:tensorflow:  eval_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,1]<stderr>:WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7efd0c565e18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:I0202 01:06:15.490956 140293405398848 run_pretraining.py:580]   eval_files_dir: /workspace/XXX/BERT/dataset/training/tfrecord
[1,1]<stderr>:W0202 01:06:15.490641 139632162826048 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7efd0c565e18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:INFO:tensorflow:  output_dir: /results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609
[1,0]<stderr>:I0202 01:06:15.491009 140293405398848 run_pretraining.py:580]   output_dir: /results/tf_bert_pretraining_adam_base_fp16_gbs512_210202010609
[1,0]<stderr>:INFO:tensorflow:  dllog_path: /results/bert_dllog.json
[1,0]<stderr>:I0202 01:06:15.491062 140293405398848 run_pretraining.py:580]   dllog_path: /results/bert_dllog.json
[1,1]<stderr>:INFO:tensorflow:***** Running training *****
[1,0]<stderr>:INFO:tensorflow:  init_checkpoint: None
[1,1]<stderr>:I0202 01:06:15.491082 139632162826048 run_pretraining.py:622] ***** Running training *****
[1,0]<stderr>:I0202 01:06:15.491116 140293405398848 run_pretraining.py:580]   init_checkpoint: None
[1,0]<stderr>:INFO:tensorflow:  optimizer_type: adam
[1,3]<stderr>:WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f98f498de18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:I0202 01:06:15.491169 140293405398848 run_pretraining.py:580]   optimizer_type: adam
[1,3]<stderr>:W0202 01:06:15.490842 140301779154752 model_fn.py:630] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f98f498de18>) includes params argument, but params are not passed to Estimator.
[1,0]<stderr>:INFO:tensorflow:  max_seq_length: 128
[1,0]<stderr>:I0202 01:06:15.491223 140293405398848 run_pretraining.py:580]   max_seq_length: 128
[1,0]<stderr>:INFO:tensorflow:  max_predictions_per_seq: 20
[1,0]<stderr>:I0202 01:06:15.491277 140293405398848 run_pretraining.py:580]   max_predictions_per_seq: 20
[1,3]<stderr>:INFO:tensorflow:***** Running training *****
[1,3]<stderr>:I0202 01:06:15.491304 140301779154752 run_pretraining.py:622] ***** Running training *****
[1,0]<stderr>:INFO:tensorflow:  do_train: True
[1,0]<stderr>:I0202 01:06:15.491331 140293405398848 run_pretraining.py:580]   do_train: True
[1,0]<stderr>:INFO:tensorflow:  do_eval: False
[1,0]<stderr>:I0202 01:06:15.491385 140293405398848 run_pretraining.py:580]   do_eval: False
[1,0]<stderr>:INFO:tensorflow:  train_batch_size: 64
[1,1]<stderr>:INFO:tensorflow:  Batch size = 64
[1,0]<stderr>:I0202 01:06:15.491438 140293405398848 run_pretraining.py:580]   train_batch_size: 64
[1,1]<stderr>:I0202 01:06:15.491151 139632162826048 run_pretraining.py:623]   Batch size = 64
[1,0]<stderr>:INFO:tensorflow:  eval_batch_size: 32
[1,0]<stderr>:I0202 01:06:15.491491 140293405398848 run_pretraining.py:580]   eval_batch_size: 32
[1,0]<stderr>:INFO:tensorflow:  learning_rate: 0.0001
[1,0]<stderr>:I0202 01:06:15.491549 140293405398848 run_pretraining.py:580]   learning_rate: 0.0001
[1,0]<stderr>:INFO:tensorflow:  num_train_steps: 120
[1,0]<stderr>:I0202 01:06:15.491602 140293405398848 run_pretraining.py:580]   num_train_steps: 120
[1,0]<stderr>:INFO:tensorflow:  num_warmup_steps: 10000
[1,0]<stderr>:I0202 01:06:15.491654 140293405398848 run_pretraining.py:580]   num_warmup_steps: 10000
[1,3]<stderr>:INFO:tensorflow:  Batch size = 64
[1,3]<stderr>:I0202 01:06:15.491374 140301779154752 run_pretraining.py:623]   Batch size = 64
[1,0]<stderr>:INFO:tensorflow:  save_checkpoints_steps: 10000
[1,0]<stderr>:I0202 01:06:15.491708 140293405398848 run_pretraining.py:580]   save_checkpoints_steps: 10000
[1,0]<stderr>:INFO:tensorflow:  display_loss_steps: 10
[1,0]<stderr>:I0202 01:06:15.491761 140293405398848 run_pretraining.py:580]   display_loss_steps: 10
[1,0]<stderr>:INFO:tensorflow:  iterations_per_loop: 1000
[1,0]<stderr>:I0202 01:06:15.491813 140293405398848 run_pretraining.py:580]   iterations_per_loop: 1000
[1,0]<stderr>:INFO:tensorflow:  max_eval_steps: 100
[1,0]<stderr>:I0202 01:06:15.491866 140293405398848 run_pretraining.py:580]   max_eval_steps: 100
[1,0]<stderr>:INFO:tensorflow:  num_accumulation_steps: 1
[1,0]<stderr>:I0202 01:06:15.491919 140293405398848 run_pretraining.py:580]   num_accumulation_steps: 1
[1,0]<stderr>:INFO:tensorflow:  allreduce_post_accumulation: False
[1,0]<stderr>:I0202 01:06:15.491972 140293405398848 run_pretraining.py:580]   allreduce_post_accumulation: False
[1,0]<stderr>:INFO:tensorflow:  verbose_logging: False
[1,0]<stderr>:I0202 01:06:15.492025 140293405398848 run_pretraining.py:580]   verbose_logging: False
[1,0]<stderr>:INFO:tensorflow:  horovod: True
[1,0]<stderr>:I0202 01:06:15.492079 140293405398848 run_pretraining.py:580]   horovod: True
[1,0]<stderr>:INFO:tensorflow:  report_loss: True
[1,0]<stderr>:I0202 01:06:15.492132 140293405398848 run_pretraining.py:580]   report_loss: True
[1,0]<stderr>:INFO:tensorflow:  manual_fp16: False
[1,0]<stderr>:I0202 01:06:15.492185 140293405398848 run_pretraining.py:580]   manual_fp16: False
[1,0]<stderr>:INFO:tensorflow:  use_xla: True
[1,0]<stderr>:I0202 01:06:15.492238 140293405398848 run_pretraining.py:580]   use_xla: True
[1,0]<stderr>:INFO:tensorflow:  use_fp16: True
[1,0]<stderr>:I0202 01:06:15.492292 140293405398848 run_pretraining.py:580]   use_fp16: True
[1,0]<stderr>:INFO:tensorflow:  init_loss_scale: 4294967296
[1,0]<stderr>:I0202 01:06:15.492345 140293405398848 run_pretraining.py:580]   init_loss_scale: 4294967296
[1,0]<stderr>:INFO:tensorflow:  ?: False
[1,0]<stderr>:I0202 01:06:15.492399 140293405398848 run_pretraining.py:580]   ?: False
[1,0]<stderr>:INFO:tensorflow:  help: False
[1,0]<stderr>:I0202 01:06:15.492483 140293405398848 run_pretraining.py:580]   help: False
[1,0]<stderr>:INFO:tensorflow:  helpshort: False
[1,0]<stderr>:I0202 01:06:15.492537 140293405398848 run_pretraining.py:580]   helpshort: False
[1,0]<stderr>:INFO:tensorflow:  helpfull: False
[1,0]<stderr>:I0202 01:06:15.492591 140293405398848 run_pretraining.py:580]   helpfull: False
[1,0]<stderr>:INFO:tensorflow:  helpxml: False
[1,0]<stderr>:I0202 01:06:15.492644 140293405398848 run_pretraining.py:580]   helpxml: False
[1,0]<stderr>:INFO:tensorflow:**************************
[1,0]<stderr>:I0202 01:06:15.492692 140293405398848 run_pretraining.py:581] **************************
[1,0]<stderr>:I0202 01:14:47.355644 140293405398848 run_pretraining.py:641] -----------------------------
[1,0]<stderr>:INFO:tensorflow:Total Training Time = 511.86 for Sentences = 122880
[1,0]<stderr>:I0202 01:14:47.355755 140293405398848 run_pretraining.py:643] Total Training Time = 511.86 for Sentences = 122880
[1,0]<stderr>:INFO:tensorflow:Total Training Time W/O Overhead = 241.08 for Sentences = 99328
[1,0]<stderr>:I0202 01:14:47.355857 140293405398848 run_pretraining.py:645] Total Training Time W/O Overhead = 241.08 for Sentences = 99328
[1,0]<stderr>:INFO:tensorflow:Throughput Average (sentences/sec) with overhead = 240.07
[1,0]<stderr>:I0202 01:14:47.355936 140293405398848 run_pretraining.py:646] Throughput Average (sentences/sec) with overhead = 240.07
[1,0]<stderr>:INFO:tensorflow:Throughput Average (sentences/sec) = 412.01
[1,0]<stderr>:I0202 01:14:47.356030 140293405398848 run_pretraining.py:647] Throughput Average (sentences/sec) = 412.01
[1,0]<stdout>:DLL 2021-02-02 01:14:47.356114 -  throughput_train : 412.007 seq/s
[1,0]<stderr>:INFO:tensorflow:-----------------------------
[1,0]<stderr>:I0202 01:14:47.356315 140293405398848 run_pretraining.py:649] -----------------------------

Whenever GPUs and memory are available, training runs and finishes normally; it is just very slow. I compared against your benchmark logs and the basic configuration is essentially identical. When launching docker I also made sure to use host network mode, as required:

sudo docker run --gpus all -it --net=host  --rm -v /home/xxx/workspace/example_test/cn-wiki-128:/workspace/xxx/BERT/dataset/  3f91f9db07d4

I feel I have been following the documentation and the known pitfalls. I went through everything again yesterday and still made no progress.

Flowingsun007 commented 3 years ago

Hi, the 412 seq/s speed looks similar to our single-node single-GPU speed (with fp16 & xla). Are you sure all the GPUs are actually running? Also, what we run is BERT base pretraining; please make sure the network is configured as BERT base (check whether you accidentally ran BERT large).
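
As an illustration of that last check, something like the following could be run inside the container to confirm which model size the config file actually describes; this is only a sketch, and the script name config_check.py is hypothetical (BERT base is L-12/H-768/A-12, BERT large is L-24/H-1024/A-16):

# config_check.py (hypothetical) -- print the model size from the config
# file passed to run_pretraining.py via --bert_config_file.
import json

path = "/workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json"
with open(path) as f:
    cfg = json.load(f)

print("num_hidden_layers   =", cfg["num_hidden_layers"])     # 12 for base, 24 for large
print("hidden_size         =", cfg["hidden_size"])           # 768 for base, 1024 for large
print("num_attention_heads =", cfg["num_attention_heads"])   # 12 for base, 16 for large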

zzt941006 commented 3 years ago

Hi, the 412 seq/s speed looks similar to our single-node single-GPU speed (with fp16 & xla). Are you sure all the GPUs are actually running? Also, what we run is BERT base pretraining; please make sure the network is configured as BERT base (check whether you accidentally ran BERT large).

1. Multiple GPUs were definitely started:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    207278      C   python3                                    15289MiB |
|    1    207279      C   python3                                    15313MiB |
|    2    207280      C   python3                                    15289MiB |
|    3    207281      C   python3                                    15313MiB |
|    4    207282      C   python3                                    15313MiB |
|    5    207283      C   python3                                    15313MiB |
|    6    207284      C   python3                                    15313MiB |
|    7    207285      C   python3                                    15313MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     87780      C   python3                                    15289MiB |
|    1     87781      C   python3                                    15313MiB |
|    2     87782      C   python3                                    15289MiB |
|    3     87783      C   python3                                    15313MiB |
|    4     87784      C   python3                                    15313MiB |
|    5     87785      C   python3                                    15313MiB |
|    6     87786      C   python3                                    15313MiB |
|    7     87787      C   python3                                    15313MiB |
+-----------------------------------------------------------------------------+

All 8 GPUs on both machines are being used by me.

2. The script I ran is the one you provided, run_two_node.sh:

WORKSPACE="/workspace/bert"
DATA_DIR="/workspace/xxx/BERT/dataset/training/tfrecord"
BATCH_SIZE=${1:-32}
DTYPE=${2:-"fp32"}
USE_XLA=${3:-"false"}
NUM_TEST=${4:-5}

GPUS_PER_NODE=8
NODE1=localhost:$GPUS_PER_NODE
NODE2=10.113.216.121:$GPUS_PER_NODE
NODES=${4:-$NODE1,$NODE2}

i=1
while [ $i -le $NUM_TEST ]
do
  bash ${WORKSPACE}/scripts/multi_node_run_pretraining_adam.sh  ${DATA_DIR}  8 ${BATCH_SIZE}  120   $DTYPE   $USE_XLA    $NODES $i
  echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
  let i++
  sleep 20
done

Then in the multi script I changed the relevant paths:

DATA_DIR=/workspace/xxx/BERT/dataset/training/tfrecord
if [ "$bert_model" = "large" ] ; then
    export BERT_CONFIG=/workspace/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_config.json
else
     export BERT_CONFIG=/workspace/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/bert_config.json
    #export BERT_CONFIG=data/bert_config.json
fi

So it should indeed be running base, because uncased_L-24_H-1024_A-16 does not exist inside the docker image. We are, however, running the nvcr.io/nvidia/tensorflow:20.11-tf1-py3 image (TF version 1.15.4 + nv), which is a bit newer. Still, single-node runs with single and multiple GPUs, full and mixed precision, and different batch sizes all basically meet expectations. There is also another observation suggesting multi-node multi-GPU really did start: if any single GPU on either machine is occupied (even by only 200 MB), the run fails right away with OOM or similar. If only the host machine's single GPU had started, occupying a GPU on the other machine surely would not cause an error, right? One more thing: this image does not ship with OpenMPI, so even with passwordless login things did not work at first; following issue 75 I installed OpenMPI manually, and then it ran normally. (I feel I have already gone through every point that needs attention without any progress, which is why I am asking for help.)

Flowingsun007 commented 3 years ago

If it is convenient, could you zip up the scripts/code, including the model config, and we will take a look? You can attach it to this issue or send it to zhaoluyang@oneflow.org.

zzt941006 commented 3 years ago

If it is convenient, could you zip up the scripts/code, including the model config, and we will take a look? You can attach it to this issue or send it to zhaoluyang@oneflow.org.

That may be difficult; I need to ask for approval first. One more question: is there any way to confirm that the 16 GPUs I launched are actually training in parallel, rather than 16 GPUs running serially? I am worried that all 16 GPUs started but are running one after another instead of in parallel. Since I launched them with horovod, they should be running in parallel, right?
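
For what it is worth, Horovod data-parallel ranks synchronize at every allreduce, so if hvd.size() reports 16 during training the ranks are stepping together rather than serially; low throughput then usually points at the cross-node allreduce itself. Below is a rough micro-benchmark sketch that times a gradient-sized allreduce under the same launch command; the script name and the tensor size (a stand-in for the roughly 110M BERT-base parameters) are assumptions, not part of the repo:

# allreduce_bench.py (hypothetical) -- run with the same horovodrun command
# as training; times an allreduce roughly the size of BERT-base's gradients
# to see whether the cross-node link is the bottleneck.
import time

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# ~110M fp16 values, i.e. roughly one BERT-base gradient buffer (~220 MB).
buf = tf.ones([110 * 1000 * 1000], dtype=tf.float16)
reduced = hvd.allreduce(buf)

with tf.Session(config=config) as sess:
    sess.run(reduced)                      # warm-up: NCCL communicator setup
    start = time.time()
    iters = 10
    for _ in range(iters):
        sess.run(reduced)
    if hvd.rank() == 0:
        print("avg time per ~220MB allreduce over %d ranks: %.3f s"
              % (hvd.size(), (time.time() - start) / iters))

If each allreduce takes on the order of a second rather than a fraction of one, the low seq/s would be explained by the cross-node network (e.g. plain TCP over a slow NIC) rather than by the GPUs running serially.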