NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Training GPT2 on V100 is very slow. #449

Closed SefaZeng closed 2 months ago

SefaZeng commented 1 year ago

Your question: I am trying to train a model with 600M parameters and a 250k vocab size, so the transformer configuration matches a pure-English model of about 300M parameters. I train on 64 (8x8) 32GB V100 GPUs with global_batch_size 256 and seq_length 2048, i.e. roughly 0.5M tokens per iteration. I find the elapsed time is about 4.3 s per iteration. Is this per-iteration time reasonable, or is it slow? If my calculation is correct: `1e12 / 500000 * 4.3 / 3600 / 24 = 99.53`, does that mean I would need about 99 days to train a model with only 600M parameters on 1 trillion tokens? The training log is as follows:

 iteration      100/  500000 | consumed samples:        25600 | elapsed time per iteration (ms): 4320.1 | learning rate: 3.320E-05 | global batch size:   256 | lm loss: 1.104719E+01 | loss scale: 65536.0 | grad norm: 3.903 | number of skipped iterations:  17 | number of nan iterations:   0 |
 iteration      200/  500000 | consumed samples:        51200 | elapsed time per iteration (ms): 4270.8 | learning rate: 7.320E-05 | global batch size:   256 | lm loss: 7.803114E+00 | loss scale: 65536.0 | grad norm: 2.887 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration      300/  500000 | consumed samples:        76800 | elapsed time per iteration (ms): 4289.1 | learning rate: 1.132E-04 | global batch size:   256 | lm loss: 6.075205E+00 | loss scale: 65536.0 | grad norm: 2.428 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration      400/  500000 | consumed samples:       102400 | elapsed time per iteration (ms): 4261.6 | learning rate: 1.532E-04 | global batch size:   256 | lm loss: 5.437858E+00 | loss scale: 65536.0 | grad norm: 1.191 | number of skipped iterations:   0 | number of nan iterations:   0 |
 iteration      500/  500000 | consumed samples:       128000 | elapsed time per iteration (ms): 4285.1 | learning rate: 1.932E-04 | global batch size:   256 | lm loss: 5.035889E+00 | loss scale: 65536.0 | grad norm: 1.510 | number of skipped iterations:   0 | number of nan iterations:   0 |
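
As a sanity check on the arithmetic, here is a minimal sketch of the wall-clock estimate (the batch size, sequence length, and 4.3 s/iteration figures are taken from above; the 1e12 token budget is the assumed training target):

```python
# Rough wall-clock estimate for the training run described above.
tokens_total = 1e12                                 # assumed target token budget (1T tokens)
global_batch_size = 256
seq_length = 2048
sec_per_iter = 4.3                                  # observed elapsed time per iteration

tokens_per_iter = global_batch_size * seq_length    # 524,288 (~0.5M) tokens
iters_needed = tokens_total / tokens_per_iter       # ~1.9M iterations
days = iters_needed * sec_per_iter / 3600 / 24

print(f"tokens/iter: {tokens_per_iter:,}")
print(f"iterations:  {iters_needed:,.0f}")
print(f"days:        {days:.1f}")                   # ~95 days; ~99 if tokens/iter is rounded to 0.5M
```
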
SefaZeng commented 1 year ago

The following are the arguments:

------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_cache_path ................................. None
  data_impl ....................................... mmap
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 16
  data_path ....................................... ['1', '/pile/megatron_bin/pile_00_text_document', '1', '/pile/megatron_bin/pile_01_text_document', '1', '/pile/megatron_bin/pile_02_text_document', '1', '/pile/megatron_bin/pile_03_text_document', '1', '/pile/megatron_bin/pile_04_text_document', '1', '/pile/megatron_bin/pile_05_text_document', '1', '/pile/megatron_bin/pile_07_text_document', '1']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 100
  embedding_path .................................. None
  embedding_weights_in_fp32 ....................... False
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 24
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 4096
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_e4m3 ........................................ False
  fp8_hybrid ...................................... False
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 256
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1024
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.006
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 64
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ /Megatron-LM/checkpoints/baseline
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0003
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 750
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  master_addr ..................................... 11.214.159.213
  master_port ..................................... 32307
  max_position_embeddings ......................... 2048
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 4
  min_loss_scale .................................. 1.0
  min_lr .......................................... 3e-05
  mmap_warmup ..................................... False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 16
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_p2p_comm ................................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 2
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ 1
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  sample_rate ..................................... 1.0
  save ............................................ /Megatron-LM/checkpoints/baseline
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  split ........................................... 949,50,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. /Megatron-LM/../mt5/spiece.model
  tokenizer_type .................................. MT5Tokenizer
  train_data_path ................................. None
  train_iters ..................................... 500000
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 2
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. True
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /mt5/vocab.txt
  vocab_size ...................................... None
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  world_size ...................................... 64
-------------------- end of arguments ---------------------
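
As a rough cross-check of the 600M-parameter figure: with hidden_size 1024, 24 layers, and a ~250k vocab, a standard GPT parameter-count approximation lands in the same ballpark. A minimal sketch, using the usual 12*l*h^2 transformer-block estimate plus the embedding table (ignoring biases, layernorm weights, and vocab padding):

```python
# Approximate parameter count from the arguments above.
hidden_size = 1024
num_layers = 24
vocab_size = 250_000            # approximate multilingual vocab from the question

transformer_params = 12 * num_layers * hidden_size ** 2   # attention + MLP blocks (ffn = 4h)
embedding_params = vocab_size * hidden_size                # token embeddings (output weights tied)
# RoPE position embeddings add no learned parameters.

total = transformer_params + embedding_params
print(f"~{total / 1e6:.0f}M parameters")                   # ~560M, roughly the 600M quoted
```
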
jon-barker commented 1 year ago

My initial reaction is that this might be a reasonable per-iteration time for V100, since on A100 it's typical to see per-iteration times around 1 s. Unfortunately I don't have a V100 I can test on, but I'll try to replicate the configuration on A100 and let you know what I see.

SefaZeng commented 1 year ago

> My initial reaction is that this might be a reasonable per-iteration time for V100, since on A100 it's typical to see per-iteration times around 1 s. Unfortunately I don't have a V100 I can test on, but I'll try to replicate the configuration on A100 and let you know what I see.

Thank you for your reply! Does this mean that training a GPT2 with 350M parameters on 1 trillion tokens (a standard budget for today's LLMs) with 64 32GB V100s would take about 90 days? That's a bit of a shock to me...

SefaZeng commented 1 year ago

Another question: why is the memory usage so low? Each dataset shard is 40 GB and there are 30 shards, yet the memory usage on each machine is only about 16~20 GB.

deepakn94 commented 1 year ago

I noticed you are using tensor_model_parallel_size=2 and pipeline_model_parallel_size=2. You shouldn't need these for the scale of model you are training.

I also used the formula in https://arxiv.org/pdf/2104.04473.pdf to estimate the throughput you are observing. It seems to be about `72 * 256 * 2048 * 24 * 1024 * 1024 / (4.3 * 64)` = 3.5 Teraflop/s/GPU, which is a very small fraction of peak V100 device throughput (130 Teraflop/s). Something seems wrong here; the first thing I would try is reducing tensor_model_parallel_size and pipeline_model_parallel_size to 1.
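
For reference, a minimal sketch of that throughput estimate. It follows the simplified form used in the comment above (72 * B * s * l * h^2 FLOPs per iteration, i.e. forward plus backward without activation recomputation, and ignoring the attention and vocab correction terms of the paper's full formula):

```python
# Achieved per-GPU throughput implied by the observed iteration time.
global_batch_size = 256   # B
seq_length = 2048         # s
num_layers = 24           # l
hidden_size = 1024        # h
sec_per_iter = 4.3        # observed
num_gpus = 64             # world_size

flops_per_iter = 72 * global_batch_size * seq_length * num_layers * hidden_size ** 2
tflops_per_gpu = flops_per_iter / (sec_per_iter * num_gpus) / 1e12

print(f"achieved throughput: {tflops_per_gpu:.1f} TFLOP/s per GPU")  # ~3.5
# Peak V100 tensor-core throughput is on the order of 125-130 TFLOP/s,
# so this is only a few percent of peak.
```
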

github-actions[bot] commented 11 months ago

Marking as stale. No activity in 60 days.