NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

[QUESTION] Training Llama3 70B on 16 x A100 achieves only ~20 TFLOP/s per GPU #1000

Closed · ZeroAGI closed this 2 months ago

ZeroAGI commented 3 months ago

Your question

Machine: 2 nodes × 8 A100 (16 GPUs total)

TP=8 PP=2 DP=1 CP=1 seq_length=4096 micro_batch_size=1 global_batch_size=1

Enabled: activation recomputation, flash attention, and the distributed optimizer

Megatron version: core_v0.7.0

Thanks for your help!
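For concreteness, the setup above presumably maps to launch flags along these lines (a reconstruction from the description and the argument dump below, not the actual command; model, tokenizer, and data flags are omitted):

```bash
# Reconstruction (assumed, not the poster's actual script): 2 nodes x 8 A100, TP=8 / PP=2.
torchrun --nnodes 2 --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 2 \
    --seq-length 4096 \
    --micro-batch-size 1 \
    --global-batch-size 1 \
    --recompute-granularity full --recompute-method uniform --recompute-num-layers 1 \
    --use-flash-attn \
    --use-distributed-optimizer \
    ...
```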

ZeroAGI commented 3 months ago

arguments:

using world size: 16, data-parallel size: 1, context-parallel size: 1, tensor-model-parallel size: 8, pipeline-model-parallel size: 2
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:Llama3Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... False
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_save ...................................... None
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... True
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  calculate_per_token_loss ........................ False
  check_for_nan_in_loss_and_grad .................. True
  check_weight_hash_across_dp_replicas_interval ... None
  ckpt_assume_constant_structure .................. False
  ckpt_fully_parallel_load ........................ False
  ckpt_fully_parallel_save ........................ False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['/mnt/project/new/Megatron-LM/my-llama_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  ddp_bucket_size ................................. None
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  disable_straggler_on_startup .................... False
  dist_ckpt_format ................................ torch_dist
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 60
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 80
  encoder_seq_length .............................. 4096
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 0
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... True
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 28672
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 8192
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.006
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  iteration ....................................... 1
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ /mnt/model/megatron/pp2/Meta-Llama-3.1-70B
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... False
  log_straggler ................................... False
  log_throughput .................................. True
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 6e-05
  lr_decay_iters .................................. 430000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.001
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... False
  max_position_embeddings ......................... 4096
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 6e-06
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0
  moe_expert_capacity_factor ...................... None
  moe_extended_tp ................................. False
  moe_grouped_gemm ................................ False
  moe_input_jitter_eps ............................ None
  moe_layer_recompute ............................. False
  moe_pad_expert_input_to_capacity ................ False
  moe_per_layer_logging ........................... False
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dispatcher_type ....................... allgather
  moe_token_drop_policy ........................... probs
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 64
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_dataset_builder_threads ..................... 1
  num_experts ..................................... None
  num_layers ...................................... 80
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 8
  num_workers ..................................... 8
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. True
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ True
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 129024
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 2
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  pretrained_checkpoint ........................... None
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... full
  recompute_method ................................ uniform
  recompute_num_layers ............................ 1
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ /mnt/model/megatron_save/Meta-Llama-3.1-70B
  save_interval ................................... 500
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 4096
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 969, 30, 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  straggler_ctrlr_port ............................ 65535
  straggler_minmax_count .......................... 1
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 8
  tensorboard_dir ................................. ./tensorboard
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. /mnt/model/Meta-Llama-3.1-70B/original/tokenizer.model
  tokenizer_type .................................. Llama3Tokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_overlap_rs_dgrad ........................ False
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 1000
  train_samples ................................... None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 2
  untie_embeddings_and_output_weights ............. True
  use_checkpoint_args ............................. True
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_dist_ckpt ................................... False
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. True
  use_mcore_models ................................ True
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. True
  use_tp_pp_dp_mapping ............................ False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  world_size ...................................... 16
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
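Three settings in this dump stand out against the reported throughput: params_dtype is torch.float32 (both bf16 and fp16 are False); global_batch_size is 1, so PP=2 runs a single microbatch per step and one pipeline stage sits idle roughly half the time; and recompute_granularity is full, which re-executes every layer's forward pass during the backward pass. A sketch of the corresponding flag changes follows. Flag names are from core_v0.7.0; the global batch size of 128 is an arbitrary illustration, and selective recompute keeps more activations resident, so memory headroom would need rechecking:

```bash
# Sketch, not a verified fix: same TP=8/PP=2 layout, but bf16 weights, many
# microbatches per step (shrinks the pipeline bubble), and selective recompute
# (checkpoints only core attention instead of whole layers).
torchrun --nnodes 2 --nproc_per_node 8 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 2 \
    --sequence-parallel \
    --seq-length 4096 \
    --micro-batch-size 1 \
    --global-batch-size 128 \
    --bf16 \
    --recompute-granularity selective \
    --use-flash-attn \
    --use-distributed-optimizer \
    ...
```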

throughput log:

[2024-08-13 08:39:36] iteration        2/    1000 | consumed samples:            2 | elapsed time per iteration (ms): 16842.7 | throughput per GPU (TFLOP/s/GPU): 6.8 | learning rate: 1.395349E-07 | global batch size:     1 | lm loss: 7.351685E+00 | loss scale: 1.0 | grad norm: 1233.491 | number of skipped iterations:   0 | number of nan iterations:   0 |
[Rank 10] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0
[Rank 12] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0
[Rank 14] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0

[Rank 15] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0
[Rank 8] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0
[Rank 11] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0
[Rank 9] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0
[Rank 13] (after 2 iterations) memory (MB) | allocated: 67502.37890625 | max allocated: 67502.3798828125 | reserved: 67718.0 | max reserved: 67718.0

 [2024-08-13 08:39:42] iteration        3/    1000 | consumed samples:            3 | elapsed time per iteration (ms): 5670.8 | throughput per GPU (TFLOP/s/GPU): 20.3 | learning rate: 2.790698E-07 | global batch size:     1 | lm loss: 1.018593E+01 | loss scale: 1.0 | grad norm: 490.142 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:39:48] iteration        4/    1000 | consumed samples:            4 | elapsed time per iteration (ms): 5665.2 | throughput per GPU (TFLOP/s/GPU): 20.3 | learning rate: 4.186047E-07 | global batch size:     1 | lm loss: 1.057528E+01 | loss scale: 1.0 | grad norm: 2210.661 | number of skipped iterations:   0 | number of nan iterations:   0 |
[2024-08-13 08:39:53] iteration        5/    1000 | consumed samples:            5 | elapsed time per iteration (ms): 5670.9 | throughput per GPU (TFLOP/s/GPU): 20.3 | learning rate: 5.581395E-07 | global batch size:     1 | lm loss: 7.973228E+00 | loss scale: 1.0 | grad norm: 3340.921 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:39:59] iteration        6/    1000 | consumed samples:            6 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 6.976744E-07 | global batch size:     1 | lm loss: 7.505680E+00 | loss scale: 1.0 | grad norm: 167538.108 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:05] iteration        7/    1000 | consumed samples:            7 | elapsed time per iteration (ms): 5638.0 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 8.372093E-07 | global batch size:     1 | lm loss: 7.009795E+00 | loss scale: 1.0 | grad norm: 3628.168 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:10] iteration        8/    1000 | consumed samples:            8 | elapsed time per iteration (ms): 5659.3 | throughput per GPU (TFLOP/s/GPU): 20.3 | learning rate: 9.767442E-07 | global batch size:     1 | lm loss: 7.338855E+00 | loss scale: 1.0 | grad norm: 1017.101 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:16] iteration        9/    1000 | consumed samples:            9 | elapsed time per iteration (ms): 5661.4 | throughput per GPU (TFLOP/s/GPU): 20.3 | learning rate: 1.116279E-06 | global batch size:     1 | lm loss: 6.990343E+00 | loss scale: 1.0 | grad norm: 10533.145 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:21] iteration       10/    1000 | consumed samples:           10 | elapsed time per iteration (ms): 5641.5 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 1.255814E-06 | global batch size:     1 | lm loss: 6.619144E+00 | loss scale: 1.0 | grad norm: 575.851 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:27] iteration       11/    1000 | consumed samples:           11 | elapsed time per iteration (ms): 5676.5 | throughput per GPU (TFLOP/s/GPU): 20.3 | learning rate: 1.395349E-06 | global batch size:     1 | lm loss: 6.464733E+00 | loss scale: 1.0 | grad norm: 137.674 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:33] iteration       12/    1000 | consumed samples:           12 | elapsed time per iteration (ms): 5639.9 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 1.534884E-06 | global batch size:     1 | lm loss: 6.594772E+00 | loss scale: 1.0 | grad norm: 180.320 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:38] iteration       13/    1000 | consumed samples:           13 | elapsed time per iteration (ms): 5640.0 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 1.674419E-06 | global batch size:     1 | lm loss: 8.637538E+00 | loss scale: 1.0 | grad norm: 412.995 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:44] iteration       14/    1000 | consumed samples:           14 | elapsed time per iteration (ms): 5642.0 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 1.813953E-06 | global batch size:     1 | lm loss: 6.271463E+00 | loss scale: 1.0 | grad norm: 954.721 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:50] iteration       15/    1000 | consumed samples:           15 | elapsed time per iteration (ms): 5640.8 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 1.953488E-06 | global batch size:     1 | lm loss: 6.355348E+00 | loss scale: 1.0 | grad norm: 423.392 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:40:55] iteration       16/    1000 | consumed samples:           16 | elapsed time per iteration (ms): 5649.1 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 2.093023E-06 | global batch size:     1 | lm loss: 6.444792E+00 | loss scale: 1.0 | grad norm: 1207.969 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:41:01] iteration       17/    1000 | consumed samples:           17 | elapsed time per iteration (ms): 5643.2 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 2.232558E-06 | global batch size:     1 | lm loss: 6.320790E+00 | loss scale: 1.0 | grad norm: 326.189 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:41:07] iteration       18/    1000 | consumed samples:           18 | elapsed time per iteration (ms): 5706.3 | throughput per GPU (TFLOP/s/GPU): 20.2 | learning rate: 2.372093E-06 | global batch size:     1 | lm loss: 5.967429E+00 | loss scale: 1.0 | grad norm: 5419.453 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:41:12] iteration       19/    1000 | consumed samples:           19 | elapsed time per iteration (ms): 5703.5 | throughput per GPU (TFLOP/s/GPU): 20.2 | learning rate: 2.511628E-06 | global batch size:     1 | lm loss: 6.645396E+00 | loss scale: 1.0 | grad norm: 65225.350 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-08-13 08:41:18] iteration       20/    1000 | consumed samples:           20 | elapsed time per iteration (ms): 5642.5 | throughput per GPU (TFLOP/s/GPU): 20.4 | learning rate: 2.651163E-06 | global batch size:     1 | lm loss: 6.230913E+00 | loss scale: 1.0 | grad norm: 547.760 | number of skipped iterations:   0 | number of nan iterations:   0 |
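The logged figure is self-consistent with the configuration: using the common 6·N·T FLOPs-per-step approximation (N ≈ 70e9 parameters, T = 4096 tokens at global_batch_size 1), one step costs about 6 × 70e9 × 4096 ≈ 1.7e15 FLOPs, which over 16 GPUs at ~5.65 s/iteration works out to ≈ 19 TFLOP/s/GPU; Megatron's own accounting also counts attention and the LM head, hence the slightly higher 20.3 in the log. That is about 6.5% of an A100's 312 TFLOPS bf16 peak, and still only ~13% of the 156 TFLOPS TF32 peak this fp32 run is bound by, so most of the step time is going somewhere other than GEMMs. The argument dump already has the profiling hooks wired up; here is a sketch of how one might capture a timeline of steps 10-12 on rank 0 with Nsight Systems (exact nsys options vary by version, and across two nodes each launcher would need its own wrapper):

```bash
# Sketch: --profile makes Megatron call cudaProfilerStart/Stop around the
# chosen steps, which --capture-range=cudaProfilerApi picks up.
nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi \
    torchrun --nnodes 2 --nproc_per_node 8 pretrain_gpt.py \
        --profile --profile-step-start 10 --profile-step-end 12 --profile-ranks 0 \
        ...
```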