NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Get an AttributeError when trying to finetune the llama3-8B model with multiple nodes #937

Open · nakroy opened this issue 1 month ago

nakroy commented 1 month ago

**Describe the bug**
I am trying to finetune the llama3-8B model across multiple nodes, but I get an AttributeError once the mcore-format checkpoint has finished loading and dataset building starts. The error is:

AttributeError: '_Llama3Tokenizer' object has no attribute 'unique_identifiers'

**To Reproduce**

  1. The finetuning dataset I use was downloaded from https://huggingface.co/datasets/tatsu-lab/alpaca/blob/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet. I converted it into JSON format after downloading it (a rough conversion sketch is included after this list).

  2. The preprocessing script I used is as follows:

```shell
INPUT_FILE=/workspace/dataset/finetune_dataset/train-00000-of-00001-a09b74b3ef9c3b56.json
MODEL_PATH=/workspace/model_weights/llama3-8b
TOKENIZER_MODEL=${MODEL_PATH}/original/tokenizer.model
OUTPUT_DIR=/workspace/dataset/finetune_dataset/llama3-8b
OUTPUT_PREFIX=${OUTPUT_DIR}/alpaca
TOKENIZER_TYPE=Llama3Tokenizer

mkdir -p ${OUTPUT_DIR}

python ./tools/preprocess_data.py \
    --input ${INPUT_FILE} \
    --output-prefix ${OUTPUT_PREFIX} \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type ${TOKENIZER_TYPE} \
    --append-eod
```


3. After preprocessing the dataset, I started training with 2 nodes. The checkpoints loaded successfully, but I ran into the problem below while the Megatron dataset was being built.
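
For reference, a rough sketch of the parquet-to-JSON conversion from step 1 (the exact script may have differed; this assumes the Alpaca `text` column and these file names):

```python
# Rough sketch (assumed file names / column): convert the Alpaca parquet file into
# loose JSON (one JSON object per line), the input format tools/preprocess_data.py
# reads; by default it tokenizes the field named by --json-keys, i.e. "text".
import json

import pandas as pd

df = pd.read_parquet("train-00000-of-00001-a09b74b3ef9c3b56.parquet")

with open("train-00000-of-00001-a09b74b3ef9c3b56.json", "w") as f:
    for text in df["text"]:
        # Each line becomes one document for the dataset index builder.
        f.write(json.dumps({"text": text}) + "\n")
```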

**Stack trace/logs**
The error log is shown as follows:
```shell
torchrun --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.0.1.6 --master_port 6543 /workspace/megatron/pretrain_gpt.py --tensor-model-parallel-size 8 --pipeline-model-parallel-size 2 --distributed-backend nccl --use-distributed-optimizer --sequence-parallel --overlap-grad-reduce --num-layers 32 --hidden-size 4096 --num-attention-heads 32 --group-query-attention --num-query-groups 8 --ffn-hidden-size 14336 --position-embedding-type rope --use-rotary-position-embeddings --rotary-base 500000 --max-position-embeddings 8192 --make-vocab-size-divisible-by 16128 --norm-epsilon 1e-5 --normalization RMSNorm --swiglu --untie-embeddings-and-output-weights --use-flash-attn --attention-softmax-in-fp32 --log-timers-to-tensorboard --log-validation-ppl-to-tensorboard --log-memory-to-tensorboard --log-interval 1 --attention-dropout 0.0 --hidden-dropout 0.0 --weight-decay 1e-1 --clip-grad 1.0 --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --no-gradient-accumulation-fusion --micro-batch-size 8 --global-batch-size 256 --train-iters 200 --disable-bias-linear --no-bias-gelu-fusion --optimizer adam --recompute-activations --recompute-granularity selective --seed 2024 --init-method-std 0.01 --initial-loss-scale 4096 --lr 1.25e-6 --lr-decay-style cosine --lr-warmup-fraction 0.01 --min-lr 1.25e-7 --weight-decay 1e-1 --load /workspace/model_weights/llama3-8b-tp8-pp2 --finetune --no-load-optim --no-load-rng --save /workspace/megatron_train_result/ckpt/llama3-8B_pretrain_WS16_TP8_PP2 --save-interval 200 --bf16 --eval-interval 100 --eval-iters 10 --data-path /workspace/dataset/finetune_dataset/llama3-8b/alpaca_text_document --split 949,50,1 --seq-length 8192 --num-workers 0 --tokenizer-type Llama3Tokenizer --tokenizer-model /workspace/model_weights/llama3-8b/original/tokenizer.model
using world size: 16, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 8, pipeline-model-parallel size: 2 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:Llama3Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_save ...................................... None
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... True
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  calculate_per_token_loss ........................ False
  check_for_nan_in_loss_and_grad .................. True
  check_weight_hash_across_dp_replicas_interval ... None
  ckpt_assume_constant_structure .................. False
  ckpt_fully_parallel_load ........................ False
  ckpt_fully_parallel_save ........................ False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  cross_entropy_loss_fusion ....................... False
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['/workspace/dataset/finetune_dataset/llama3-8b/alpaca_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  ddp_average_in_collective ....................... False
  ddp_bucket_size ................................. None
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  deprecated_use_mcore_models ..................... False
  deterministic_mode .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  disable_straggler_on_startup .................... False
  dist_ckpt_format ................................ torch_dist
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 8192
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  eval_interval ................................... 100
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 14336
  finetune ........................................ True
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 256
  gradient_accumulation_fusion .................... False
  group_query_attention ........................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 4096
  hybrid_attention_ratio .......................... 0.0
  hybrid_mlp_ratio ................................ 0.0
  hybrid_override_pattern ......................... None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.01
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4096.0
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ /workspace/model_weights/llama3-8b-tp8-pp2
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... True
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... False
  log_straggler ................................... False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... True
  log_validation_ppl_to_tensorboard ............... True
  log_world_size_to_tensorboard ................... False
  logging_level ................................... None
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 1.25e-06
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. 0.01
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  lr_wsd_decay_iters .............................. None
  lr_wsd_decay_samples ............................ None
  lr_wsd_decay_style .............................. exponential
  make_vocab_size_divisible_by .................... 16128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 8192
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 8
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1.25e-07
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.0
  moe_expert_capacity_factor ...................... None
  moe_extended_tp ................................. False
  moe_grouped_gemm ................................ False
  moe_input_jitter_eps ............................ None
  moe_layer_recompute ............................. False
  moe_pad_expert_input_to_capacity ................ False
  moe_per_layer_logging ........................... False
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dispatcher_type ....................... allgather
  moe_token_drop_policy ........................... probs
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_dataset_builder_threads ..................... 1
  num_experts ..................................... None
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 8
  num_workers ..................................... 0
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. True
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 2
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  pretrained_checkpoint ........................... None
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... selective
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_base ..................................... 500000
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ /workspace/megatron_train_result/ckpt/llama3-8B_pretrain_WS16_TP8_PP2
  save_interval ................................... 200
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 2024
  seq_length ...................................... 8192
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 949,50,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  straggler_ctrlr_port ............................ 65535
  straggler_minmax_count .......................... 1
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 8
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. /workspace/model_weights/llama3-8b/original/tokenizer.model
  tokenizer_type .................................. Llama3Tokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_overlap_rs_dgrad ........................ False
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 200
  train_samples ................................... None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 2
  untie_embeddings_and_output_weights ............. True
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_dist_ckpt ................................... False
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. True
  use_legacy_models ............................... False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. True
  use_tp_pp_dp_mapping ............................ False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  world_size ...................................... 16
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 32
> building Llama3Tokenizer tokenizer ...
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
 > padded vocab (size: 128256) with 768 dummy tokens (new size: 129024)
> initializing torch distributed ...
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
> initialized tensor model parallel with size 8
> initialized pipeline model parallel with size 2
> setting random seeds to 2024 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/megatron/megatron/core/datasets'
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
INFO:llama.tokenizer:Reloaded tiktoken model from /workspace/model_weights/llama3-8b/original/tokenizer.model
INFO:llama.tokenizer:#words: 128256 - BOS ID: 128000 - EOS ID: 128001
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.935 seconds
> compiling and loading fused kernels ...
NCCL version 2.21.5+cuda12.5
>>> done with compiling and loading fused kernels. Compilation time: 9.434 seconds
[rank1]:[W719 08:56:38.529363367 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W719 08:56:38.529412190 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank2]:[W719 08:56:38.529412186 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank3]:[W719 08:56:38.529511793 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank5]:[W719 08:56:38.529562878 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank4]:[W719 08:56:38.529565198 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank7]:[W719 08:56:38.529627099 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank6]:[W719 08:56:38.529662246 init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 14.767
[after megatron is initialized] datetime: 2024-07-19 08:56:42 
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (4, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (5, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (6, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 502398976
 > number of parameters on (tensor, pipeline) model parallel rank (7, 0): 502398976
INFO:megatron.core.distributed.distributed_data_parallel:Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=True, use_distributed_optimizer=True, check_for_nan_in_grad=True, bucket_size=40000000, average_in_collective=False)
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 9
Params for bucket 1 (49291264 elements):
        module.decoder.layers.15.mlp.linear_fc2.weight
        module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.15.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.15.mlp.linear_fc1.weight
        module.decoder.layers.15.self_attention.linear_qkv.weight
        module.decoder.layers.14.mlp.linear_fc2.weight
        module.decoder.layers.14.mlp.linear_fc1.weight
        module.decoder.layers.15.self_attention.linear_proj.weight
Params for bucket 2 (54542336 elements):
        module.decoder.layers.14.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.14.self_attention.linear_qkv.weight
        module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.13.mlp.linear_fc2.weight
        module.decoder.layers.13.mlp.linear_fc1.weight
        module.decoder.layers.13.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.13.self_attention.linear_qkv.weight
        module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.12.mlp.linear_fc1.weight
        module.decoder.layers.14.self_attention.linear_proj.weight
        module.decoder.layers.13.self_attention.linear_proj.weight
        module.decoder.layers.12.mlp.linear_fc2.weight
Params for bucket 3 (54542336 elements):
        module.decoder.layers.12.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.11.mlp.linear_fc2.weight
        module.decoder.layers.12.self_attention.linear_qkv.weight
        module.decoder.layers.11.mlp.linear_fc1.weight
        module.decoder.layers.11.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.11.self_attention.linear_qkv.weight
        module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.10.mlp.linear_fc2.weight
        module.decoder.layers.10.mlp.linear_fc1.weight
        module.decoder.layers.12.self_attention.linear_proj.weight
        module.decoder.layers.11.self_attention.linear_proj.weight
Params for bucket 4 (54542336 elements):
        module.decoder.layers.10.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.10.self_attention.linear_qkv.weight
        module.decoder.layers.10.self_attention.linear_proj.weight
        module.decoder.layers.9.self_attention.linear_proj.weight
        module.decoder.layers.9.mlp.linear_fc2.weight
        module.decoder.layers.9.mlp.linear_fc1.weight
        module.decoder.layers.9.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.9.self_attention.linear_qkv.weight
        module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.8.mlp.linear_fc2.weight
        module.decoder.layers.8.mlp.linear_fc1.weight
Params for bucket 5 (54542336 elements):
        module.decoder.layers.8.self_attention.linear_proj.weight
        module.decoder.layers.7.self_attention.linear_proj.weight
        module.decoder.layers.8.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.8.self_attention.linear_qkv.weight
        module.decoder.layers.7.mlp.linear_fc2.weight
        module.decoder.layers.7.mlp.linear_fc1.weight
        module.decoder.layers.7.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.7.self_attention.linear_qkv.weight
        module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.6.mlp.linear_fc2.weight
        module.decoder.layers.6.mlp.linear_fc1.weight
Params for bucket 6 (54542336 elements):
        module.decoder.layers.6.self_attention.linear_proj.weight
        module.decoder.layers.5.self_attention.linear_proj.weight
        module.decoder.layers.6.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.5.mlp.linear_fc2.weight
        module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.6.self_attention.linear_qkv.weight
        module.decoder.layers.5.mlp.linear_fc1.weight
        module.decoder.layers.5.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.5.self_attention.linear_qkv.weight
        module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.4.mlp.linear_fc2.weight
        module.decoder.layers.4.mlp.linear_fc1.weight
Params for bucket 7 (54542336 elements):
        module.decoder.layers.4.self_attention.linear_qkv.weight
        module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.3.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.2.mlp.linear_fc2.weight
        module.decoder.layers.2.mlp.linear_fc1.weight
        module.decoder.layers.4.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.4.self_attention.linear_proj.weight
        module.decoder.layers.3.mlp.linear_fc2.weight
        module.decoder.layers.3.mlp.linear_fc1.weight
        module.decoder.layers.3.self_attention.linear_qkv.weight
        module.decoder.layers.3.self_attention.linear_proj.weight
Params for bucket 8 (54542336 elements):
        module.decoder.layers.2.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.2.self_attention.linear_proj.weight
        module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.2.self_attention.linear_qkv.weight
        module.decoder.layers.1.mlp.linear_fc2.weight
        module.decoder.layers.1.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.1.self_attention.linear_proj.weight
        module.decoder.layers.1.mlp.linear_fc1.weight
        module.decoder.layers.1.self_attention.linear_qkv.weight
        module.decoder.layers.0.mlp.linear_fc2.weight
        module.decoder.layers.0.mlp.linear_fc1.weight
Params for bucket 9 (71311360 elements):
        module.decoder.layers.0.self_attention.linear_proj.weight
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
        module.decoder.layers.0.mlp.linear_fc1.layer_norm_weight
        module.decoder.layers.0.self_attention.linear_qkv.weight
        module.embedding.word_embeddings.weight
INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=1.25e-06, min_lr=1.25e-07, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4096.0, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=True, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7fca9613fb20>)
> learning rate decay style: cosine
 loading checkpoint from /workspace/model_weights/llama3-8b-tp8-pp2 at iteration 1
could not find arguments in the checkpoint ...
 checkpoint version 3.0
  successfully loaded checkpoint from /workspace/model_weights/llama3-8b-tp8-pp2 [ t 0, p 0 ] at iteration 0
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-07-19 08:56:43 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      51200
    validation: 7680
    test:       2560
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.949), (0.949, 0.999), (0.999, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.blended_megatron_dataset_builder:Building dataset splits with cls=GPTDataset, sizes=(51200, 7680, 2560), and config=GPTDatasetConfig(random_seed=2024, sequence_length=8192, blend=(['/workspace/dataset/finetune_dataset/llama3-8b/alpaca_text_document'], None), blend_per_split=[None, None, None], split='949,50,1', split_matrix=[(0, 0.949), (0.949, 0.999), (0.999, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<megatron.training.tokenizer.tokenizer.create_llama3_tokenizer.<locals>._Llama3Tokenizer object at 0x7fca952132b0>, reset_position_ids=False, reset_attention_mask=False, eod_mask_loss=False, create_attention_mask=True, drop_last_partial_validation_sequence=True, add_extra_token_to_sequence=True)
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from /workspace/dataset/finetune_dataset/llama3-8b/alpaca_text_document.idx
INFO:megatron.core.datasets.indexed_dataset:    Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:    Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:    Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 52002
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 52002
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/megatron/pretrain_gpt.py", line 243, in <module>
[rank0]:     pretrain(
[rank0]:   File "/workspace/megatron/megatron/training/training.py", line 251, in pretrain
[rank0]:     = build_train_valid_test_data_iterators(
[rank0]:   File "/workspace/megatron/megatron/training/training.py", line 1467, in build_train_valid_test_data_iterators
[rank0]:     build_train_valid_test_data_loaders(
[rank0]:   File "/workspace/megatron/megatron/training/training.py", line 1428, in build_train_valid_test_data_loaders
[rank0]:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
[rank0]:   File "/workspace/megatron/megatron/training/training.py", line 1398, in build_train_valid_test_datasets
[rank0]:     return build_train_valid_test_datasets_provider(train_valid_test_num_samples)
[rank0]:   File "/workspace/megatron/pretrain_gpt.py", line 231, in train_valid_test_datasets_provider
[rank0]:     ).build()
[rank0]:   File "/workspace/megatron/megatron/core/datasets/blended_megatron_dataset_builder.py", line 126, in build
[rank0]:     datasets = self._build_blended_dataset_splits()
[rank0]:   File "/workspace/megatron/megatron/core/datasets/blended_megatron_dataset_builder.py", line 175, in _build_blended_dataset_splits
[rank0]:     return self._build_megatron_dataset_splits(prefixes[0], split, self.sizes)
[rank0]:   File "/workspace/megatron/megatron/core/datasets/blended_megatron_dataset_builder.py", line 408, in _build_megatron_dataset_splits
[rank0]:     self.build_generic_dataset(
[rank0]:   File "/workspace/megatron/megatron/core/datasets/blended_megatron_dataset_builder.py", line 456, in build_generic_dataset
[rank0]:     dataset = cls(*args)
[rank0]:   File "/workspace/megatron/megatron/core/datasets/gpt_dataset.py", line 88, in __init__
[rank0]:     super().__init__(
[rank0]:   File "/workspace/megatron/megatron/core/datasets/megatron_dataset.py", line 61, in __init__
[rank0]:     self.unique_description = json.dumps(
[rank0]:   File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
[rank0]:     **kw).encode(obj)
[rank0]:   File "/usr/lib/python3.10/json/encoder.py", line 201, in encode
[rank0]:     chunks = list(chunks)
[rank0]:   File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
[rank0]:     yield from _iterencode_dict(o, _current_indent_level)
[rank0]:   File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
[rank0]:     yield from chunks
[rank0]:   File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
[rank0]:     o = _default(o)
[rank0]:   File "/workspace/megatron/megatron/core/datasets/megatron_dataset.py", line 62, in <lambda>
[rank0]:     self.unique_identifiers, indent=4, default=lambda obj: obj.unique_identifiers
[rank0]: AttributeError: '_Llama3Tokenizer' object has no attribute 'unique_identifiers'
```

**Environment (please complete the following information):**

NEU-rzh commented 1 month ago

I switched to using 'HuggingFaceTokenizer' as the 'tokenizer-type' argument, but that runs into some other bugs.

mtian8 commented 1 month ago

The problem is that unique_identifiers is not implemented in Llama3Tokenizer, which does not inherit from MegatronTokenizer. Changing the lines https://github.com/NVIDIA/Megatron-LM/blob/9bcd4175becc515331537f0c78eb70079de0eaa8/megatron/training/tokenizer/tokenizer.py#L567-L569 to the following should solve the problem:

    class _Llama3Tokenizer(Llama3Tokenizer):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Replicate the bookkeeping MegatronTokenizer.__init__ does, so that
            # megatron_dataset.py can serialize the tokenizer via unique_identifiers.
            self.unique_identifiers = OrderedDict()
            self.unique_identifiers["class"] = type(self).__name__
            self.unique_identifiers["tokenizer_path"] = args if len(args) > 0 else ["n/a"]
            for option in kwargs:
                self.unique_identifiers[option] = str(kwargs[option])

            self.unique_description = json.dumps(self.unique_identifiers, indent=4)
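
For anyone wondering why this only surfaces at dataset-building time: megatron/core/datasets/megatron_dataset.py serializes its config with json.dumps(..., default=lambda obj: obj.unique_identifiers), so every non-JSON-serializable object in that config (here, the tokenizer) must expose that attribute. A standalone sketch of the mechanism (the classes below are made-up stand-ins, not Megatron code):

```python
# Illustration only: stand-in classes showing why json.dumps fails without a
# unique_identifiers attribute when default=lambda obj: obj.unique_identifiers is used.
import json
from collections import OrderedDict


class TokenizerWithoutIds:
    """Stand-in for _Llama3Tokenizer before the fix: no unique_identifiers."""


class TokenizerWithIds:
    """Stand-in for the patched _Llama3Tokenizer."""

    def __init__(self):
        self.unique_identifiers = OrderedDict(
            [("class", type(self).__name__), ("tokenizer_path", ["n/a"])]
        )


def describe(tokenizer):
    # Same pattern as MegatronDataset.unique_description: anything json cannot
    # serialize falls back to its unique_identifiers attribute.
    identifiers = {"dataset": "GPTDataset", "tokenizer": tokenizer}
    return json.dumps(identifiers, indent=4, default=lambda obj: obj.unique_identifiers)


print(describe(TokenizerWithIds()))     # serializes fine
print(describe(TokenizerWithoutIds()))  # AttributeError: ... 'unique_identifiers'
```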
nakroy commented 1 month ago

> The problem is that unique_identifiers is not implemented in Llama3Tokenizer, which does not inherit from MegatronTokenizer. Changing the lines
>
> https://github.com/NVIDIA/Megatron-LM/blob/9bcd4175becc515331537f0c78eb70079de0eaa8/megatron/training/tokenizer/tokenizer.py#L567-L569
>
> to the following should solve the problem:
>
>     class _Llama3Tokenizer(Llama3Tokenizer):
>         def __init__(self, *args, **kwargs):
>             super().__init__(*args, **kwargs)
>             self.unique_identifiers = OrderedDict()
>             self.unique_identifiers["class"] = type(self).__name__
>             self.unique_identifiers["tokenizer_path"] = args if len(args) > 0 else ["n/a"]
>             for option in kwargs:
>                 self.unique_identifiers[option] = str(kwargs[option])
>
>             self.unique_description = json.dumps(self.unique_identifiers, indent=4)

Thanks, it works for me. It seems Llama3Tokenizer still has a few small issues to fix before it can really be used for finetuning properly...

nakroy commented 1 month ago

> I switched to using 'HuggingFaceTokenizer' as the 'tokenizer-type' argument, but that runs into some other bugs.

I think Llama3Tokenizer is the right choice for llama3 model training, but it has been unstable for me so far. Or maybe the arguments I set are not proper, because I only changed a few arguments from the script I used to finetune llama2.