epfLLM / Megatron-LLM

distributed trainer for LLMs

How to load from a saved intermediate checkpoint? #95

Closed jjzha closed 5 months ago

jjzha commented 5 months ago

Hello, thanks for this nice library!

Is it possible to resume from an intermediate checkpoint? Our servers crashed during continuous pre-training, and we're running into issues where some command-line arguments (e.g., TP and PP) are not loaded correctly from the checkpoint (i.e., the iter_xxxxxx folders).

We're running these arguments:

LOG_ARGS="--log_interval 1 --save_interval 500 --eval_interval 100"
TRAIN_ARGS="--train_iters 50000 --lr_decay_style cosine --lr_warmup_iters 5000 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=INFO

torchrun $DISTRIBUTED_ARGS finetune.py \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --load /mpt/Megatron-LLM/sharded_4/weights/ \
    --save /mpt/Megatron-LLM/sharded_4/weights/ \
    --tensorboard_dir /mpt/Megatron-LLM/checkpoints/weights/tensorboard/ \
    --data_path /mpt/Megatron-LLM/tokenized/_text_document \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file /mpt/Megatron-LLM/megatron/weights/tokenizer.model \
    --bf16 \
    --use_flash_attn \
    --no_bias_gelu_fusion \
    --micro_batch_size 1 \
    --global_batch_size 512 \
    --seq_length 2048 \
    --sequence_parallel \
    --recompute_granularity selective \
    --use_checkpoint_args \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS 

Here, the number in latest_checkpointed_iteration.txt points to the last saved checkpoint.
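
For reference, a minimal sketch (not part of the run script) to confirm which iteration will be resumed; it assumes the usual Megatron checkpoint layout, i.e. latest_checkpointed_iteration.txt plus iter_XXXXXXX/mp_rank_*/model_optim_rng.pt under the --load directory:

# Sketch: print the iteration that will be resumed, assuming the standard
# Megatron checkpoint layout under the --load directory used above.
import os

load_dir = "/mpt/Megatron-LLM/sharded_4/weights/"

with open(os.path.join(load_dir, "latest_checkpointed_iteration.txt")) as f:
    latest = f.read().strip()

# The tracker file holds either an iteration number or the literal string "release".
subdir = "release" if latest == "release" else f"iter_{int(latest):07d}"
iter_dir = os.path.join(load_dir, subdir)

print("latest checkpointed iteration:", latest)
print("resuming from:", iter_dir)
print("shards:", sorted(os.listdir(iter_dir)))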

Here is a snippet of the log:

Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
[... the same output is repeated by the other ranks ...]
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
> setting tensorboard ...
time to initialize megatron (seconds): 11.061
[after megatron is initialized] datetime: 2024-01-31 19:21:24

Here tensor_model_parallel_size is read as 1, which should not be correct and leads to a GPU OOM.

Is there a way to fix this?
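
For reference, a quick way to check which TP/PP sizes a shard itself recorded (a sketch only; the shard path below is hypothetical, and it assumes the parsed arguments are stored under an "args" key, as in upstream Megatron-LM):

# Sketch: load one shard on CPU and print the parallelism settings it recorded.
# The path is hypothetical; point it at the actual iter_XXXXXXX folder.
import torch

ckpt = "/mpt/Megatron-LLM/sharded_4/weights/iter_0001000/mp_rank_00/model_optim_rng.pt"
state = torch.load(ckpt, map_location="cpu")

print("top-level keys:", list(state.keys()))
saved_args = state.get("args")
if saved_args is not None:
    print("tensor_model_parallel_size  :", saved_args.tensor_model_parallel_size)
    print("pipeline_model_parallel_size:", saved_args.pipeline_model_parallel_size)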

martinjaggi commented 5 months ago

Yes, in our runs we were able to restart from checkpoints after fully stopping training.

Did you check whether your checkpoint is a single file or still sharded?

jjzha commented 5 months ago

The checkpoint is still sharded, and each folder contains a model_optim_rng.pt file. Apart from the TP size apparently not being read in properly, the files are larger, most likely because they also contain the optimizer states, etc.

I'm not 100% sure whether that is what causes the GPU OOM, but we would obviously like to resume with the optimizer states from the point where training crashed.
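
As a rough sanity check on the OOM (back-of-the-envelope arithmetic; the only number taken from the log is the parameter count reported further down): with bf16 weights, an fp32 master copy, and two fp32 Adam moments, roughly 14 bytes per parameter are needed for weights and optimizer state alone, so with TP=1 each GPU has to hold all of it:

# Rough estimate only: bf16 weight (2 B) + fp32 master copy (4 B)
# + fp32 Adam exp_avg (4 B) + fp32 Adam exp_avg_sq (4 B) = 14 B per parameter,
# ignoring gradients and activations.
n_params = 6_739_464_192           # parameter count reported in the log below
bytes_per_param = 2 + 4 + 4 + 4

for tp in (1, 4):
    gib = n_params * bytes_per_param / tp / 2**30
    print(f"TP={tp}: ~{gib:.0f} GiB of weights + optimizer state per GPU")
# TP=1: ~88 GiB -> does not fit on a 40 GB card, consistent with the OOM below
# TP=4: ~22 GiB -> leaves headroom for gradients and activations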

Here is the full log:

[2024-02-01 08:11:17,294] torch.distributed.run: [WARNING] 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
using world size: 4, data-parallel-size: 4, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:SentencePieceTokenizer
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_impl ....................................... infer
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 4
  data_path ....................................... ['/mpt/Megatron-LLM/tokenized/_text_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  data_type ....................................... gpt
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 100
  eval_iters ...................................... 100
  eval_only ....................................... False
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 11008
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_e4m3 ........................................ False
  fp8_hybrid ...................................... False
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 512
  glu_activation .................................. swiglu
  gradient_accumulation_fusion .................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  iteration ....................................... 1000
  kv_channels ..................................... 128
  layernorm_epsilon ............................... 1e-05
  lima_dropout .................................... False
  load ............................................ /mpt/Megatron-LLM/sharded_4/weights/
  load_iters ...................................... None
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0003
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 5000
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 4096
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  metrics ......................................... []
  micro_batch_size ................................ 2
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-06
  mmap_warmup ..................................... False
  model_name ...................................... llama2
  model_type ...................................... encoder_or_decoder
  new_tokens ...................................... True
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 32
  num_attention_heads_kv .......................... 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  optimizer ....................................... adam
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 32000
  parallel_attn ................................... False
  parallel_layernorm .............................. False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... PositionEmbeddingType.rotary
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... selective
  recompute_method ................................ None
  recompute_num_layers ............................ 1
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  rope_scaling_factor ............................. 1.0
  rope_theta ...................................... 10000.0
  sample_rate ..................................... 1.0
  save ............................................ /mpt/Megatron-LLM/sharded_4/weights/
  save_interval ................................... 500
  scalar_loss_mask ................................ 0.0
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_iters ...................................... []
  sliding_window_size ............................. None
  split ........................................... 969, 30, 1
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. /mpt/Megatron-LLM/sharded_4/weights/tensorboard/
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  tie_embed_logits ................................ False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. SentencePieceTokenizer
  train_data_path ................................. None
  train_iters ..................................... 50000
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  use_bias ........................................ False
  use_checkpoint_args ............................. True
  use_checkpoint_opt_param_scheduler .............. True
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. True
  use_one_sent_docs ............................... False
  use_post_ln ..................................... False
  use_ring_exchange_p2p ........................... False
  use_rms_norm .................................... True
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_extra_ids_list ............................ None
  vocab_file ...................................... /mpt/Megatron-LLM/megatron/weights/tokenizer.model
  wandb_api_key ................................... None
  wandb_entity .................................... meditron
  wandb_id ........................................ None
  wandb_logger .................................... False
  wandb_name ...................................... None
  wandb_project ................................... None
  wandb_resume .................................... allow
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 64
> building SentencePieceTokenizer tokenizer ...
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
 > padded vocab (size: 32005) with 123 dummy tokens (new size: 32128)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
[... the same output is repeated by the other ranks ...]
> setting tensorboard ...
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
time to initialize megatron (seconds): 11.453
[after megatron is initialized] datetime: 2024-02-01 08:11:32 
Building model ...
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
  warnings.warn("Llama is not intended to use bias_dropout_fusion")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
  warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
  warnings.warn( "Llama is not intended to use dropout")
[... the same three warnings are repeated by the other ranks ...]
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6739464192
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Traceback (most recent call last):
  File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
Traceback (most recent call last):
  File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
    pretrain(args, data_provider, model_provider,  ModelType.encoder_or_decoder,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
    pretrain(args, data_provider, model_provider,  ModelType.encoder_or_decoder,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
    model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
    model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = torch.cuda.IntTensor([0])
    return opt_ty(optimizer,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
    return opt_ty(optimizer,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
    main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 301.88 MiB is free. Process 1608604 has 39.09 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 2 has a total capacty of 39.39 GiB of which 293.88 MiB is free. Process 1608606 has 39.09 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Traceback (most recent call last):
  File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
Traceback (most recent call last):
  File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
    pretrain(args, data_provider, model_provider,  ModelType.encoder_or_decoder,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
    pretrain(args, data_provider, model_provider,  ModelType.encoder_or_decoder,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
    model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
    model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
    return opt_ty(optimizer,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
    return opt_ty(optimizer,
  File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
    main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 1 has a total capacty of 39.39 GiB of which 293.88 MiB is free. Process 1608605 has 39.09 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 3 has a total capacty of 39.39 GiB of which 333.88 MiB is free. Process 1608607 has 39.05 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-02-01 08:11:37,317] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8393) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-01_08:11:37
  host      : 66f4a2a358a8
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 8394)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-02-01_08:11:37
  host      : 66f4a2a358a8
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8395)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-02-01_08:11:37
  host      : 66f4a2a358a8
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 8396)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-01_08:11:37
  host      : 66f4a2a358a8
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8393)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

We tried both using and not using the --use_checkpoint_opt_param_scheduler flag.

jjzha commented 5 months ago

OK, never mind: it was an issue on our side. We have to keep the --save and --load directories the same; otherwise, the checkpoint is loaded incorrectly.

We also use the --use_checkpoint_opt_param_scheduler flag explicitly.
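
For anyone else hitting this, a trivial pre-launch check reflecting the above (a sketch; the paths are simply the ones from our command):

# Tiny sanity check: when resuming, --load and --save should point to the same
# checkpoint directory.
import os

load_dir = "/mpt/Megatron-LLM/sharded_4/weights/"
save_dir = "/mpt/Megatron-LLM/sharded_4/weights/"

assert os.path.realpath(load_dir) == os.path.realpath(save_dir), (
    "--load and --save should point to the same directory when resuming"
)
print("resuming from and saving to:", os.path.realpath(load_dir))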

Thanks!