jjzha closed this issue 5 months ago
Yes, in our runs we were able to restart from checkpoints after fully stopping training.
Did you check whether your checkpoint is a single file or still sharded?
The checkpoint is still sharded, and each folder contains the model_optim_rng.pt file. Apart from the TP not being read in properly (I think), the files are larger, likely because of the optimizer states, et cetera.
I'm not 100% sure whether this causes the GPU OOM, but we would obviously like to resume the optimizer states from where training crashed.
Here is the full log:
[2024-02-01 08:11:17,294] torch.distributed.run: [WARNING]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
using world size: 4, data-parallel-size: 4, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:SentencePieceTokenizer
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 4
data_path ....................................... ['/mpt/Megatron-LLM/tokenized/_text_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
data_type ....................................... gpt
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 32
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 100
eval_iters ...................................... 100
eval_only ....................................... False
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 11008
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 512
glu_activation .................................. swiglu
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
iteration ....................................... 1000
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lima_dropout .................................... False
load ............................................ /mpt/Megatron-LLM/sharded_4/weights/
load_iters ...................................... None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0003
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 5000
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 4096
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
metrics ......................................... []
micro_batch_size ................................ 2
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-06
mmap_warmup ..................................... False
model_name ...................................... llama2
model_type ...................................... encoder_or_decoder
new_tokens ...................................... True
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 32
num_attention_heads_kv .......................... 32
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
optimizer ....................................... adam
override_opt_param_scheduler .................... False
padded_vocab_size ............................... 32000
parallel_attn ................................... False
parallel_layernorm .............................. False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.rotary
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... selective
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
rope_scaling_factor ............................. 1.0
rope_theta ...................................... 10000.0
sample_rate ..................................... 1.0
save ............................................ /mpt/Megatron-LLM/sharded_4/weights/
save_interval ................................... 500
scalar_loss_mask ................................ 0.0
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_iters ...................................... []
sliding_window_size ............................. None
split ........................................... 969, 30, 1
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. /mpt/Megatron-LLM/sharded_4/weights/tensorboard/
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
tie_embed_logits ................................ False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. SentencePieceTokenizer
train_data_path ................................. None
train_iters ..................................... 50000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
use_bias ........................................ False
use_checkpoint_args ............................. True
use_checkpoint_opt_param_scheduler .............. True
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. True
use_one_sent_docs ............................... False
use_post_ln ..................................... False
use_ring_exchange_p2p ........................... False
use_rms_norm .................................... True
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_extra_ids_list ............................ None
vocab_file ...................................... /mpt/Megatron-LLM/megatron/weights/tokenizer.model
wandb_api_key ................................... None
wandb_entity .................................... meditron
wandb_id ........................................ None
wandb_logger .................................... False
wandb_name ...................................... None
wandb_project ................................... None
wandb_resume .................................... allow
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 64
> building SentencePieceTokenizer tokenizer ...
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
> padded vocab (size: 32005) with 123 dummy tokens (new size: 32128)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 11008 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 4096 from checkpoint
Setting padded_vocab_size to 32000 from checkpoint
Setting position_embedding_type to PositionEmbeddingType.rotary from checkpoint
Setting num_attention_heads_kv to 32 from checkpoint
Setting parallel_attn to False from checkpoint
Setting parallel_layernorm to False from checkpoint
Setting use_rms_norm to True from checkpoint
Setting glu_activation to swiglu from checkpoint
Setting tie_embed_logits to False from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
> setting tensorboard ...
Special tokens: {'<CLS>': 32000, '<SEP>': 32001, '<EOD>': 32002, '<MASK>': 32003, '<PAD>': 32004, '<s>': 1, '</s>': 2}
time to initialize megatron (seconds): 11.453
[after megatron is initialized] datetime: 2024-02-01 08:11:32
Building model ...
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
warnings.warn("Llama is not intended to use bias_dropout_fusion")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
warnings.warn("Llama is not intended to use bias_dropout_fusion")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
warnings.warn("Llama is not intended to use bias_dropout_fusion")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:36: UserWarning: Llama is not intended to use bias_dropout_fusion
warnings.warn("Llama is not intended to use bias_dropout_fusion")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/mpt/Megatron-LLM/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6739464192
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
self._dummy_overflow_buf = torch.cuda.IntTensor([0])
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
self._dummy_overflow_buf = torch.cuda.IntTensor([0])
Traceback (most recent call last):
File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
Traceback (most recent call last):
File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
self._dummy_overflow_buf = torch.cuda.IntTensor([0])
/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
self._dummy_overflow_buf = torch.cuda.IntTensor([0])
return opt_ty(optimizer,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
return opt_ty(optimizer,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 301.88 MiB is free. Process 1608604 has 39.09 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 2 has a total capacty of 39.39 GiB of which 293.88 MiB is free. Process 1608606 has 39.09 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
Traceback (most recent call last):
File "/mpt/Megatron-LLM/Megatron-LLM/finetune.py", line 268, in <module>
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 108, in pretrain
model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/training.py", line 364, in _setup_model_and_optimizer
optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/__init__.py", line 128, in get_megatron_optimizer
return opt_ty(optimizer,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
return opt_ty(optimizer,
File "/mpt/Megatron-LLM/Megatron-LLM/megatron/optimizer/optimizer.py", line 534, in __init__
main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 1 has a total capacty of 39.39 GiB of which 293.88 MiB is free. Process 1608605 has 39.09 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
main_param = param.detach().clone().float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 502.00 MiB. GPU 3 has a total capacty of 39.39 GiB of which 333.88 MiB is free. Process 1608607 has 39.05 GiB memory in use. Of the allocated memory 38.40 GiB is allocated by PyTorch, and 1.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-02-01 08:11:37,317] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 8393) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-02-01_08:11:37
host : 66f4a2a358a8
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 8394)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-02-01_08:11:37
host : 66f4a2a358a8
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 8395)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-02-01_08:11:37
host : 66f4a2a358a8
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 8396)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-01_08:11:37
host : 66f4a2a358a8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 8393)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
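For context on why the run above dies while creating the fp32 main parameters, here is a rough back-of-the-envelope estimate. It is only a sketch: the parameter count (6739464192) and the GPU capacity (39.39 GiB) come from the log, but the per-parameter byte costs are a common rule of thumb for bf16 training with Adam, not something measured in this run, and the even split across tensor-parallel ranks is an assumption. The function name `estimate_gib` is made up for illustration.

```python
# Rough per-GPU memory estimate for mixed-precision Adam training.
# Assumption: each parameter costs 2 bytes (bf16 weights) + 4 bytes (fp32 main
# copy) + 8 bytes (two fp32 Adam moments). Gradients and activations are
# ignored, so real usage will be higher than this estimate.

GIB = 1024 ** 3


def estimate_gib(num_params: int, tensor_parallel: int = 1) -> dict:
    """Return approximate memory in GiB per GPU, assuming an even TP split."""
    per_gpu = num_params / tensor_parallel
    parts = {
        "bf16_params": per_gpu * 2 / GIB,
        "fp32_main_params": per_gpu * 4 / GIB,
        "adam_moments": per_gpu * 8 / GIB,
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    num_params = 6_739_464_192  # from the log line for model parallel rank (0, 0)
    gpu_capacity_gib = 39.39    # from the OOM message
    for tp in (1, 2, 4):
        est = estimate_gib(num_params, tp)
        verdict = "under" if est["total"] < gpu_capacity_gib else "over"
        print(f"TP={tp}: ~{est['total']:.1f} GiB, {verdict} the {gpu_capacity_gib} GiB capacity")
```

Under these assumptions, a tensor-parallel size of 1 puts the whole model and its optimizer state on a single GPU, which would be consistent with the OOM in the traceback.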
We tried both with and without the --use_checkpoint_opt_param_scheduler flag.
OK, never mind. The problem was on our side: we have to keep the save and load directories the same; otherwise, things are loaded incorrectly.
We also use the --use_checkpoint_opt_param_scheduler flag explicitly.
Thanks!
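A quick pre-flight check along these lines might catch the problem before launching. This is a minimal sketch, not part of the library: the helper names `find_latest_checkpoint` and `check_resume_dirs` are hypothetical, and it only relies on details mentioned in this thread (the latest_checkpointed_iteration.txt tracker file, the iter_* folders, and the model_optim_rng.pt files). It assumes the tracker file contains a plain iteration number.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(load_dir: str) -> Optional[Path]:
    """Return the iter_* folder named by latest_checkpointed_iteration.txt, if any."""
    root = Path(load_dir)
    tracker = root / "latest_checkpointed_iteration.txt"
    if not tracker.is_file():
        return None
    text = tracker.read_text().strip()
    if not text.isdigit():
        return None  # non-numeric tracker content is not handled in this sketch
    iteration = int(text)
    # Match on the numeric suffix so the zero-padding width does not matter.
    for folder in sorted(root.glob("iter_*")):
        suffix = folder.name[len("iter_"):]
        if folder.is_dir() and suffix.isdigit() and int(suffix) == iteration:
            return folder
    return None


def check_resume_dirs(save_dir: str, load_dir: str) -> list:
    """Collect human-readable problems that could break resuming."""
    problems = []
    if Path(save_dir).resolve() != Path(load_dir).resolve():
        problems.append("save and load directories differ")
    latest = find_latest_checkpoint(load_dir)
    if latest is None:
        problems.append("tracker file missing or pointing to no iter_* folder")
    elif not list(latest.rglob("model_optim_rng.pt")):
        problems.append(f"no model_optim_rng.pt found under {latest}")
    return problems
```

Calling `check_resume_dirs(save_dir, load_dir)` with the values passed to the save and load arguments returns an empty list when nothing looks wrong.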
Hello, thanks for this nice library!
I was wondering if it’s possible to load from an intermediate checkpoint (our servers crashed during continuous pre-training)? We’re running into issues where some command line arguments (e.g. TP and PP) are not loaded correctly from the checkpoint (i.e., the iter_xxxxxx folders).
We're running these arguments:
Where the number in latest_checkpointed_iteration.txt points to the last checkpoint.
Here is a snippet of the log:
Here tensor_model_parallel_size has value 1, which shouldn't be correct and causes GPU OOM.
Is there a way to fix this?
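To see at a glance which arguments were taken from the checkpoint, one option is to scan the training log for the "Setting ... from checkpoint" lines and compare them with what was intended. The sketch below assumes that line format stays as it appears in the log; the function names and the expected value of 4 in the usage example are hypothetical placeholders, not values confirmed in this thread.

```python
import re

# Matches lines such as "Setting tensor_model_parallel_size to 1 from checkpoint".
_PATTERN = re.compile(r"^Setting (\w+) to (.+) from checkpoint$")


def checkpoint_overrides(log_text: str) -> dict:
    """Map each argument name to the value the log says came from the checkpoint."""
    overrides = {}
    for line in log_text.splitlines():
        match = _PATTERN.match(line.strip())
        if match:
            overrides[match.group(1)] = match.group(2)
    return overrides


def mismatches(overrides: dict, expected: dict) -> dict:
    """Return {name: (expected, found)} for arguments that differ from expectations."""
    return {
        name: (str(want), overrides[name])
        for name, want in expected.items()
        if name in overrides and overrides[name] != str(want)
    }


if __name__ == "__main__":
    sample = (
        "Setting num_layers to 32 from checkpoint\n"
        "Setting tensor_model_parallel_size to 1 from checkpoint\n"
        "Setting pipeline_model_parallel_size to 1 from checkpoint\n"
    )
    found = checkpoint_overrides(sample)
    # The expected TP size here is a placeholder chosen for illustration only.
    print(mismatches(found, {"tensor_model_parallel_size": 4}))
```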