Closed: SefaZeng closed this issue 2 months ago.
The following are the arguments:
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_cache_path ................................. None
data_impl ....................................... mmap
data_parallel_random_init ....................... False
data_parallel_size .............................. 16
data_path ....................................... ['1', '/pile/megatron_bin/pile_00_text_document', '1', '/pile/megatron_bin/pile_01_text_document', '1', '/pile/megatron_bin/pile_02_text_document', '1', '/pile/megatron_bin/pile_03_text_document', '1', '/pile/megatron_bin/pile_04_text_document', '1', '/pile/megatron_bin/pile_05_text_document', '1', '/pile/megatron_bin/pile_07_text_document', '1']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 100
embedding_path .................................. None
embedding_weights_in_fp32 ....................... False
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 24
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 256
gradient_accumulation_fusion .................... True
group_query_attention ........................... False
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /Megatron-LM/checkpoints/baseline
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0003
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 750
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
master_addr ..................................... 11.214.159.213
master_port ..................................... 32307
max_position_embeddings ......................... 2048
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 1
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_p2p_comm ................................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 2
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... rope
profile ......................................... False
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
sample_rate ..................................... 1.0
save ............................................ /Megatron-LM/checkpoints/baseline
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 2
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. /Megatron-LM/../mt5/spiece.model
tokenizer_type .................................. MT5Tokenizer
train_data_path ................................. None
train_iters ..................................... 500000
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 2
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. True
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... /mt5/vocab.txt
vocab_size ...................................... None
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 64
-------------------- end of arguments ---------------------
My initial reaction is that this might be a reasonable per iteration time for V100 since on A100 it's typical to see per iteration times around 1s. I don't have V100 I can test on unfortunately, but I'll try to replicate the configuration with A100 and let you know what I see.
Thank you for your reply! Does this mean that training a GPT-2 model with 350M parameters on 1 trillion tokens (a standard budget for today's LLMs) using 64 32GB V100s would take about 90 days (see the sketch below)? That's a bit of a shock to me...
Another question is why the memory usage is so low: each dataset shard is 40 GB and there are 30 shards, yet the memory usage on each machine is only 16~20 GB.
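For reference, here is that estimate written out as a small sketch. The per-iteration token count (256 × 2048 ≈ 0.52M) and the 4.3 s iteration time are taken from the numbers above; with the exact token count the result comes out slightly under 99.5 days.

```python
# Back-of-the-envelope check: how long to see 1 trillion tokens at the
# observed iteration time? Numbers taken from the arguments above.
tokens_target = 1e12              # total training tokens
tokens_per_iter = 256 * 2048      # global_batch_size * seq_length (~0.52M)
iter_time_s = 4.3                 # observed seconds per iteration

iterations = tokens_target / tokens_per_iter
days = iterations * iter_time_s / 3600 / 24
print(f"{iterations:,.0f} iterations -> {days:.1f} days")
# ~1,907,349 iterations -> ~94.9 days
```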
I noticed you are using `tensor_model_parallel_size=2` and `pipeline_model_parallel_size=2`. You shouldn't need these for the scale of model you are training.
I also used the formula in https://arxiv.org/pdf/2104.04473.pdf to estimate the throughput you are observing. It seems to be about 72 × 256 × 2048 × 24 × 1024 × 1024 / (4.3 × 64) ≈ 3.5 Teraflop/s per GPU, which is a very small fraction of peak V100 device throughput (130 Teraflop/s). Something seems wrong here; the first thing I would try is reducing `tensor_model_parallel_size` and `pipeline_model_parallel_size` to 1.
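A minimal sketch of that throughput estimate, assuming the per-iteration model-FLOPs approximation 72 · B · s · l · h² from the paper (forward + backward, without activation recomputation) and the values from the argument dump above:

```python
# Throughput estimate following the model-FLOPs formula from
# https://arxiv.org/pdf/2104.04473.pdf: roughly 72 * B * s * l * h^2
# FLOPs per iteration (forward + backward, ignoring the attention and
# vocabulary correction terms and activation recomputation).
batch_size = 256       # global_batch_size
seq_len = 2048         # seq_length
num_layers = 24        # num_layers
hidden_size = 1024     # hidden_size
iter_time_s = 4.3      # observed seconds per iteration
num_gpus = 64          # world_size

flops_per_iter = 72 * batch_size * seq_len * num_layers * hidden_size ** 2
tflops_per_gpu = flops_per_iter / (iter_time_s * num_gpus) / 1e12
print(f"{tflops_per_gpu:.1f} TFLOP/s per GPU")  # ~3.5, vs ~130 peak on V100
```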
Marking as stale. No activity in 60 days.
Your question: I am trying to train a model with 600M parameters and a 250k vocabulary, so the model configuration matches a pure-English model with about 300M parameters. I train the model on 64 (8×8) 32GB V100 GPUs with a global_batch_size of 256 and seq_length of 2048, which means about 0.5M tokens per iteration. I find the iteration time is 4.3 s. Is this per-iteration time reasonable, or is it slow? If the calculation is correct: `1e12 / 500000 * 4.3 / 3600 / 24 = 99.53`. Does that mean I need 99 days to train a model with only 600M parameters? The training log is as follows: