HPDL-Group / Merak

Apache License 2.0

self._valid_micro_batch(micro_batch_id) AssertionError when using the "shifted_critical_path" train schedule #4

Closed lin88lin8850 closed 1 year ago

lin88lin8850 commented 1 year ago
[screenshot: AssertionError traceback raised from `self._valid_micro_batch(micro_batch_id)`]

Some parameters: num_gpu=8, dp=tp=pp=2, train_schedule="shifted_critical_path"
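For reference, a minimal sketch of how these settings would typically be passed (assuming Merak's `Merak.init(dp, tp, pp)` entry point and `MerakArguments` as shown in the project's examples; values other than the ones listed above are placeholders):

```python
import Merak
from Merak import MerakArguments

# 8 GPUs split across the three parallel dimensions: 2 * 2 * 2 = 8.
dp, tp, pp = 2, 2, 2
Merak.init(dp, tp, pp)

training_args = MerakArguments(
    output_dir="output",                     # placeholder path
    per_device_train_batch_size=8,
    train_schedule="shifted_critical_path",  # schedule that hits the assertion
)
```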

lucasleesw commented 1 year ago

Hi, thanks for using Merak. Could you provide more parameter settings? One possible fix is to use a larger number of microbatches (gradient_accumulation_steps in the training args).

lin88lin8850 commented 1 year ago

> Hi, thanks for using Merak. Could you provide more parameter settings? One possible fix is to use a larger number of microbatches (gradient_accumulation_steps in the training args).

MerakArguments( _n_gpu=8, activation_checkpoint_ratio=None, activation_checkpointing=True, adafactor=False, bf16=False, bf16_full_eval=False, cache_name=None, cache_sharding=False, checkpoint_num_layers=1, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_steps=None, evaluation_strategy=IntervalStrategy.NO, finetune=False, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_predivide_factor=1.0, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, hysteresis=2, ignore_data_skip=False, init_method_std=0.02, initial_scale_power=32, input_names=None, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir=output/runs/Dec06_06-01-25_ubuntu, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=500, logging_strategy=IntervalStrategy.STEPS, loss_scale=0.0, loss_scale_window=1000, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=20, metric_for_best_model=None, min_loss_scale=1, mp_parameters=, no_cuda=False, no_load_optim=False, no_load_rng=False, no_save_optim=False, no_save_rng=False, num_layers=None, num_train_epochs=3.0, output_dir=output, overwrite_output_dir=False, parallel_vocab=True, partition_method=uniform, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=8, prediction_loss_only=False, prescale_gradients=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, return_logits=False, run_name=output, save=False, save_on_each_node=False, save_steps=500, save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=42, seq_length=None, shard_count=None, sharded_ddp=[], skip_memory_metrics=True, split_inputs=False, tf32=None, tp_overlapping_level=0, tpu_metrics_debug=False, tpu_num_cores=None, train_schedule=1f1b, use_legacy_prediction_loop=False, wall_clock_breakdown=False, xpu_backend=None, )

lucasleesw commented 1 year ago

@lin88lin8850 The number of microbatches (gradient_accumulation_steps in the training args) is 1 in your setting, which makes pipeline parallelism inefficient. Please set the number of microbatches to at least the number of pipeline stages. We will add related info in the next commit.
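In other words, with pipeline parallelism the number of microbatches is gradient_accumulation_steps, and every pipeline stage needs at least one microbatch to work on. A pre-flight check along these lines (a hypothetical helper written for this thread, not part of Merak's API) would catch the setting above:

```python
def check_num_microbatches(gradient_accumulation_steps: int, pipeline_stages: int) -> None:
    """Sanity check mirroring the advice above: the number of microbatches
    (gradient_accumulation_steps) should be at least the number of pipeline
    stages, otherwise stages sit idle or the schedule cannot be built."""
    if gradient_accumulation_steps < pipeline_stages:
        raise ValueError(
            f"gradient_accumulation_steps={gradient_accumulation_steps} is smaller "
            f"than the number of pipeline stages ({pipeline_stages}); increase it."
        )

check_num_microbatches(gradient_accumulation_steps=4, pipeline_stages=2)   # passes
# check_num_microbatches(gradient_accumulation_steps=1, pipeline_stages=2)  # raises, as in this issue
```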

lin88lin8850 commented 1 year ago

> @lin88lin8850 The number of microbatches (gradient_accumulation_steps in the training args) is 1 in your setting, which makes pipeline parallelism inefficient. Please set the number of microbatches to at least the number of pipeline stages. We will add related info in the next commit.

Hi Lucas, the bug still exists when I set gradient_accumulation_steps = 2, which equals the number of pipeline stages.

lin88lin8850 commented 1 year ago

@lucasleesw When I set gradient_accumulation_steps = 4 (dp=tp=pp=2), it works. Please add more information to the docs at your convenience, thanks!
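For anyone hitting the same assertion, a sketch of the configuration that worked in this thread (only the relevant fields; the other arguments keep the defaults shown in the dump above):

```python
from Merak import MerakArguments

training_args = MerakArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    train_schedule="shifted_critical_path",
    # With pp=2, setting this equal to the stage count (2) still failed here;
    # 4 microbatches (2 * pp) made the shifted_critical_path schedule run.
    gradient_accumulation_steps=4,
)
```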