Open caichaoxiang opened 3 days ago
During training, EMoE runs a grid search over seeds (0, 1, 2) and learning rates (2e-5, 3e-5, 5e-5); each combination produces a result. When the grid search ends, a txt file is saved to the output dir. You may see something like this:
The filename of this txt file contains the best learning rate found during training. In test_glue_no_trainer.py, you should see the following lines of code, which extract the best lr from the filename of the txt file.
So to find the bug, I think you need to check whether the txt file was saved successfully during training.
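To check this yourself, a minimal sketch of the kind of filename parsing described above — note that the pattern `best_lr_<value>.txt` is a hypothetical example, not EMoE's actual naming scheme:

```python
import os
import re


def find_best_lr(output_dir):
    """Scan output_dir for the result txt file saved after grid search
    and extract the learning rate embedded in its filename.

    The filename pattern 'best_lr_<value>.txt' is an assumed example;
    check the actual txt filename EMoE writes in your output dir.
    """
    for fname in os.listdir(output_dir):
        match = re.match(r"best_lr_([0-9.e-]+)\.txt$", fname)
        if match:
            return float(match.group(1))
    # No txt file found: this is the case where the test script
    # would report "No best results found".
    return None
```

If this returns `None` on your output dir, the txt file was never written (or was written under a different name), which matches the "No best results found" message.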
Hello, during EMoE's language training and testing process, when I run the test after training, the following is displayed:
['cola'] Namespace(adaptive_experts=False, add_expert_size=0, aux_loss_weight=0.01, cache_dir='./.cache', capacity_factor=1.5, checkpointing_steps=None, disable_peft=False, expert_repeat=1, gate_noise=1.0, gate_type='top', gradient_accumulation_steps=1, hub_model_id=None, hub_token=None, ignore_mismatched_sizes=False, include_training=False, is_gshard_loss=False, key_gate=False, learning_rates=[2e-05, 3e-05, 5e-05], load_model=None, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, max_expert_num=8, max_length=128, max_train_steps=None, model_name_or_path='/MyData/bert-large-cased', moe_drop=0.1, moe_layers=[10, 11], normalize_one_score_gate=False, num_experts=16, num_train_epochs=10, num_warmup_steps=0, one_score=False, one_score_gate_update_momentum=0.0, output_dir='test', pad_to_max_length=False, per_device_eval_batch_size=32, per_device_train_batch_size=64, push_to_hub=False, random_cluster=False, random_init_gate=False, report_to='tensorboard', resume_from_checkpoint=None, save_model=False, seeds=[0, 1, 2], source_dir='/MyData/bert-large-cased_save/cola', task_name='cola', to_MoE=False, top_k=4, train_file=None, use_fp16=True, use_slow_tokenizer=False, validation_file=None, weight_decay=0.0, with_tracking=True) learn_gate_random_False_repeat16 test No best results found
What is the problem?
As far as I can remember, I only changed the following in search_glue_no_trainer.py line 544:
There was an error (*** AttributeError: 'Accelerator' object has no attribute 'use_fp16'), so I changed it to:
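For context on that AttributeError: older releases of `accelerate` exposed a boolean `use_fp16` attribute on `Accelerator`, while newer releases replaced it with the `mixed_precision` string ("no", "fp16", "bf16", ...). A minimal, backward-compatible sketch of the check (the helper name `is_fp16` is mine, not from the repo):

```python
def is_fp16(accelerator):
    """Return True if the accelerator runs in fp16 mixed precision.

    Handles both the old accelerate API (boolean `use_fp16` attribute)
    and the newer one (`mixed_precision` string attribute).
    """
    if hasattr(accelerator, "use_fp16"):  # older accelerate releases
        return accelerator.use_fp16
    # Newer accelerate: mixed_precision is "no", "fp16", "bf16", ...
    return getattr(accelerator, "mixed_precision", "no") == "fp16"
```

A change along these lines should not affect whether the grid-search txt file gets written, so the "No best results found" message likely has a separate cause.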