GCYZSL / MoLA

Model trained with the original parameter configuration scores only ~50 on ScienceQA #3

Closed 2018211801 closed 3 months ago

2018211801 commented 4 months ago

Hi, I ran into a problem while reproducing your results; could you help me figure it out? Based on the log, I think there are a few possible errors: "Did not load optimizer and scheduler" and "Checkpoint mian/adapter_model.bin not found".

base_model: /cognitive_comp/wangxiaochen/dmodels/Llama-2-7b-hf
data_path: /cognitive_comp/wangxiaochen/projects/MoLA_new/data/processed/scienceqa/science_qa.hf
output_dir: ./sampled_scienceqa_exp3_2468bs8
batch_size: 128
micro_batch_size: 8
num_epochs: 1
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 1
lora_r: [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
number_experts: [2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8]
top_k: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj', 'k_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj']
train_on_inputs: True
add_eos_token: True
group_by_length: True
wandb_project: MOLA
wandb_run_name: exp3_2468_mbs8_scie_norm
wandb_watch: all
wandb_log_model: true
resume_from_checkpoint: mian
prompt template: alpaca
obalance: False
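(For context: the 32-element lists correspond to LLaMA-2-7B's 32 decoder layers. This run uses the 2-4-6-8 allocation, 2 experts per layer in the first 8 layers up through 8 experts in the last 8, with top-2 routing throughout. Below is a minimal sketch of how such per-layer lists expand into one config per layer; LayerMoEConfig and make_layer_configs are hypothetical names for illustration, not from the repo, which consumes the flat lists directly.)

```python
# Sketch: expand MoLA-style per-layer lists into one config object per layer.
# LayerMoEConfig / make_layer_configs are hypothetical helpers.
from dataclasses import dataclass

@dataclass
class LayerMoEConfig:
    layer_idx: int
    lora_r: int        # LoRA rank of each expert in this layer
    num_experts: int   # number of LoRA experts in this layer
    top_k: int         # experts activated per token by the router

def make_layer_configs(lora_r, number_experts, top_k):
    assert len(lora_r) == len(number_experts) == len(top_k) == 32  # LLaMA-2-7B layers
    return [LayerMoEConfig(i, r, n, k)
            for i, (r, n, k) in enumerate(zip(lora_r, number_experts, top_k))]

# The 2-4-6-8 allocation from the log: 8 layers each with 2, 4, 6, then 8 experts.
number_experts = [2] * 8 + [4] * 8 + [6] * 8 + [8] * 8
configs = make_layer_configs([8] * 32, number_experts, [2] * 32)
print(configs[0])   # LayerMoEConfig(layer_idx=0, lora_r=8, num_experts=2, top_k=2)
print(configs[-1])  # LayerMoEConfig(layer_idx=31, lora_r=8, num_experts=8, top_k=2)
```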

Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 4.44s/it]
/home/wangxiaochen/miniconda3/envs/mola/lib/python3.10/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
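(Aside: this FutureWarning should not affect the score; newer peft versions simply renamed the int8 preparation helper. A sketch of the drop-in rename, assuming peft >= 0.4 and a transformers version that still accepts load_in_8bit=True:)

```python
# Drop-in rename for the deprecated peft helper (assumes peft >= 0.4 and a
# transformers version that still accepts load_in_8bit=True).
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training  # was: prepare_model_for_int8_training

model = AutoModelForCausalLM.from_pretrained(
    "/cognitive_comp/wangxiaochen/dmodels/Llama-2-7b-hf",  # path from the config above
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # same behavior, no FutureWarning
```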

NEW PARAMETERS
obalance False
TRAINING MOLA
Checkpoint mian/adapter_model.bin not found
Did not load optimizer and scheduler
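(The last two lines show why nothing was resumed: resume_from_checkpoint is set to mian, possibly a typo for a real checkpoint directory, so mian/adapter_model.bin is not found and training starts from the bare base model. If you do intend to resume, here is an illustrative pre-flight check, not code from the repo:)

```python
# Illustrative pre-flight check (not code from the repo): confirm the resume
# directory actually contains adapter weights before launching training.
import os

resume_from_checkpoint = "mian"  # value taken from the config dump above
adapter_path = os.path.join(resume_from_checkpoint, "adapter_model.bin")

if os.path.exists(adapter_path):
    print(f"OK: will resume from {adapter_path}")
else:
    print(f"WARNING: {adapter_path} not found; "
          "training will start from the base model only.")
```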

2018211801 commented 4 months ago

Was the number of training epochs a typo? It should be 10, not 1.

GCYZSL commented 4 months ago

Hello, the parameters you ran are the ones used to check that the code executes, not the actual training parameters. The ScienceQA training parameters appear below those; search the README page for "Training on ScienceQA data" and use the parameters listed right under that text. Also, your trained model was not loaded; please confirm that the load path matches the directory where your model is actually stored.

2018211801 commented 4 months ago

But I wasn't resuming from a checkpoint; I loaded the base model directly and trained in a single run. With 10 epochs it now reaches 90.4, only 2 points short of the 20-epoch result.

GCYZSL commented 4 months ago

I must have misunderstood. Results can vary from machine to machine; you can run a grid search over the number of epochs to reproduce the reported result. Thank you!
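(For anyone reproducing this, a minimal sketch of such an epoch grid search follows. The entry-point name mola_training.py and the flag names are assumptions mirroring the config keys above; substitute the repo's actual training script and arguments.)

```python
# Sketch: grid search over num_epochs by relaunching training per value.
# "mola_training.py" and the flag names are assumptions, not from the repo.
import subprocess

for epochs in (10, 15, 20):
    subprocess.run(
        [
            "python", "mola_training.py",
            "--base_model", "/path/to/Llama-2-7b-hf",
            "--data_path", "/path/to/science_qa.hf",
            "--output_dir", f"./scienceqa_epochs{epochs}",
            "--num_epochs", str(epochs),
        ],
        check=True,  # abort the sweep if a run fails
    )
```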