SAI990323 / TALLRec

Apache License 2.0
190 stars · 31 forks

There were missing keys in the checkpoint model loaded #53

Closed KlaineWei closed 4 months ago

KlaineWei commented 5 months ago

Hello, when I run TALLRec, I get the following message near the end of the run:

There were missing keys in the checkpoint model loaded: ['base_model.model.model.embed_tokens.weight', 'base_model.model.model.layers.0.self_attn.q_proj.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.k_proj.weight', 'base_model.model.model.layers.0.self_attn.v_proj.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.o_proj.weight', 'base_model.model.model.layers.0.self_attn.rotary_emb.inv_freq', 'base_model.model.model.layers.0.mlp.gate_proj.weight', 'base_model.model.model.layers.0.mlp.down_proj.weight', 'base_model.model.model.layers.0.mlp.up_proj.weight', 'base_model.model.model.layers.0.input_layernorm.weight', 'base_model.model.model.layers.0.post_attention_layernorm.weight', 'base_model.model.model.layers.1.self_attn.q_proj.weight', 'base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.1.self_attn.k_proj.weight', 'base_model.model.model.layers.1.self_attn.v_proj.weight', 'base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.1.self_attn.o_proj.weight', 'base_model.model.model.layers.1.self_attn.rotary_emb.inv_freq', 'base_model.model.model.layers.1.mlp.gate_proj.weight', 'base_model.model.model.layers.1.mlp.down_proj.weight', 'base_model.model.model.layers.1.mlp.up_proj.weight', 'base_model.model.model.layers.1.input_layernorm.weight', 'base_model.model.model.layers.1.post_attention_layernorm.weight', 'base_model.model.model.layers.2.self_attn.q_proj.weight', 'base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.2.self_attn.k_proj.weight', 'base_model.model.model.layers.2.self_attn.v_proj.weight', 'base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.2.self_attn.o_proj.weight', 'base_model.model.model.layers.2.self_att

The configuration in the bash script I ran is as follows:

echo $1, $2
seed=$2
output_dir="./results/base/"
base_model="daryl149/llama-2-7b-hf"
train_data="./data/movie/train.json"
val_data="./data/movie/valid.json"
instruction_model=None

Did I misconfigure something?

97z commented 5 months ago

I have the same problem; training stopped by itself at about 90%. Did you manage to solve it? My configuration:

echo $1, $2
seed=$2
output_dir=/root/Tallrec/TALLRec/save
base_model=/root/autodl-tmp/llama7bhf/LLama7B-hf
train_data=/root/Tallrec/TALLRec/data/movie/train.json
val_data=/root/Tallrec/TALLRec/data/movie/valid.json
instruction_model=/root/autodl-tmp/alpaca-lora-7B

KlaineWei commented 5 months ago

Not solved yet; still waiting for the author's reply.

97z commented 5 months ago

I noticed that right after this message there is a line saying "If there's a warning about missing keys above, please disregard". Earlier, when training reached epoch 180, there was a checkpoint-170, but when this missing-key message appeared, checkpoint-170 disappeared automatically, and the final output directory contained no model either.
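For context, this warning typically appears because a LoRA checkpoint contains only the adapter tensors, so reloading it into the full model reports every base-model weight as "missing". A minimal sketch of that situation, assuming the standard `peft`/`transformers` APIs rather than TALLRec's exact code:

```python
# Minimal sketch: a LoRA checkpoint stores only the adapter weights, so loading it
# back into the full model reports all base-model keys as missing.
# Assumes the standard `peft`/`transformers` APIs; not TALLRec's exact code.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("daryl149/llama-2-7b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)

model.save_pretrained("./results/base")  # writes only adapter_model.bin (a few MB)
# When the Trainer later reloads this checkpoint, every key that is not part of the
# adapter (embed_tokens, q_proj.weight, ...) is reported as a missing key,
# which is exactly the warning above and can be ignored.
```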

KlaineWei commented 5 months ago

I saw that too. It looks like something went wrong when saving the model at the end, so there is no training result either.

97z commented 5 months ago

I looked through the closed issues; the author mentioned that this "is not an error. You have finished the training phases. We have set an early stop in the code."

97z commented 5 months ago

Bro, I have already got it running end to end. The training result is the adapter file.
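In case it helps others, a minimal sketch of loading that adapter file back for evaluation, assuming the standard `peft` API (the paths here are illustrative, not TALLRec's exact layout):

```python
# Minimal sketch: run evaluation with the trained LoRA adapter.
# Assumes the standard `peft`/`transformers` APIs; paths are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "daryl149/llama-2-7b-hf"
adapter_dir = "./results/base"  # directory containing adapter_config.json / adapter_model.bin

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)  # injects the trained LoRA weights
model.eval()
```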

KlaineWei commented 5 months ago

Could you tell me what you changed?

xiaxin1998 commented 5 months ago

I have the same problem. During training the accuracy is 0.67, but on the test set it is only 0.44. I tried two different base models and got the same result.

xiaxin1998 commented 5 months ago

> Bro, I have already got it running end to end. The training result is the adapter file.

Is there a big gap between your accuracy on the validation set and on the test set?

97z commented 5 months ago

> Could you tell me what you changed?

I first changed the checkpoint-saving parameters so that a checkpoint is saved every 10 epochs. Then, if you want to run the full 200 epochs, you need to increase the EarlyStopping patience. Judging from my results, though, running the full 200 epochs leads to overfitting. You can ignore the missing-key warning.
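For reference, a minimal sketch of the two knobs described above, assuming the training loop uses the standard `transformers` Trainer (the argument names in TALLRec's finetune script may differ):

```python
# Minimal sketch of the two changes described above, assuming the standard
# transformers Trainer API; TALLRec's finetune script may name things differently.
from transformers import TrainingArguments, EarlyStoppingCallback

steps_per_epoch = 100  # illustrative; roughly len(train_set) // effective_batch_size

args = TrainingArguments(
    output_dir="./results/base",
    num_train_epochs=200,
    evaluation_strategy="steps",
    eval_steps=steps_per_epoch,
    save_strategy="steps",
    save_steps=10 * steps_per_epoch,  # keep a checkpoint roughly every 10 epochs
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

# Raise the patience so early stopping does not fire long before epoch 200.
early_stopping = EarlyStoppingCallback(early_stopping_patience=50)
# trainer = Trainer(model=model, args=args, callbacks=[early_stopping], ...)
```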

97z commented 5 months ago

> I have the same problem. During training the accuracy is 0.67, but on the test set it is only 0.44. I tried two different base models and got the same result.

The model I evaluated was the checkpoint-200 adapter, which had already overfit. Even so, the test-set accuracy was still around 64 for movie-to-movie and 60 for movie-to-book. Looking at the training log, the performance around epoch 100 seems better, but I haven't tried it yet.

millenniumbismay commented 4 months ago

I ran the training for 100 epochs for exactly this reason. There is still a large gap between validation and test AUC, which is not ideal, but that is where things stand for now. Check the generated token output of models with k >= 256: almost all generated tokens are 'Yes'. Because the AUC is computed on the logits of the 'Yes' token, the AUC keeps increasing even though the generated text does not improve.
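For reference, a minimal sketch of how this kind of AUC is usually computed from the 'Yes'/'No' token logits rather than from the generated text (illustrative only; TALLRec's evaluation script may differ in details):

```python
# Minimal sketch: score each example by the probability mass on the "Yes" token
# and compute AUC against the binary labels. Illustrative only; details such as
# how "Yes"/"No" are tokenized may differ from TALLRec's evaluation code.
import torch
from sklearn.metrics import roc_auc_score

def yes_probability(last_step_logits: torch.Tensor, yes_id: int, no_id: int) -> float:
    """last_step_logits: [vocab_size] logits at the position where Yes/No is generated."""
    pair = torch.stack([last_step_logits[no_id], last_step_logits[yes_id]])
    return torch.softmax(pair, dim=0)[1].item()

# scores = [yes_probability(logits, yes_id, no_id) for logits in per_example_logits]
# auc = roc_auc_score(labels, scores)  # labels: 1 = positive interaction, 0 = negative
```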

SAI990323 commented 4 months ago

Thank you all for your attention. The replies to the questions under this issue are as follows:

  1. You do not need to worry about the missing keys.
  2. If the performance on the test set is not ideal, check whether the LoRA parameters were actually loaded during testing (different peft versions may cause a LoRA adapter to be randomly initialized for inference); a small verification sketch follows this list. The results should be reproducible, and several works have already demonstrated reproducible results. Please carefully check the compatibility of the environment and the code, as the code was written at a time when these packages did not yet have particularly stable versions.
  3. The output of the model may be influenced by the training distribution.
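A minimal verification sketch for point 2, assuming the standard `peft` API (paths are illustrative): a freshly initialized LoRA has all-zero lora_B matrices, so if every lora_B tensor is still zero after loading, the saved weights were not actually picked up.

```python
# Minimal sketch for point 2: confirm the trained adapter really loaded.
# peft initializes lora_B to zeros, so an all-zero lora_B after loading means the
# adapter was re-initialized instead of loaded from disk.
# Assumes the standard `peft`/`transformers` APIs; paths are illustrative.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("daryl149/llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "./results/base")  # dir with adapter_model.bin

lora_b_mass = sum(
    p.float().abs().sum().item()
    for name, p in model.named_parameters()
    if "lora_B" in name
)
print("sum |lora_B| =", lora_b_mass)  # ~0.0 means the adapter weights were NOT loaded
```
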
xiaxin1998 commented 4 months ago

> The model I evaluated was the checkpoint-200 adapter, which had already overfit. Even so, the test-set accuracy was still around 64 for movie-to-movie and 60 for movie-to-book. Looking at the training log, the performance around epoch 100 seems better, but I haven't tried it yet.

I tried training for 100 epochs and got the same result: the test set is still very poor. I also tried other base models and it was the same.

SAI990323 commented 4 months ago
> I tried training for 100 epochs and got the same result: the test set is still very poor. I also tried other base models and it was the same.

Hi, did you check whether the LoRA file (adapter.bin) read at test time is normal?

xiaxin1998 commented 4 months ago
> Hi, did you check whether the LoRA file (adapter.bin) read at test time is normal?

I checked. The LoRA weights before and after saving the model are identical, and the size of the saved .bin file is also correct.

SAI990323 commented 4 months ago
> I checked. The LoRA weights before and after saving the model are identical, and the size of the saved .bin file is also correct.

Then did the model load succeed? That is, after loading, are the loaded parameters actually the saved parameters rather than randomly initialized ones?

xiaxin1998 commented 4 months ago
> Then did the model load succeed? That is, after loading, are the loaded parameters actually the saved parameters rather than randomly initialized ones?

Yes, it loaded successfully. I printed the weights and they are exactly the same as the parameters at save time.