ChatGLM2-6B: An Open Bilingual Chat LLM

LoRA fine-tuning: f1 from predicting right after training ≠ f1 from loading the saved checkpoint and predicting #572

Open Doufanfan opened 11 months ago

Doufanfan commented 11 months ago

Current Behavior

This is adapted from main.py, with LoRA fine-tuning logic added. I haven't gotten multi-GPU fine-tuning working yet, only single-GPU 🤣
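
For readers asking the same thing below, a minimal sketch of how peft-style LoRA is typically wired into the main.py flow. The rank/alpha/dropout values, target_modules, and checkpoint path are illustrative assumptions, not the poster's actual code:

```python
# Minimal LoRA wiring for ChatGLM2-6B with peft; hyperparameters,
# target_modules, and paths are illustrative assumptions.
from transformers import AutoModel
from peft import LoraConfig, PeftModel, TaskType, get_peft_model

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query_key_value"],  # ChatGLM2 fuses Q/K/V into one projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable

# ... train as in main.py (Seq2SeqTrainer etc.) ...

# To reproduce the "load checkpoint-5000 and predict" path, reload the
# adapter onto a fresh base model (checkpoint path is illustrative):
base = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(base, "output/checkpoint-5000").eval()
```

ChatGLM2 keeps its attention projections fused in the single query_key_value module, which is why that one module name is the usual LoRA target for this model.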

Training with max_steps=5000, save_steps=1000. Right after training finished, eval/predict f1 = 0.68/0.71; but loading checkpoint-5000 and predicting gave eval/predict f1 = 0.69/0.73 😱. Stranger still: with the same checkpoint-5000 loaded, different per_device_eval_batch_size values give different eval/predict f1 😱...

- per_device_eval_batch_size=6: eval/predict f1 = 0.68/0.73
- per_device_eval_batch_size=16: eval/predict f1 = 0.69/0.74
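
One known cause of batch-size-dependent metrics in generate-based eval is padding combined with fp16 numerics: batching pads the shorter inputs, and the extra positions can shift logits just enough to flip greedy decoding on borderline tokens. A quick sanity check along these lines (a sketch assuming an HF-style tokenizer/model; the prompts are placeholders, not the poster's data):

```python
# Sanity check: decode the same prompts at two batch sizes and diff the
# outputs. If they differ, batched generation is not batch-size-invariant
# (padding + fp16), which would explain the shifting f1.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda().eval()

prompts = ["第一条测试输入", "第二条测试输入，长度故意不同以触发padding"]  # placeholders

def decode_all(batch_size):
    outs = []
    for i in range(0, len(prompts), batch_size):
        batch = tokenizer(prompts[i:i + batch_size], return_tensors="pt",
                          padding=True).to(model.device)
        with torch.no_grad():
            ids = model.generate(**batch, max_new_tokens=64, do_sample=False)
        # strip the (padded) prompt tokens, keep only the generated part
        outs += tokenizer.batch_decode(ids[:, batch["input_ids"].shape[1]:],
                                       skip_special_tokens=True)
    return outs

for a, b in zip(decode_all(1), decode_all(2)):
    if a != b:
        print("batch-size-dependent output:", a, "vs", b)
```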

One more ask: is there a demo of multi-GPU LoRA fine-tuning anywhere~? Fine-tuning on two GPUs fails with:

```
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons:
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 55 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
```
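
For what it's worth, this "marked as ready twice" failure is the classic conflict between reentrant gradient checkpointing and DDP: the checkpointed re-forward fires DDP's autograd hooks a second time for the same parameters. A sketch of the usual workarounds; flag availability depends on your torch/transformers versions, so treat these as assumptions to try, not a confirmed fix:

```python
# Common workarounds for LoRA + gradient checkpointing under DDP; all
# values here are illustrative assumptions, not a confirmed fix.
from transformers import AutoModel, TrainingArguments

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# 1) Non-reentrant gradient checkpointing: the re-forward no longer fires
#    DDP's autograd hooks twice (gradient_checkpointing_kwargs needs a
#    recent transformers; older versions take no arguments here).
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
model.enable_input_require_grads()  # needed when LoRA freezes the base weights

# 2) With the HF Trainer, skip the unused-parameter scan that interacts
#    badly with LoRA's mostly-frozen parameter set:
training_args = TrainingArguments(
    output_dir="output",  # illustrative
    ddp_find_unused_parameters=False,
)

# 3) Or, as the error message itself suggests, mark the DDP graph static on
#    the raw wrapper (how you reach it depends on your launcher):
# ddp_model._set_static_graph()
```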
wizardforcel commented 11 months ago

Metrics computed on different samples naturally come out different. Maybe the runs just didn't happen to load the same samples.

Doufanfan commented 11 months ago

> Metrics computed on different samples naturally come out different. Maybe the runs just didn't happen to load the same samples.

🤣 The validation and test data are identical every run: I split the samples into three files for train, eval, and test up front, so every run reads exactly the same data for each split 🤭

wfllyzh commented 9 months ago

Hi, how do you modify main.py for single-GPU LoRA fine-tuning? After my changes, running it immediately fails with RuntimeError: Expected to mark a variable ready only once.