Here are the losses of the llama2-7b verifier on GSM8K. Hope it helps!
Oh, thanks! Due to limited resources, I have to train with QLoRA, but my loss is very high (much higher than yours). Could using QLoRA for this task be the cause of such a high loss?
Sorry, I am unfamiliar with QLoRA, so it is hard to tell the reason for the high loss directly.
But I can share another low-resource experiment and my finding from it. This is the loss of a llama2-7b verifier on GSM8K with all backbone parameters frozen, tuning only the last added layer for regression. Since the backbone is frozen, the LLM loss remains unchanged.
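For context, a minimal PyTorch sketch of that setup (the class name, the `vscore_head` attribute, and the learning rate are illustrative assumptions, not the repo's exact code): freeze every backbone parameter and train only the added regression head.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class VerifierWithValueHead(nn.Module):
    """Causal LM backbone plus an added scalar regression (value) head."""
    def __init__(self, model_name="meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        # The "last added layer": predicts a per-token scalar value.
        self.vscore_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        values = self.vscore_head(out.hidden_states[-1]).squeeze(-1)
        return out.logits, values

model = VerifierWithValueHead()
# Freeze the backbone; only the regression head receives gradients,
# so the LM loss stays constant and only the value loss is optimized.
for p in model.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```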
A surprising finding is that this verifier's inference performance with beam search is better than that of a fully finetuned mistral7b backbone. This is interesting because mistral7b is generally more powerful than llama2-7b. When the models are swapped, the comparison holds: finetuning only the last added layer on top of the mistral7b generator > fully finetuning from the llama2-7b base.
My hypothesis for this finding is that a better value model is one that knows the generator's distribution, because it has to predict whether the generator can perform well in the future. So initialization from the generator checkpoint can be better than initialization from a more powerful model checkpoint (even with full finetuning). But the final performance is still far from full finetuning from the generator checkpoint (69% -> 35% at K=20, worse than majority voting at 53%). I think the value function is complex, and adjusting only the last layer is too limited to learn it well.
Back to the discussion: I believe QLoRA should lead to a better result than finetuning only the last layer. It keeps a distribution very similar to the generator's and has more learning capacity than the last layer alone. So if your result is worse than tuning only the last layer, something may be wrong (the learning rate? I don't know).
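For comparison, a QLoRA setup for the verifier backbone might look roughly like this (a sketch using the Hugging Face transformers/peft/bitsandbytes APIs; the model name, LoRA hyperparameters, and target modules are assumptions, not values verified against this repo):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the backbone in 4-bit (QLoRA-style NF4 quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the attention projections (assumed targets).
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The added value head (and any "gain"/"bias" parameters) should stay in
# full precision and remain trainable alongside the LoRA adapters.
```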
By "last added layer" do you mean "vscore_head" layer, "gain" and "bias"?
yes, exactly
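For anyone reproducing this, here is a quick way to unfreeze exactly those parameters (a sketch; the parameter names and the `vscore_head` prefix may differ in your checkpoint):

```python
# Unfreeze only the value head plus the top-level "gain"/"bias"
# parameters, and freeze everything else.
def set_verifier_trainable(model):
    for name, param in model.named_parameters():
        # Exact names for the scalar gain/bias and a prefix match for the
        # head, so per-layer bias tensors in the backbone are not matched.
        param.requires_grad = (
            name in ("gain", "bias") or name.startswith("vscore_head")
        )

set_verifier_trainable(model)
print([n for n, p in model.named_parameters() if p.requires_grad])
```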
Thank you so much, truly a great idea
When starting to train the verifier, is all_losses high?