Open zzysay opened 11 months ago
Hi, sorry for the last response. In the appendix A of our paper, we follow [1] to show that Z(x)≈ 1, and therefore the log Z(x) terms can be treated as 0 in practice. Meanwhile, to test the effect of various log Z(x), the config --ulma_constant_zx
can be used. Actually we are considering adding a new ablation study on testing how Z(x) affects the method's performance, and hopefully the results will be updated to the Arxiv in the coming few weeks. Hope that helps!
[1] Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I Jordan, and Jiantao Jiao. Fine-tuning language models with advantage-induced policy alignment.
ULMA is an elegant work and thanks for sharing the code!
In the original DPO and ULMA, there is a partition function Z(x), which is only related to the reference model and can be seen as a constant in the optimization. So, how to set constant_zx here? it seems always zero in all experiments.
Looking forward for your quickly reply! 😄