Closed — yflyzhang closed this issue 7 months ago
Thanks for your interest in our work. We do not explicitly optimize the KL divergence shown in Eq. 2. The final objective in Eq. 2 can be decomposed into two instruction-tuning tasks, as shown in Eqs. 7, 8, and 10. As a result, it can be optimized directly with the standard LLM training objective (next-token prediction), which also improves training speed. The detailed derivation can be found in Section A.1.
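To make the decomposition concrete, here is a minimal sketch (not the repository's actual code) of how each instruction-tuning task reduces to ordinary next-token cross-entropy with a causal LM. The model name, prompt formats, helper function, and the way the two losses are combined are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch, assuming the KL objective (Eq. 2) decomposes into two
# instruction-tuning tasks (Eqs. 7, 8, 10), each trained with standard
# next-token prediction. All names and prompt formats below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone; substitute the model the paper actually uses
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def instruction_tuning_loss(prompt: str, target: str) -> torch.Tensor:
    """Cross-entropy over the target tokens only, i.e. ordinary causal LM
    training where the prompt positions are masked out of the loss."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    return model(input_ids=input_ids, labels=labels).loss

# Hypothetical instances of the two tasks; the real prompt/target templates
# come from the paper's Eqs. 7, 8, and 10.
planning_loss = instruction_tuning_loss("Question: ...\nPlan:", " step 1 ...")
reasoning_loss = instruction_tuning_loss("Plan and retrieved context: ...\nAnswer:", " ...")
total_loss = planning_loss + reasoning_loss  # combine and backpropagate as usual
```

In other words, no explicit KL term appears in the training loop; each sub-task is just a labeled sequence fed through the usual next-token-prediction loss.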
Thank you very much for the clarification and the additional implementation details; that resolves my initial confusion about how the optimization is carried out.
Once again, thanks for your work!
Thanks for sharing this work!
The paper mentions using KL divergence as a loss function for planning optimization. However, I couldn't locate the code that implements this KL-divergence loss, or the retrieval-reasoning loss.
Could you or someone else please point me to the relevant files or provide more information on where these components are implemented?