Thank you for your comment. I think you have two primary questions about our paper:
Q1: Co-LLM is not as strong as the Proxy Tuning baseline
- In Table 1, which shows collaboration across domains, the "Co-LLM-7B + LLEMMA-34B" method does not seem to consistently outperform the "PT (LLEMMA-34B + LLEMMA-7B)" method on several tasks, such as GSM, Factoid, List, and Yes/No.
- In Table 2, which shows collaboration across scales, the "Co-LLM-7B + LLAMA-70B" method does not appear to outperform the "LLAMA-70B+7B (PT)" method on any task except the List task.
As we already mentioned in the paper (the last paragraph of Section 5.1 and the last paragraph of Section 5.2), Co-LLM is better than PT in terms of efficiency.
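To make the efficiency point concrete, here is a minimal sketch (not our actual implementation: the model calls are stubbed out, the deferral decisions are random placeholders for the learned deferral head, and `SEQ_LEN` / `DEFER_RATE` are arbitrary assumed values). It counts how often each method must run the large model per generated token, assuming the standard proxy-tuning setup where the large base model and a tuned/untuned small-model pair all run at every decoding step, whereas Co-LLM only queries the large model on deferred tokens.

```python
# Sketch contrasting large-model invocations per generated token under
# Co-LLM-style deferral versus Proxy Tuning. All model forward passes are
# stubbed; only the call counts are simulated.

import random

SEQ_LEN = 128      # tokens to generate (assumed)
DEFER_RATE = 0.3   # fraction of tokens deferred to the large model (assumed)

def co_llm_decode(seq_len, defer_rate):
    """Small base model proposes every token; the large model is queried
    only on tokens where the deferral variable fires."""
    small_calls, large_calls = 0, 0
    for _ in range(seq_len):
        small_calls += 1                  # small base model forward pass
        if random.random() < defer_rate:  # stand-in for the learned deferral head
            large_calls += 1              # large model forward pass
    return small_calls, large_calls

def proxy_tuning_decode(seq_len):
    """Proxy tuning combines logits from the large base model with the
    difference of a tuned/untuned small-model pair, so all three models
    run at every token."""
    small_calls, large_calls = 0, 0
    for _ in range(seq_len):
        small_calls += 2                  # tuned + untuned small models
        large_calls += 1                  # large base model
    return small_calls, large_calls

if __name__ == "__main__":
    random.seed(0)
    print("Co-LLM       (small, large calls):", co_llm_decode(SEQ_LEN, DEFER_RATE))
    print("Proxy Tuning (small, large calls):", proxy_tuning_decode(SEQ_LEN))
```

In other words, Co-LLM pays for the large model only on the fraction of tokens that are actually deferred, while PT pays for it on every token.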
Q2: Co-LLM might not be as strong as QLoRA tuning
- Additionally, the collaborative approach does not seem to perform better than the standalone LLAMA-70B (QLoRA). (I'll highlight the highest value in red.)
Thank you for your response. I had overlooked the aspect of efficiency earlier. I'll make sure to pay closer attention to it moving forward.
Hi,
I recently read your work in detail and found the idea of combining large language models with learning to defer to be quite creative. However, I have a few questions that I would like to discuss with you.