clinicalml / co-llm

Co-LLM: Learning to Decode Collaboratively with Multiple Language Models
https://arxiv.org/abs/2403.03870

Questions Regarding the Experimental Results #6

Closed hechengbo-H closed 2 months ago

hechengbo-H commented 3 months ago

Hi, 

I recently read your work in detail and found the idea of combining large language models with learning to defer to be quite creative. However, I have a few questions that I would like to discuss with you.

  1. In Table 1, which shows Collaboration across domains, it seems that the "Co-LLM-7B + LLEMMA-34B" method does not consistently outperform the "PT (LLEMMA-34B + LLEMMA-7B)" method in several tasks, such as GSM, Factoid, List, and Yes/No. (I'll highlight the highest value in red)
  2. In Table 2, which shows Collaboration across scales, it appears that the "Co-LLM-7B + LLAMA-70B" method does not outperform the "LLAMA-70B+7B (PT)" method across all tasks (except for the List task). Additionally, the collaborative approach does not seem to perform better than the standalone LLAMA-70B (QLoRA). (I'll highlight the highest value in red)

Given these observations, it seems that your method does not outperform the PT approach or a single larger language model in terms of the metrics. Could you please explain how your method demonstrates its effectiveness? I look forward to your response and wonder if others have had similar thoughts. I hope to further communicate and learn from you.

[image]

  

lolipopshock commented 3 months ago

Thank you for your comment. I think you have two primary questions about our paper:

Q1: Co-LLM is not as strong as the Proxy Tuning baseline

  • In Table 1, which shows Collaboration across domains, it seems that the "Co-LLM-7B + LLEMMA-34B" method does not consistently outperform the "PT (LLEMMA-34B + LLEMMA-7B)" method in several tasks, such as GSM, Factoid, List, and Yes/No.
  • In Table 2, which shows Collaboration across scales, it appears that the "Co-LLM-7B + LLAMA-70B" method does not outperform the "LLAMA-70B+7B (PT)" method across all tasks (except for the List task).

As we already mentioned in the paper (the last paragraph of 5.1 and the last paragraph of 5.2), Co-LLM is better than PT in terms of:

Q2: Co-LLM might not be as strong as QLoRA tuning

Additionally, the collaborative approach does not seem to perform better than the standalone LLAMA-70B (QLoRA). (I'll highlight the highest value in red)

hechengbo-H commented 3 months ago

Thank you for your response. I had overlooked the aspect of efficiency earlier. I'll make sure to pay closer attention to it moving forward.