Questions Regarding the Experimental Results

hechengbo-H commented 5 months ago

Hi,

I recently read your work in detail and found the idea of combining large language models with learning to defer to be quite creative. However, I have a few questions that I would like to discuss with you.

In Table 1, which shows Collaboration across domains, it seems that the "Co-LLM-7B + LLEMMA-34B" method does not consistently outperform the "PT (LLEMMA-34B + LLEMMA-7B)" method in several tasks, such as GSM, Factoid, List, and Yes/No. (I'll highlight the highest value in red)
In Table 2, which shows Collaboration across scales, it appears that the "Co-LLM-7B + LLAMA-70B" method does not outperform the "LLAMA-70B+7B (PT)" method across all tasks (except for the List task). Additionally, the collaborative approach does not seem to perform better than the standalone LLAMA-70B (QLoRA). (I'll highlight the highest value in red)

Given these observations, it seems that your method does not outperform the PT approach or a single larger language model in terms of the metrics. Could you please explain how your method demonstrates its effectiveness? I look forward to your response and wonder if others have had similar thoughts. I hope to further communicate and learn from you.

lolipopshock commented 5 months ago

Thank you for your comment. I think you have two primary questions about our paper:

Q1: Co-LLM is not as strong as the Proxy Tuning baseline

In Table 1, which shows Collaboration across domains, it seems that the "Co-LLM-7B + LLEMMA-34B" method does not consistently outperform the "PT (LLEMMA-34B + LLEMMA-7B)" method in several tasks, such as GSM, Factoid, List, and Yes/No.

In Table 2, which shows Collaboration across scales, it appears that the "Co-LLM-7B + LLAMA-70B" method does not outperform the "LLAMA-70B+7B (PT)" method across all tasks (except for the List task).

As we already mentioned in the paper (the last paragraph of 5.1 and the last paragraph of 5.2), Co-LLM has two advantages over PT:

PT only performs well when all three models (M, M+, M−) are pretrained on the same domain mix (compare, e.g., “LLEMMA + LLAMA” to “LLEMMA + LLEMMA”), whereas Co-LLM is more effective at enabling collaboration between models from different domains. For example, if in the future only a 70B domain-specific model is pretrained, with no matching 7B version, Co-LLM is a suitable choice.
In terms of efficiency, PT requires more calls to the larger model, which results in slower inference (PT needs to call 3 models for each token generated). Co-LLM makes fewer calls to both the large and small models (in fact, it calls the larger model for only a fraction of the tokens).
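
To make the efficiency point concrete, here is a rough back-of-the-envelope sketch of per-sequence model calls. It is only an illustration, not code from the repository: the function names, the `deferral_rate` parameter, and the assumption that Co-LLM calls the small model once per generated token are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class CallCounts:
    large: float  # forward passes through the large model
    small: float  # forward passes through the small model(s)


def proxy_tuning_calls(num_tokens: int) -> CallCounts:
    # PT runs three models for every generated token:
    # the large model M plus the two small models M+ and M-.
    return CallCounts(large=num_tokens, small=2 * num_tokens)


def co_llm_calls(num_tokens: int, deferral_rate: float) -> CallCounts:
    # Assumption: the small model runs once per token, and the large
    # model runs only on the fraction of tokens that are deferred.
    if not 0.0 <= deferral_rate <= 1.0:
        raise ValueError("deferral_rate must be in [0, 1]")
    return CallCounts(large=deferral_rate * num_tokens, small=num_tokens)


if __name__ == "__main__":
    n = 100
    print("PT:    ", proxy_tuning_calls(n))
    # 0.25 is an arbitrary illustrative rate, not a number from the paper.
    print("Co-LLM:", co_llm_calls(n, deferral_rate=0.25))
```

Under these assumptions, Co-LLM never uses more large-model calls than PT, and the gap grows as the deferral rate shrinks.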

Q2: Co-LLM might not be as strong as QLoRA tuning

Additionally, the collaborative approach does not seem to perform better than the standalone LLAMA-70B (QLoRA). (I'll highlight the highest value in red)

QLoRA aims to modify the 70B model weights, whereas Co-LLM only tunes a 7B model and keeps the 70B model unchanged.
Similarly, the QLoRA baseline calls the 70B model for every token generated, whereas Co-LLM only uses the 70B model for a fraction of the tokens.
Given these, it is in a sense somewhat unexpected and surprising that Co-LLM can sometimes beat QLoRA. We see this not as a negative result but as an interesting finding. It reveals that one can do some "clever" work during training and achieve the same or better effect without tuning the 70B model, which can be a promising direction for future work.
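
As a toy illustration of "only uses the 70B model for a fraction of tokens", the loop below sketches token-level deferral with stub models. This is a simplified sketch and not our actual implementation: `small_next`, `large_next`, and `should_defer` are hypothetical callbacks, and the hand-written gate stands in for the deferral decision that the tuned small model would make.

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def generate_with_deferral(
    small_next: Callable[[Tokens], str],
    large_next: Callable[[Tokens], str],
    should_defer: Callable[[Tokens], bool],
    prompt: Tokens,
    max_new_tokens: int,
) -> Tuple[Tokens, int]:
    """Generate tokens one at a time, deferring some of them to the large model."""
    tokens = list(prompt)
    large_calls = 0
    for _ in range(max_new_tokens):
        if should_defer(tokens):
            # The frozen large model produces this token.
            tokens.append(large_next(tokens))
            large_calls += 1
        else:
            # The small model produces this token by itself.
            tokens.append(small_next(tokens))
    return tokens, large_calls


if __name__ == "__main__":
    # Stub models and a stub gate, purely for illustration.
    out, n_large = generate_with_deferral(
        small_next=lambda t: "s",
        large_next=lambda t: "L",
        should_defer=lambda t: len(t) % 4 == 0,
        prompt=["<bos>"],
        max_new_tokens=8,
    )
    print(out, n_large)
```

In contrast, a QLoRA-tuned 70B model would take the `large_next` branch on every step.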

hechengbo-H commented 5 months ago

Thank you for your response. I had overlooked the aspect of efficiency earlier. I'll make sure to pay closer attention to it moving forward.

clinicalml / co-llm

Questions Regarding the Experimental Results #6