CONE-MT / MindMerger


Need for Theoretical Background on MindMerger Approach #7

Open Kosei1227 opened 2 weeks ago

Kosei1227 commented 2 weeks ago

Hi! Thank you for your outstanding work!

I have been working on improving the LangBridge approach, and I noticed your paper referenced it. As you discussed, LangBridge uses soft prompts generated by an encoder model such as mT5. Although this method is powerful even when only a linear mapping layer is used, it does not leverage the language model's own embeddings of the input text. Consequently, for LLMs such as Gemma 2 and other strongly multilingual models, their own embedding information is effectively discarded, which makes it difficult for LangBridge to bridge with these models.
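
For concreteness, here is a minimal sketch (plain PyTorch with made-up hidden sizes and dummy tensors; `enc_hidden` stands in for mT5 encoder outputs) of the LangBridge-style path I mean, where a linear mapping turns encoder states into soft prompts and the LLM's own embedding of the query is never consulted:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: an mT5-style encoder (d=1024) bridged to an LLM (d=4096).
enc_dim, llm_dim = 1024, 4096
batch, enc_len = 2, 16

mapping = nn.Linear(enc_dim, llm_dim)               # the only trained bridge

enc_hidden = torch.randn(batch, enc_len, enc_dim)   # multilingual encoder outputs
soft_prompts = mapping(enc_hidden)                  # (batch, enc_len, llm_dim)

# The LLM consumes only these soft prompts (plus any instruction embeddings);
# it never embeds the original query tokens itself, so whatever multilingual
# knowledge its own embedding table carries goes unused.
llm_inputs_embeds = soft_prompts
```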

Motivated by this limitation, I hypothesized that in addition to using the encoder outputs as soft prompts, incorporating the LLM's own embeddings would enhance the overall performance. I was excited to see that your research implements an idea similar to mine. However, I have a question regarding your training methodology.

From my understanding, during the initial mapping phase, you train the mapping layer to align the features of general English and target language texts. In the subsequent augmentation phase, the embeddings of the language model are used, akin to prompt-tuning.
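
As a rough sketch of how I picture that augmentation phase (again with dummy tensors; `mapping` stands for the layer trained in the mapping phase and `llm_embed` for the LLM's own embedding table, both just placeholders here), the mapped encoder states would simply be concatenated with the LLM's own embeddings of the query and passed to the model as `inputs_embeds`:

```python
import torch
import torch.nn as nn

enc_dim, llm_dim, vocab = 1024, 4096, 32000
batch, enc_len, txt_len = 2, 16, 24

mapping = nn.Linear(enc_dim, llm_dim)      # aligned during the mapping phase
llm_embed = nn.Embedding(vocab, llm_dim)   # stand-in for the LLM's frozen embedding table

enc_hidden = torch.randn(batch, enc_len, enc_dim)      # multilingual encoder outputs
query_ids = torch.randint(0, vocab, (batch, txt_len))  # the same query, tokenized for the LLM

soft_prompts = mapping(enc_hidden)   # (batch, enc_len, llm_dim)
own_embeds = llm_embed(query_ids)    # (batch, txt_len, llm_dim)

# Augmentation phase: the LLM sees both views of the query side by side.
inputs_embeds = torch.cat([soft_prompts, own_embeds], dim=1)
```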

Do we truly need query translation datasets for the augmentation stage? As in LangBridge, I suspect that using only English text datasets might suffice, since multilingual encoder-decoder models generally produce language-agnostic embeddings. I would like to better understand the effectiveness of your two-stage training strategy.

I hypothesize that the significant improvements in performance stem from the combination of soft prompts and the embeddings from the language model. Could you elaborate on the theoretical and empirical reasoning behind adopting the two-stage approach in this paper?

Thank you again for your contribution, and I look forward to your insights!

HuangZixian commented 1 week ago

Hi @Kosei1227!

Thank you very much for your insightful question. Sorry it took us a few days to notice it; we are glad to address it now. We would like to share the following insights based on our experimental findings:

  1. The effectiveness of the first stage:

    • This stage is essential for aligning the representational space of LLMs and multilingual encoders, particularly for low-resource languages. As shown in Table 11 of our paper, removing this stage results in an 11.0%-20.2% drop in mathematical reasoning performance for low-resource languages.
    • Note that LangBridge relies on task-specific data for alignment, and since it is difficult to obtain sufficient task data, this can limit the effectiveness of the space alignment. In contrast, we chose translation data for training in this stage, as it offers a plentiful supply of training samples across a wide range of languages.
  2. The necessity of query translation data for the second stage:

    • As you mentioned, we also tried training the second stage using only English data, without query translation data, but this approach was unsuccessful. The reason is that when only English data is used, the model tends to rely on the LLM's own embeddings during training and ignores the input from the multilingual encoder. This happens because the LLM's own embeddings already lie in its representation space, which is easier for the model to understand than the multilingual encoder's input. As a result, training the second stage with only English data leads the model to forget the space alignment learned in the first stage.
    • Using query translation data prevents the LLMs from relying solely on their own embeddings. To minimize the loss during training, for languages in which the LLMs are proficient, the models tend to prioritize their own embeddings and supplement them with input from the multilingual encoder. Conversely, for languages in which the LLMs are less proficient, they tend to rely more on the multilingual encoder's input and complement it with their own embeddings. Therefore, using query translation data encourages the model to learn to integrate both sources of input in a coordinated manner (see the simplified data sketch after this list).
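
To make the data setup concrete, here is a simplified sketch (hypothetical field names, not our actual data loader) of how the two stages differ: the first stage pairs a non-English sentence for the encoder with its English translation as the target, while the second stage feeds the same non-English query through both the multilingual encoder and the LLM's own embeddings, so the model cannot minimize the loss by attending to only one of the two inputs.

```python
def build_stage1_example(src_text, en_translation):
    # Mapping stage: translation data aligns the encoder + mapping output with
    # the LLM's space; the LLM's own embedding of src_text is not needed yet.
    return {
        "encoder_text": src_text,     # e.g. a Swahili sentence
        "llm_text": "",               # no LLM-side copy of the query
        "target": en_translation,     # the English translation as supervision
    }

def build_stage2_example(query_text, answer):
    # Augmentation stage: query translation data. The same non-English query
    # goes to both the multilingual encoder and the LLM's own embeddings, so
    # the model must learn to combine the two views instead of ignoring one.
    return {
        "encoder_text": query_text,   # e.g. a translated math question
        "llm_text": query_text,       # also embedded by the LLM itself
        "target": answer,
    }
```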

Thank you again for your insightful comments!