Open · Kosei1227 opened this issue 2 weeks ago
Hi @Kosei1227!
Thank you very much for your insightful question. It took us a few days to notice it; we are glad to address it now. We would like to share the following insights based on our experimental findings:
The effectiveness of the first stage:
The necessity of query translation data for the second stage:
Thank you again for your insightful comments!
Hi! Thank you for your outstanding work!
I have been working on improving the LangBridge approach, and I noticed your paper referenced it. As you discussed, LangBridge uses soft prompts generated by an encoder model such as mT5. Although this method is powerful even when using only a linear mapping layer, it does not use the language model's own embeddings of the input text. Consequently, for LLMs such as Gemma 2 and other strong multilingual models, the information in their own embeddings is effectively discarded, which makes it difficult for LangBridge to bridge with these models.
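To make the limitation concrete, here is a minimal shape-level sketch of a LangBridge-style bridge. All dimensions, names, and the random initialization are hypothetical placeholders, not values from LangBridge or the paper; the point is only that the LLM receives the mapped encoder states and never looks up its own token embeddings of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: mT5 encoder width, LLM hidden width,
# and the length of the encoded input sequence.
d_enc, d_llm, src_len = 512, 1024, 16

# Encoder hidden states for one input text (batch dimension omitted).
enc_hidden = rng.normal(size=(src_len, d_enc))

# LangBridge-style bridge: a single linear mapping projects encoder
# states into the LLM's embedding space to serve as soft prompts.
W = rng.normal(size=(d_enc, d_llm)) * 0.02
soft_prompts = enc_hidden @ W          # shape (src_len, d_llm)

# The LLM consumes only these soft prompts; its own embedding table
# is never consulted, so its multilingual embedding knowledge is lost.
llm_inputs = soft_prompts
print(llm_inputs.shape)                # (16, 1024)
```

In a real implementation the mapped states would be passed to the decoder-only LLM as `inputs_embeds` rather than token IDs; the sketch stops at the shape of that input.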
Motivated by this limitation, I hypothesized that incorporating the LLM's own embeddings, in addition to using the encoder outputs as soft prompts, would improve overall performance. I was excited to see that your research implements an idea similar to mine. However, I have a question regarding your training methodology.
From my understanding, during the initial mapping phase, you train the mapping layer to align the features of general English and target-language texts. In the subsequent augmentation phase, the language model's own embeddings are also used, akin to prompt tuning.
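My reading of the two stages can be sketched at the shape level as follows. This is only my interpretation, with hypothetical dimensions and random stand-ins for trained weights; the key difference from the LangBridge setup is that the mapped encoder states are prepended to the LLM's own embeddings of the input rather than replacing them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: encoder width, LLM width, encoder sequence
# length, and the length of the LLM-tokenized input text.
d_enc, d_llm, enc_len, txt_len = 512, 1024, 16, 20

# Stage 1 (mapping): the bridge W is trained so that mapped encoder
# features of English and target-language texts align.
enc_hidden = rng.normal(size=(enc_len, d_enc))
W = rng.normal(size=(d_enc, d_llm)) * 0.02
soft_prompts = enc_hidden @ W                    # (enc_len, d_llm)

# Stage 2 (augmentation): the soft prompts are prepended to the LLM's
# own token embeddings of the same input, so both views reach the model.
llm_token_embeds = rng.normal(size=(txt_len, d_llm))
llm_inputs = np.concatenate([soft_prompts, llm_token_embeds], axis=0)
print(llm_inputs.shape)                          # (36, 1024)
```

If this reading is correct, the augmentation stage optimizes the combined sequence end to end, which is where my question about the training data arises.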
Do we truly need query translation datasets for the augmentation stage? As in LangBridge, I suspect that using only English text datasets might suffice, since encoder-decoder models generally produce language-agnostic representations. I would like to better understand the effectiveness of your two-stage training strategy.
I hypothesize that the significant improvements in performance stem from the combination of soft prompts and the embeddings from the language model. Could you elaborate on the theoretical and empirical reasoning behind adopting the two-stage approach in this paper?
Thank you again for your contribution, and I look forward to your insights!