Exploring the Potential of High-Quality Synthetic Datasets Using Translation-agent to Enhance the Performance of Multilingual Large Language Models

universea commented 5 months ago

Background: Multilingual large models often have too little training data for non-English languages, and what exists is frequently of poor quality. High-quality training data is crucial for model performance, and for under-resourced minority languages the lack of it is often a major limiting factor. Could translating an English corpus into various minority languages to create high-quality synthetic corpora therefore be a potential solution?

Proposal: The following steps might help generate and use synthetic datasets to enhance the capabilities of large multilingual models, especially for minority languages:

Selection and Optimization of Translation Models: Use the translation-agent project or another efficient translation model as the translation tool to ensure high-quality output.

Generation of Synthetic Datasets: Utilize the aforementioned translation model to process English data, generating synthetic datasets in various target languages. Clean and validate the generated data to ensure its quality.

Training and Fine-Tuning of Multilingual Large Language Models: Use the synthetic datasets to train or fine-tune multilingual large models, observing improvements in the models' ability to process multilingual texts.

Evaluation and Iteration: Assess the model performance through standardized tests and real-world scenario testing, with a particular focus on the performance in minority languages, and iterate on the model and data generation processes based on feedback.
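The generation and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `translate_fn` is a hypothetical callable standing in for whatever translation tool is chosen, and the checks (empty output, untranslated copy, length ratio, exact duplicates) and their thresholds are assumptions chosen for illustration only.

```python
from typing import Callable, Dict, Iterable, List


def build_synthetic_corpus(
    english_texts: Iterable[str],
    translate_fn: Callable[[str], str],  # hypothetical: wraps the chosen translator
    min_ratio: float = 0.3,  # assumed lower bound on target/source length
    max_ratio: float = 3.0,  # assumed upper bound on target/source length
) -> List[Dict[str, str]]:
    """Translate English texts and keep only pairs that pass basic checks."""
    corpus: List[Dict[str, str]] = []
    seen = set()
    for source in english_texts:
        source = source.strip()
        if not source:
            continue
        target = translate_fn(source).strip()
        # Drop empty outputs and untranslated copies of the source.
        if not target or target == source:
            continue
        # Drop outputs whose length is implausible relative to the source.
        ratio = len(target) / len(source)
        if not (min_ratio <= ratio <= max_ratio):
            continue
        # Drop exact duplicate translations.
        if target in seen:
            continue
        seen.add(target)
        corpus.append({"source": source, "target": target})
    return corpus


def demo_translate(text: str) -> str:
    """Stand-in translator used only to exercise the filter."""
    table = {"Hello": "Hola", "Thank you": "Gracias", "Broken": ""}
    return table.get(text, text)


if __name__ == "__main__":
    texts = ["Hello", "  ", "Broken", "Copy", "Thank you", "Hello"]
    for pair in build_synthetic_corpus(texts, demo_translate):
        print(pair)
```

Heuristic filters like these only catch obvious failures; judging whether a translation is actually faithful would need human review or a separate quality model.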

Discussion: This approach could greatly promote data generation and model training for minority languages, as well as improve the overall performance and application breadth of multilingual models. I am eager to discuss the feasibility of this plan, particularly the technical implementation and resource requirements. If you are interested in this topic or have relevant experience, please join the discussion so we can explore this field together.

enismaxim1 commented 5 months ago

Having conducted research on a closely related topic, I think that this approach is unlikely to work.

A useful mental model for how LLMs operate on different languages is the following: they first convert text (in any particular language) into a shared vector space representing meaning. Crucially, this vector space is language-agnostic: the LLM will map a Russian text and an English text of equal meaning to nearly the same region of the vector space. The conversion from language into meaning is essentially translation, and the LLM is lossy in this conversion for precisely the languages it translates less effectively.

But the point is that the LLM's ability to effectively process low-resource language text is limited by its underlying translation ability. If we generate synthetic data, then we are limited by exactly the same thing, meaning that we are unlikely to be able to extract extra signal from synthetic data with this approach.
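One rough way to probe this mental model is to check how close the representations of parallel sentences are: under the language-agnostic picture, a sentence and its translation should land near each other, and the gap should grow for languages the model handles poorly. The sketch below assumes a hypothetical `embed_fn` that returns a sentence vector; how to obtain such vectors from a particular LLM is not specified in this thread, and the toy vectors are invented purely to exercise the code.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_alignment(
    pairs: Iterable[Tuple[str, str]],
    embed_fn: Callable[[str], Sequence[float]],  # hypothetical sentence embedder
) -> float:
    """Average cosine similarity over (english, other_language) sentence pairs."""
    scores = [cosine_similarity(embed_fn(en), embed_fn(other)) for en, other in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "gato": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "perro": [1.0, 1.0],
}


def toy_embed(text: str) -> Sequence[float]:
    return TOY_EMBEDDINGS[text]


if __name__ == "__main__":
    print(mean_alignment([("cat", "gato"), ("dog", "perro")], toy_embed))
```

Comparing this score across languages for the same set of English sentences would give a crude per-language view of where the conversion into the shared space is lossy.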

A related way to generate synthetic data that does work is to use LLMs, which are better at translating particular documents than existing translation systems, to train smaller translation systems to match the translation performance of LLMs. If you're interested, see issue #9 (or the paper https://arxiv.org/abs/2404.13813).

siddhantx0 commented 5 months ago

I can try Hindi/Bhojpuri bad words...

random-yang commented 5 months ago

Do you mean that even an LLM contains very little low-resource language knowledge, so we cannot effectively distill that knowledge to further improve low-resource language translation?

enismaxim1 commented 5 months ago

No, I just mean that translating English data to a low-resource language using an LLM and then training it on that same data is unlikely to yield any performance improvement. The argument is that the LLM's level of multilinguality should come from its ability to translate; therefore, translation does not provide any additional useful data to the LLM.

universea commented 5 months ago

Thank you for your comment. I would also like to know whether this method can improve the reasoning, logic, and mathematical abilities of large models in low-resource languages.

enismaxim1 commented 5 months ago

I would guess that it cannot. The reasoning/logic/math capabilities should be independent of any particular language (instead operating on the shared latent space of meaning). Therefore I would expect the logic/math/reasoning capabilities to again be bottlenecked by translation/language understanding, which I don't expect this method can improve.

siddhantx0 commented 5 months ago

Your point about reasoning being language-independent is valid. However, language-specific training might still enhance performance indirectly by improving comprehension and expression. While core logic remains unchanged, better language understanding can facilitate more accurate interpretations and responses, potentially leading to improved overall reasoning capabilities.

universea commented 5 months ago

I mostly support your viewpoints, but I have also noticed a phenomenon. For example, many large models can answer an English math question quite well, but when the same math question is presented in other languages, the accuracy and quality of the answers from the large model tend to decline. If we translate the math question from a lesser-known language into English, the quality and accuracy of the model's answers improve. How do you view this issue?

siddhantx0 commented 5 months ago

How do we add new remote languages?

siddhantx0 commented 5 months ago

More languages means more data, which means more insight.

enismaxim1 commented 5 months ago

This is interesting. If it is true, then these methods should work. Did you translate the problem into English using an LLM or with something else (like Google)?
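The phenomenon described here could be measured with a small harness that asks the same questions two ways: directly in the original language, and after translating them into English first. This is only a sketch under stated assumptions: `answer_fn` (the model being tested) and `to_english_fn` (an LLM, Google, or any other translator) are hypothetical callables, and exact string match is an assumed scoring rule that a real evaluation would likely need to relax.

```python
from typing import Callable, Dict, Iterable, Tuple


def compare_direct_vs_translated(
    items: Iterable[Tuple[str, str]],     # (question in target language, expected answer)
    answer_fn: Callable[[str], str],      # hypothetical: queries the model under test
    to_english_fn: Callable[[str], str],  # hypothetical: translates a question to English
) -> Dict[str, float]:
    """Return accuracy when asking directly vs. after translating to English."""
    direct = translated = total = 0
    for question, expected in items:
        total += 1
        # Ask in the original language.
        if answer_fn(question).strip() == expected:
            direct += 1
        # Translate to English first, then ask.
        if answer_fn(to_english_fn(question)).strip() == expected:
            translated += 1
    if total == 0:
        return {"direct": 0.0, "translated": 0.0}
    return {"direct": direct / total, "translated": translated / total}


# Stand-ins used only to exercise the harness; the strings are placeholders.
def demo_to_english(question: str) -> str:
    return {"q1_xx": "q1_en", "q2_xx": "q2_en"}[question]


def demo_answer(question: str) -> str:
    return {"q1_xx": "4", "q2_xx": "?", "q1_en": "4", "q2_en": "5"}[question]


if __name__ == "__main__":
    items = [("q1_xx", "4"), ("q2_xx", "5")]
    print(compare_direct_vs_translated(items, demo_answer, demo_to_english))
```

Running the harness once per translator would also separate the effect of the translation tool from the effect of the model itself.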

andrewyng / translation-agent, issue #18