LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
37.1k stars 3.24k forks

Automatic translation of prompts and answer to multiple languages #1135

Open ParisNeo opened 1 year ago

ParisNeo commented 1 year ago

Hi there,

I am excited about this project. I have already used MBART50 from Facebook to translate text between English and other languages, and the translations were pretty solid.

We can greatly augment the database by capitalizing on its multiple languages: cross-language translation would pump up the amount of data. So if anyone adds a text in any language, be it English or another one, we can translate it into all the other languages while keeping the annotations.

We can automate this by adding a translation module.

People can continue to annotate the translated prompts and answers, which will help eliminate bad translations, for example.

jeonsworld commented 1 year ago

I strongly agree!!!

We can imagine Open-Assistant becoming multilingual by automating en->x through machine translation.

lainisourgod commented 1 year ago

I guess it would be a good way to grow the pool of non-English tasks. In Russian we currently have <50 tasks open, while the English page has >500.

One way of using translation, I guess, is to translate the initial prompts. When you present a prompt with the same meaning in a different language, you can get a very different answer.

Also, I think it's a good idea for the model to see multiple ways of answering the same question, so that hopefully the Assistant will better adapt its answers to each user's context.

huu4ontocord commented 1 year ago

How might this be implemented: m2m100 or something else? Run on a periodic basis (not in real time, I think)? Who will volunteer to implement it?

ParisNeo commented 1 year ago

Yes, I was thinking of something like a cron job that runs periodically.

We can add a new flag to prompts saying whether they have already been translated; then, when the scheduler runs, it only translates those that have not been. We have to design it so that we keep track of the languages each prompt has been translated into, so that if we decide to add a new language, the scheduler updates everything.
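A minimal sketch of that bookkeeping, assuming each prompt stores the set of languages it has already been translated into rather than a single boolean flag. The names `Prompt`, `translated_into`, and `pending_translations` are made up for illustration and are not part of the Open-Assistant schema.

```python
from dataclasses import dataclass, field


@dataclass
class Prompt:
    """Hypothetical record; not the real Open-Assistant schema."""
    prompt_id: int
    text: str
    lang: str                                        # base language of the prompt
    translated_into: set = field(default_factory=set)


def pending_translations(prompts, target_langs):
    """Yield (prompt, lang) pairs the scheduler still has to translate."""
    for prompt in prompts:
        for lang in sorted(target_langs):
            if lang != prompt.lang and lang not in prompt.translated_into:
                yield prompt, lang


prompts = [
    Prompt(1, "Bonjour", "fr", translated_into={"en"}),
    Prompt(2, "Hello", "en"),
]

# First run with two supported languages.
first_run = [(p.prompt_id, lang) for p, lang in pending_translations(prompts, {"en", "fr"})]
print(first_run)   # [(2, 'fr')]

# Adding German later makes every prompt pending again, but only for "de".
second_run = [(p.prompt_id, lang) for p, lang in pending_translations(prompts, {"en", "fr", "de"})]
print(second_run)  # [(1, 'de'), (2, 'de'), (2, 'fr')]
```

Storing the set of finished languages instead of a yes/no flag is what lets a newly added language trigger work without redoing the existing translations.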

Also, we need to add a judgement step for the automatically translated content, so that users can rate whether a translation is OK or not. I was also thinking that every message should have a base language and translations, so that we can also grab useful content written in languages other than English.
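One way to picture a message with a base language, its translations, and user judgements on each translation. This is only a sketch: `Message`, `Translation`, and the simple ok / not ok vote are illustrative assumptions, not the project's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Translation:
    lang: str
    text: str
    votes: list = field(default_factory=list)    # True means "translation is ok"

    def approval(self):
        """Share of positive votes, or None if nobody has rated it yet."""
        if not self.votes:
            return None
        return sum(self.votes) / len(self.votes)


@dataclass
class Message:
    text: str
    base_lang: str                                # language the author wrote in
    translations: dict = field(default_factory=dict)

    def add_translation(self, lang, text):
        self.translations[lang] = Translation(lang, text)

    def rejected(self, threshold=0.5):
        """Languages whose rated translations fall below the threshold."""
        return [
            lang for lang, tr in self.translations.items()
            if tr.approval() is not None and tr.approval() < threshold
        ]


msg = Message("Quelle est la capitale de la France ?", base_lang="fr")
msg.add_translation("en", "What is the capital of France?")
msg.add_translation("de", "Was ist das Kapital von Frankreich?")  # wrong sense of "capital"
msg.translations["en"].votes += [True, True]
msg.translations["de"].votes += [False, True, False]
print(msg.rejected())  # ['de']
```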

I was writing French prompts when I had this idea. You can multiply your data many times over like this.

ParisNeo commented 1 year ago

These days I have too much work, and I am still new to this project, so I guess someone with much more experience could do that. It is basic database modification plus the use of one of the available translation models in Python. A draft shouldn't be too hard to build. It will need a lot of testing, but the advantages make it worth it.

ParisNeo commented 1 year ago

If no one wants to do it, I can create a branch and start working on a translation module, but I may need help integrating it into the main project.

ParisNeo commented 1 year ago

As Thanos said: Fine! I'll do it myself.

But I'll need some help.

Here is my plan: I've forked the project and created a branch, automatic_data_translator.

I am a Python coder, so I've built a Python class that takes a text input and translates it into a second language. It is very simple code using the MBart50 model from Facebook (Meta). The model is integrated into the transformers package, so it is really easy to use. When you execute the code, the models are downloaded and you can translate whatever you want.

I have already built the MBartTranslator class in auto_translate/mbart_translator.py. I have also added example code that shows how it translates from one non-English language to another.
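For readers who have not looked at the branch, here is a rough sketch of what such a wrapper could look like. It is not the code from auto_translate/mbart_translator.py: the class name `SketchTranslator`, its methods, and the reduced language set are hypothetical. It assumes the `transformers` and `torch` packages are installed and uses the facebook/mbart-large-50-many-to-many-mmt checkpoint, which is downloaded the first time `translate` is called.

```python
class SketchTranslator:
    """Hypothetical wrapper around MBart50; not the class from the branch."""

    # Small subset of the MBart50 language codes, enough for the example.
    LANGS = {"en_XX", "fr_XX", "de_DE", "es_XX", "ru_RU", "ar_AR"}

    def __init__(self, model_name="facebook/mbart-large-50-many-to-many-mmt"):
        self.model_name = model_name
        self._model = None
        self._tokenizer = None

    def _load(self):
        # Imported lazily so the object can be created without loading weights.
        from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

        if self._model is None:
            self._tokenizer = MBart50TokenizerFast.from_pretrained(self.model_name)
            self._model = MBartForConditionalGeneration.from_pretrained(self.model_name)

    def translate(self, text, src_lang, tgt_lang):
        # Validate the codes before touching the (large) model.
        for code in (src_lang, tgt_lang):
            if code not in self.LANGS:
                raise ValueError(f"unsupported language code: {code}")
        self._load()
        self._tokenizer.src_lang = src_lang
        encoded = self._tokenizer(text, return_tensors="pt")
        generated = self._model.generate(
            **encoded,
            # Force the decoder to start with the target-language token.
            forced_bos_token_id=self._tokenizer.convert_tokens_to_ids(tgt_lang),
        )
        return self._tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

With the packages installed, `SketchTranslator().translate("Bonjour tout le monde", "fr_XX", "de_DE")` would go from French to German without passing through English. Passing a different `model_name` is how a one-to-many or many-to-one checkpoint could be swapped in.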

The code supports the MBart50 language set, 52 codes in total: Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)

The translation is not perfect, but I have shown how this can be done across multiple languages using MBart50. You can select another model path. We can also use a one-to-many model if we just want to go from English to other languages, which is noticeably better. Or we can even use a one-to-one model if we select the right one. These models can be found on Hugging Face; they are free to use, and the code I've written can use any one of them.

Now, the interesting part is what comes next. I am not yet familiar with the database structure, so I am asking if someone can do this in my place:

We create a special user called translator. This user is used when creating new prompts or answers through translation, and it should be excluded from the leaderboard.

The idea is that before translating a text, the translator user verifies that the text was not written by itself. Otherwise we can fall into infinite loops.

So it just scans the database for untranslated messages, translates each one into all the other languages using the code I've added, and then posts the result with itself as the author.

This should be executed periodically, say every night. It also lets us know what was written by real users and what was written by the translator. Translations can be criticised by people using the usual rating tools.
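The flow described above could be sketched like this, with an in-memory list standing in for the database and a stub in place of the real model. Every name here (`Message`, `TRANSLATOR_USER`, `run_nightly_job`, `fake_translate`) is an assumption made for illustration; the real Open-Assistant tables will look different.

```python
from dataclasses import dataclass

TRANSLATOR_USER = "translator"            # hypothetical special account
LANGS = ["en", "fr", "de"]


@dataclass
class Message:
    author: str
    lang: str
    text: str
    source_id: int = None                 # index of the original, set on translations


def fake_translate(text, src, tgt):
    """Stub standing in for a real translation model."""
    return f"[{src}->{tgt}] {text}"


def run_nightly_job(db, translate=fake_translate):
    """Translate every human-written message into the languages it lacks."""
    # (original index, language) pairs that the translator has already posted.
    done = {(m.source_id, m.lang) for m in db if m.author == TRANSLATOR_USER}
    new = []
    for idx, msg in enumerate(db):
        if msg.author == TRANSLATOR_USER:
            continue                      # never translate our own output
        for tgt in LANGS:
            if tgt != msg.lang and (idx, tgt) not in done:
                new.append(Message(TRANSLATOR_USER, tgt,
                                   translate(msg.text, msg.lang, tgt), idx))
    db.extend(new)
    return len(new)


db = [Message("alice", "en", "Hello"), Message("bob", "fr", "Salut")]
first = run_nightly_job(db)
second = run_nightly_job(db)
print(first)   # 4
print(second)  # 0, nothing left to translate
```

Skipping messages authored by the translator account is what prevents the loop: the second run finds nothing new, even though the database now contains the translated messages.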

I need a wizard who knows the database to do that.

Anyone up for the challenge?

For now, I'll open a pull request for my branch. If someone wants to push this further, please do.

eihli commented 1 year ago

Is this still an active/valid TODO? ParisNeo, I noticed your PR was closed after a comment that the auto-translation quality might not be up to par. And since that date, there have been many other translation-related PRs.

I just came across this while looking to join in on some contributions. I'm concluding this TODO is safe to close as stale/won't-do and just wanted to make note of that to whoever manages this project board.

ParisNeo commented 1 year ago

Hi there. I'm sorry for neglecting this for so long. I was working on my personal projects, lollms and lollms-webui, and had no time to go further. Today I have way better translators than MBart. They use quantized Llama 2, Falcon, or one of the 520 models in the lollms zoo. Maybe I can write a script that uses lollms to translate Open Assistant's text. But since lollms has become huge, I don't know if I'll ever have time for that.