LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
37.07k stars, 3.24k forks

Supporting new language #2974

Status: Closed (amirjalaly closed 1 year ago)

amirjalaly commented 1 year ago

How can support for a new language be added? The chat performs very well in English, but it lacks many languages, including my native one, Farsi (Persian). How can we add a language to the system ourselves? Suppose, as a small-scale scenario, that we can collect Persian data and a sentence-ranking dataset ourselves.

someone13574 commented 1 year ago

To add a language, you simply need to translate the site. Here are a few pull requests that show how to do it.

https://github.com/LAION-AI/Open-Assistant/pull/1390/files
https://github.com/LAION-AI/Open-Assistant/pull/2271/files
https://github.com/LAION-AI/Open-Assistant/pull/2386/files
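
As a rough illustration of what such a translation involves: assuming the site keeps its UI strings in per-language JSON files of key/string pairs (an assumption here; the linked pull requests show the actual layout), a small helper can list the keys a new locale still lacks. The function name `missing_keys` and the sample keys are hypothetical.

```python
import json


def missing_keys(reference: dict, translation: dict, prefix: str = "") -> list[str]:
    """Return dotted key paths present in `reference` but absent from `translation`."""
    missing = []
    for key, value in reference.items():
        path = f"{prefix}{key}"
        if key not in translation:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(translation[key], dict):
            # Recurse into nested groups of strings.
            missing.extend(missing_keys(value, translation[key], prefix=f"{path}."))
    return missing


# Hypothetical sample data, not taken from the real locale files.
en = json.loads('{"title": "Dashboard", "tasks": {"rank": "Rank replies", "reply": "Write a reply"}}')
fa = json.loads('{"title": "داشبورد", "tasks": {"rank": "رتبه بندی"}}')

print(missing_keys(en, fa))
```

Running a check like this against the English files before opening a pull request makes it easier to see which strings still need a Farsi translation.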

amirjalaly commented 1 year ago

I mean adding support for a new language to the LLM, not the site.

olliestanley commented 1 year ago

> I mean adding support for a new language to the LLM, not the site.

The two are equivalent. If you translate the site, OA will start collecting data in the new language, and the LLM could then be tuned with that data in the future.

pourmand1376 commented 1 year ago

I think that amount of data is not enough. For an LLM to understand Farsi, it needs to see at least 10 GB of Persian text, which is fully available on Wikipedia. Are there any plans to officially support Farsi?

stefangrotz commented 1 year ago

If you have data in Farsi, you can add an import script in the data folder: https://github.com/LAION-AI/Open-Assistant/tree/main/data/datasets

Unfortunately, Wikipedia is only good for training base models, not for fine-tuning dialogue models like OA. For OA you need dialogue data. But you could extend the Tatoeba import script to cover Farsi relatively easily.
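
A minimal sketch of the idea, assuming the input is tab-separated English/Farsi sentence pairs. This is not the project's actual Tatoeba import script; the function name `pairs_to_dialogue`, the prompt templates, and the output field names (`INSTRUCTION`, `RESPONSE`, `SOURCE`) are hypothetical and would need to match whatever the data folder expects.

```python
import csv
import io

# Hypothetical prompt templates that turn a sentence pair into a short dialogue turn.
PROMPT_TEMPLATES = [
    'Translate this sentence into Farsi: "{en}"',
    'How do you say "{en}" in Persian?',
]


def pairs_to_dialogue(tsv_text: str) -> list[dict]:
    """Convert 'english<TAB>farsi' lines into instruction/response records."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in rows:
        # Skip malformed lines: wrong column count or an empty side.
        if len(row) != 2 or not row[0].strip() or not row[1].strip():
            continue
        en, fa = row[0].strip(), row[1].strip()
        # Rotate through the templates so the prompts are not all identical.
        template = PROMPT_TEMPLATES[len(records) % len(PROMPT_TEMPLATES)]
        records.append(
            {"INSTRUCTION": template.format(en=en), "RESPONSE": fa, "SOURCE": "tatoeba"}
        )
    return records


# Hypothetical sample input; the last line has no tab and is dropped.
sample = "Hello.\tسلام.\nThank you.\tمتشکرم.\nbroken line without a tab\n"

for record in pairs_to_dialogue(sample):
    print(record)
```

Translation pairs only yield one narrow kind of dialogue, so data produced this way would complement, not replace, conversations collected through the translated site.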