argilla-io / notus

Notus is a collection of fine-tuned LLMs using SFT, DPO, SFT+DPO, and/or any other RLHF techniques, while always keeping a data-first approach
MIT License

【question】Are you planning to support multi-language models? #14

Closed · Yongtae723 closed 10 months ago

Yongtae723 commented 10 months ago

Hi everyone!

Thanks for opening a great project to produce high-quality data and LLMs like Notux and Notus! I appreciate your commitment to publishing everything you do. I have no doubt that you are making great progress in advancing AI worldwide.

Let me ask the following questions:

- Do you have any plan to support multi-language models?
- Or can I contribute somehow to make an LLM model which supports Japanese? I tested it in Japanese and the results were close to perfect; I felt Notus has basic knowledge of Japan, but its Japanese is not natural.

I think creating multi-language models can accelerate the development of open LLMs on a world scale.

I hope you can give me a reply! Thanks!

alvarobartt commented 10 months ago

Hi there @Yongtae723, sorry I missed your issue earlier! And thanks for the kind words about our work 🤗

> Do you have any plan to support multi-language models?

Yes, indeed Mixtral is multilingual (English, Spanish, German, Italian, and French), and we may also look into other languages. For us, curating datasets in Spanish and English is the easiest, because those are our native or first languages.

> Or can I contribute somehow to make an LLM model which supports Japanese? I tested it in Japanese and the results were close to perfect; I felt Notus has basic knowledge of Japan, but its Japanese is not natural.

It may hallucinate, or generate tokens that do appear in the unsupervised pre-training data, but the base models for Notus and Notux, Mistral and Mixtral respectively, were pre-trained primarily on English documents in the former case, and on English, Spanish, German, Italian, and French documents in the latter. That being said, some characters, words, or sentences from other languages may have slipped into the pre-training data, so the models can generate them, but they haven't seen enough data to do so successfully.
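If you want a rough feel for how well a base model covers Japanese, one quick proxy is how aggressively its tokenizer fragments Japanese text compared to English. The sketch below is purely illustrative (not from this thread): it assumes the `transformers` library and the `mistralai/Mistral-7B-v0.1` checkpoint as the base tokenizer, which you can swap for whichever model you're testing.

```python
# Minimal sketch: compare tokenizer "fertility" on English vs Japanese.
# Languages that are poorly represented in pre-training tend to be split
# into many small subword tokens per character.
from transformers import AutoTokenizer  # pip install transformers sentencepiece

# Assumed base checkpoint; replace with the model you are evaluating.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

samples = {
    "en": "Hello, how are you today?",
    "ja": "こんにちは、今日はお元気ですか？",
}

for lang, text in samples.items():
    ids = tok(text, add_special_tokens=False)["input_ids"]
    # A higher tokens-per-character ratio suggests weaker coverage.
    print(f"{lang}: {len(ids)} tokens for {len(text)} chars "
          f"({len(ids) / len(text):.2f} tokens/char)")
```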

I'm afraid we won't be able to produce fine-tuning data in Japanese ourselves, as we're not Japanese speakers, but if that's something you and/or the Japanese ML community are willing to do, we can always coordinate and collaboratively build a dataset using Argilla.
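As a rough idea of what such a collaboration could look like, here is a minimal sketch using the Argilla 1.x Python SDK's `FeedbackDataset`. The server URL, API key, workspace, field names, questions, and dataset name are all placeholders for illustration, not anything agreed in this thread.

```python
# Illustrative sketch: a Japanese preference-annotation dataset in Argilla 1.x.
# All names and credentials below are placeholders.
import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="owner.apikey")

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="response"),
    ],
    questions=[
        rg.RatingQuestion(
            name="quality",
            description="How natural is the Japanese response?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.TextQuestion(
            name="corrected-response",
            description="Optionally rewrite the response in natural Japanese.",
            required=False,
        ),
    ],
    guidelines="Rate the fluency and correctness of the Japanese response.",
)

# Add a sample record for annotators to label in the Argilla UI.
dataset.add_records(
    [rg.FeedbackRecord(fields={"prompt": "東京の天気は？", "response": "東京は晴れです。"})]
)
dataset.push_to_argilla(name="japanese-preference-demo", workspace="admin")
```

Community annotators could then rate and correct responses in the Argilla UI, and the exported records would serve as SFT or DPO data for a Japanese fine-tune.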

That being said, we will certainly continue our data curation efforts, remaining focused on high-quality Spanish and English datasets, and eventually fine-tune some LLMs with the data we generate and/or curate.