argilla-io / notus

Notus is a collection of fine-tuned LLMs using SFT, DPO, SFT+DPO, and/or any other RLHF techniques, while always keeping a data-first approach
MIT License

【question】Are you planning to support multi-language models? #14

Closed · Yongtae723 closed 10 months ago

Yongtae723 commented 10 months ago

Hi everyone!

Thanks for opening a great project to produce high-quality data and LLMs like Notux and Notus! I appreciate your commitment to publishing everything you do. I have no doubt that you are making great progress in advancing AI worldwide.

Let me ask the following questions:

- Do you have any plan to support multi-language models?
- Or can I contribute somehow to make an LLM model which supports Japanese? I tested it in Japanese and the results were close to perfect; I felt Notus has basic knowledge of Japan, but its Japanese is not natural.

I think creating multi-language models can accelerate the development of open LLMs on a world scale.

I hope you can give me a reply! Thanks!

alvarobartt commented 10 months ago

Hi there @Yongtae723, sorry I missed your issue earlier! And thanks for the kind words about our work 🤗

> Do you have any plan to support multi-language models?

Yes, indeed Mixtral is multilingual (English, Spanish, German, Italian, and French), and we may also look into other languages. For us, curating datasets in Spanish and English is the easiest, because those are our native or first languages.

> Or can I contribute somehow to make an LLM model which supports Japanese? I tested it in Japanese and the results were close to perfect; I felt Notus has basic knowledge of Japan, but its Japanese is not natural.

It may hallucinate, or generate tokens that do appear in the unsupervised pre-training data, but the base models for Notus and Notux, Mistral and Mixtral respectively, were pre-trained primarily on English documents in the former case, and on English, Spanish, German, Italian, and French documents in the latter. That being said, some characters, words, or sentences from other languages may have slipped into the pre-training data, so the models can generate them, but they haven't seen enough data to do so successfully.
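If you want a rough feel for how well a base model covers Japanese, one quick proxy is how aggressively its tokenizer fragments Japanese text compared to English. The sketch below is purely illustrative (not from this thread): it assumes the `transformers` library and the `mistralai/Mistral-7B-v0.1` checkpoint as the base tokenizer, which you can swap for whichever model you're testing.

```python
# Minimal sketch: compare tokenizer "fertility" on English vs Japanese.
# Languages that are poorly represented in pre-training tend to be split
# into many small subword tokens per character.
from transformers import AutoTokenizer  # pip install transformers sentencepiece

# Assumed base checkpoint; replace with the model you are evaluating.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

samples = {
    "en": "Hello, how are you today?",
    "ja": "こんにちは、今日はお元気ですか？",
}

for lang, text in samples.items():
    ids = tok(text, add_special_tokens=False)["input_ids"]
    # A higher tokens-per-character ratio suggests weaker coverage.
    print(f"{lang}: {len(ids)} tokens for {len(text)} chars "
          f"({len(ids) / len(text):.2f} tokens/char)")
```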

I'm afraid we won't be able to produce fine-tuning data in Japanese ourselves, as we're not Japanese speakers, but if that's something you and/or the Japanese ML community are willing to do, we can always coordinate and collaboratively build a dataset using Argilla.
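As a rough idea of what such a collaboration could look like, here is a minimal sketch using the Argilla 1.x Python SDK's `FeedbackDataset`. The server URL, API key, workspace, field names, questions, and dataset name are all placeholders for illustration, not anything agreed in this thread.

```python
# Illustrative sketch: a Japanese preference-annotation dataset in Argilla 1.x.
# All names and credentials below are placeholders.
import argilla as rg

rg.init(api_url="http://localhost:6900", api_key="owner.apikey")

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="response"),
    ],
    questions=[
        rg.RatingQuestion(
            name="quality",
            description="How natural is the Japanese response?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.TextQuestion(
            name="corrected-response",
            description="Optionally rewrite the response in natural Japanese.",
            required=False,
        ),
    ],
    guidelines="Rate the fluency and correctness of the Japanese response.",
)

# Add a sample record for annotators to label in the Argilla UI.
dataset.add_records(
    [rg.FeedbackRecord(fields={"prompt": "東京の天気は？", "response": "東京は晴れです。"})]
)
dataset.push_to_argilla(name="japanese-preference-demo", workspace="admin")
```

Community annotators could then rate and correct responses in the Argilla UI, and the exported records would serve as SFT or DPO data for a Japanese fine-tune.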

That being said, we will certainly continue our data curation efforts, remaining focused on high-quality Spanish and English datasets, and eventually fine-tune some LLMs with the data we generate and/or curate.