Include Kirundi (Rundi - run_Latn) language in the No Language Left Behind (NLLB) project

🚀 Feature Request

Include Kirundi (Rundi) language in the No Language Left Behind (NLLB) project

Motivation

As researchers and developers from Burundi, we are deeply concerned about the absence of Kirundi (also known as Rundi) in the No Language Left Behind (NLLB) project. Kirundi is the national language of Burundi and is spoken by over 11 million people worldwide, including large diaspora communities. Its exclusion from NLLB significantly hampers our ability to develop inclusive, localized solutions for Kirundi speakers globally.

The lack of Kirundi in NLLB creates several critical issues:

Limited access to information: Kirundi speakers struggle to access vital information in their native language, especially in areas like health, education, and technology.
Hindered software development: Local developers face significant challenges in creating applications and services tailored to Kirundi speakers, limiting innovation and economic growth in our region.
Digital divide: The absence of Kirundi in major language models like NLLB widens the digital divide, leaving our community behind in the rapidly advancing world of AI and natural language processing.
Cultural preservation: Without proper representation in language models, there's a risk of Kirundi losing its digital presence, potentially impacting its long-term preservation and evolution.

Pitch

We propose the inclusion of Kirundi in the NLLB project. This addition would:

Enable accurate translation to and from Kirundi, facilitating better communication and information exchange for millions of speakers.
Empower local developers to create more sophisticated, language-specific applications and services.
Enhance natural language understanding capabilities for Kirundi, opening doors for advanced AI applications in areas such as voice recognition, text-to-speech, and sentiment analysis.
Contribute to the digital preservation of Kirundi, ensuring its relevance in the digital age.
Align with NLLB's mission of language inclusivity and bridging the gap for underrepresented languages.

Alternatives

While alternatives are limited, some developers have attempted to:

Use closely related languages like Kinyarwanda as a proxy, but this leads to inaccuracies and doesn't fully capture the nuances of Kirundi.
Develop smaller, less efficient language models specifically for Kirundi, but these lack the resources and scale of NLLB.
Rely on human translation, which is time-consuming, expensive, and not scalable for large-scale applications.

These alternatives are insufficient and emphasize the need for Kirundi's inclusion in a comprehensive project like NLLB.

Additional context

Kirundi is not just a language; it's a carrier of our culture, history, and identity. Its inclusion in NLLB would be a significant step towards digital equity and would open up numerous opportunities for innovation and development in Burundi and for Kirundi speakers worldwide.

We have a growing tech community eager to leverage advanced language models. The inclusion of Kirundi in NLLB would catalyze numerous projects and potentially transform our digital landscape.

Furthermore, Burundi's unique linguistic situation, with Kirundi as the primary language alongside French and English, presents an interesting case study for multilingual societies and could provide valuable data for improving NLLB's capabilities in similar contexts.

We are ready and willing to collaborate in any way possible to facilitate this inclusion, including providing language data, expert knowledge, and testing support.

Hi @labKnowledge! First, Rundi is already present in the NLLB-200 model: the NLLB paper includes run_Latn in the language list on page 15. You can also try our translation interface, co-branded with UNESCO and Hugging Face, at https://huggingface.co/spaces/UNESCO/nllb, where translation for Rundi is supported.
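
For anyone who wants to check the run_Latn support locally rather than through the hosted interface, here is a minimal sketch using the Hugging Face `transformers` library. The checkpoint name, the default language pair, and the `translate` helper are assumptions for illustration and do not come from this thread; adjust them to whichever NLLB-200 checkpoint you actually use.

```python
# Sketch only: the checkpoint name below is an assumption, not taken from the thread.
# Requires `pip install transformers torch` and downloads the model on first use.

RUNDI = "run_Latn"    # language code for Rundi, as listed in the NLLB paper
ENGLISH = "eng_Latn"
CHECKPOINT = "facebook/nllb-200-distilled-600M"  # assumed public NLLB-200 checkpoint


def translate(texts, src_lang=ENGLISH, tgt_lang=RUNDI, checkpoint=CHECKPOINT):
    """Translate a list of sentences with an NLLB-200 checkpoint (hypothetical helper)."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    # NLLB picks the output language from the first generated token,
    # so the target language code is forced as the beginning-of-sequence token.
    target_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    output_ids = model.generate(
        **inputs, forced_bos_token_id=target_id, max_new_tokens=128
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Example call (downloads the checkpoint, so it is left commented out):
# print(translate(["Clean water is essential for good health."]))
```

Swapping the `src_lang` and `tgt_lang` arguments gives the Rundi-to-English direction with the same checkpoint.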

Second, there are currently no specific plans to release a new version of NLLB. However, there are concrete steps you could take to improve Kirundi translation in future releases of other multilingual translation models:

Translate the NLLB-Seed dataset into Kirundi. Including seed data has been reported to greatly improve translation quality for Wikipedia-like domains. You can contribute the translation to the Open Language Data Initiative: https://oldi.org.
Collect other parallel texts for Kirundi that could serve as training data. You could add references to them in the Rundi language card at OLDI.
If you have parallel data, you could fine-tune an NLLB model (using e.g. this tutorial) to improve its Kirundi proficiency.
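
As a rough illustration of the data collection step, the sketch below pairs aligned English and Kirundi sentences and writes them out as JSON Lines. The record layout (a `translation` object keyed by language code), the helper names, and the file name are assumptions made for this example; the tutorial you follow or OLDI may expect a different format.

```python
import json


def build_parallel_records(src_lines, tgt_lines, src_lang="eng_Latn", tgt_lang="run_Latn"):
    """Pair aligned sentences, dropping empty pairs and exact duplicates."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    seen = set()
    records = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs where either side is empty
        if (src, tgt) in seen:
            continue  # skip exact duplicate pairs
        seen.add((src, tgt))
        # Assumed layout: one object per pair, keyed by language code.
        records.append({"translation": {src_lang: src, tgt_lang: tgt}})
    return records


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example usage with placeholder text (replace with real aligned sentences):
# records = build_parallel_records(english_sentences, kirundi_sentences)
# write_jsonl(records, "eng_run_parallel.jsonl")
```

Filtering empty and duplicate pairs up front is a small design choice that keeps noisy alignments out of the training set; stricter checks such as length ratios or language identification can be layered on later.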
