fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Incorporating Underrepresented Languages: A Focus on Low-Resource Languages #70

Open farinamhz opened 5 months ago

farinamhz commented 5 months ago

In this step, we address the challenge of incorporating underrepresented, low-resource languages. NLP systems are predominantly oriented towards high-resource languages such as English, Chinese, and Spanish, which benefit from extensive digital resources, including large text corpora, and therefore dominate NLP research. In contrast, low-resource languages such as Lao and Sanskrit have few digital resources. Our aim is to highlight these underrepresented languages, with Lao and Sanskrit as the candidates from this group, and to recognize and explore their unique linguistic features. By integrating these languages, we strive to develop a truly language-agnostic system and embrace the full spectrum of global linguistic diversity.

farinamhz commented 5 months ago

For the backtranslation phase of our experiments with these languages, we use NLLB. The language codes are lao_Laoo for Lao and san_Deva for Sanskrit. The results of these experiments will be integrated into LADy version 0.2.0.0, which already contains results from the NLLB translator.
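
As a rough illustration of the backtranslation step with these two language codes, here is a minimal sketch. It is not LADy's actual implementation: the function names `backtranslate` and `nllb_translator`, the `eng_Latn` source language, and the `facebook/nllb-200-distilled-600M` checkpoint are all assumptions made for the example. The translator is passed in as a callable so the round trip can be checked without downloading a model.

```python
from typing import Callable, Dict, List

# FLORES-200 style codes used by NLLB. The two pivot codes come from this issue;
# eng_Latn as the source language is an assumption for this sketch.
SOURCE_LANG = "eng_Latn"
PIVOT_LANGS: Dict[str, str] = {"lao": "lao_Laoo", "sanskrit": "san_Deva"}

# A translator takes (texts, src_lang, tgt_lang) and returns the translated texts.
Translator = Callable[[List[str], str, str], List[str]]


def backtranslate(texts: List[str], pivot: str, translate: Translator,
                  source: str = SOURCE_LANG) -> List[str]:
    """Translate source -> pivot, then pivot -> source, and return the round trip."""
    forward = translate(texts, source, pivot)
    return translate(forward, pivot, source)


def nllb_translator(model_name: str = "facebook/nllb-200-distilled-600M") -> Translator:
    """Hypothetical helper wrapping a Hugging Face translation pipeline.

    Assumes the `transformers` package is installed and the checkpoint can be
    downloaded; the import is local so the rest of the sketch runs without it.
    """
    from transformers import pipeline

    pipe = pipeline("translation", model=model_name)

    def translate(texts: List[str], src_lang: str, tgt_lang: str) -> List[str]:
        outputs = pipe(texts, src_lang=src_lang, tgt_lang=tgt_lang)
        return [out["translation_text"] for out in outputs]

    return translate
```

With a real model, usage would look like `backtranslate(reviews, PIVOT_LANGS["lao"], nllb_translator())`, where `reviews` is a list of review sentences. Injecting the translator keeps the round-trip logic independent of the model, so the same function could be reused with another translator for comparison.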