LlmKira / fast-langdetect

⚡️ 80x faster language detection with Fasttext | Split text by language for TTS
MIT License

fast-langdetect 🚀

Overview

fast-langdetect provides fast, accurate language detection built on FastText, a library developed by Facebook. The package is reported to be about 80x faster than conventional language-detection methods, with roughly 95% accuracy (see the Benchmark section).

It supports Python versions 3.9 to 3.12.

This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging.

For more information on the underlying FastText model, refer to the official documentation: FastText Language Identification.

[!NOTE] Even in low-memory mode, this library requires more than 200 MB of memory.

Installation 💻

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect

Usage 🖥️

For optimal performance and accuracy in language detection, use detect(text, low_memory=False) to load the larger model.

The model will be downloaded to the /tmp/fasttext-langdetect directory upon first use.
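
The call described above can be wrapped so that low-confidence results fall back to a default value. This is a minimal sketch, not part of fast-langdetect: `detect_or_default`, `min_score`, `default`, and the stand-in `fake_detect` are hypothetical names used for illustration. It assumes only what this README documents, namely that `detect(text, low_memory=False)` returns a dict with `lang` and `score` keys. The detector is passed in as an argument so the helper can be tried without downloading the model.

```python
from typing import Callable, Dict, Union

Result = Dict[str, Union[str, float]]


def detect_or_default(
    text: str,
    detect_fn: Callable[..., Result],
    min_score: float = 0.5,
    default: str = "und",
) -> str:
    """Return the detected language code, or `default` when the score is low.

    `detect_fn` is assumed to behave like fast_langdetect.detect:
    it accepts (text, low_memory=...) and returns {'lang': ..., 'score': ...}.
    """
    # low_memory=False requests the larger model, as recommended above.
    result = detect_fn(text, low_memory=False)
    if float(result["score"]) >= min_score:
        return str(result["lang"])
    return default


def fake_detect(text: str, low_memory: bool = True) -> Result:
    """Hypothetical stand-in detector, used only to demonstrate the helper offline."""
    return {"lang": "en", "score": 0.9 if len(text) > 5 else 0.1}


print(detect_or_default("Hello, world!", fake_detect))  # en
print(detect_or_default("Hi", fake_detect))             # und
```

With the library installed, the real function can be passed instead of the stand-in: import `detect` from `fast_langdetect` and call `detect_or_default(text, detect)`. The 0.5 threshold is an arbitrary illustration and would need tuning for real input.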

Native API (Recommended)

from fast_langdetect import detect, detect_multilingual

# Single language detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.1520957201719284}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [
#     {'lang': 'ru', 'score': 0.39008623361587524},
#     {'lang': 'zh', 'score': 0.18235979974269867},
# ]
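
Since `detect_multilingual` returns a list of candidates with scores, callers will often want to keep only the confident ones. The sketch below shows one way to do that, assuming the list-of-dicts shape shown in the output above. `confident_languages` and `min_score` are hypothetical names, not part of the library, and the threshold values are arbitrary.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def confident_languages(results: List[Result], min_score: float = 0.2) -> List[str]:
    """Keep language codes whose score is at least `min_score`, best first."""
    kept = [r for r in results if float(r["score"]) >= min_score]
    kept.sort(key=lambda r: float(r["score"]), reverse=True)
    return [str(r["lang"]) for r in kept]


# Sample copied from the output shown above.
sample = [
    {"lang": "ru", "score": 0.39008623361587524},
    {"lang": "zh", "score": 0.18235979974269867},
]

print(confident_languages(sample))       # ['ru']
print(confident_languages(sample, 0.1))  # ['ru', 'zh']
```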

Convenient detect_language Function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好,世界!"))
# Output: ZH
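
Because `detect_language` returns a plain upper-case code, its result can serve directly as a lookup key, for example when choosing a TTS voice. The following is a sketch under stated assumptions: the `VOICES` table, the voice names, and `pick_voice` are placeholders invented for illustration and do not refer to any real TTS engine or to this library's API.

```python
from typing import Dict

# Hypothetical voice table; the values are placeholders, not real voice names.
VOICES: Dict[str, str] = {
    "EN": "voice-english",
    "RU": "voice-russian",
    "ZH": "voice-chinese",
}


def pick_voice(
    code: str,
    voices: Dict[str, str] = VOICES,
    default: str = "voice-default",
) -> str:
    """Look up a voice for a language code, ignoring case and surrounding spaces."""
    return voices.get(code.strip().upper(), default)


print(pick_voice("EN"))  # voice-english
print(pick_voice("ru"))  # voice-russian
print(pick_voice("FR"))  # voice-default
```

Normalizing the key to upper case means the same table also works with the lower-case codes returned in the `lang` field of the native API.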

Splitting Text by Language 🌐

For text splitting based on language, please refer to the split-lang repository.

Benchmark 📊

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.

References 📚

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}