Mimino666 / langdetect

Port of Google's language-detection library to Python.

Chinese language misclassified #98

Open · johnbumgarner opened this issue 2 years ago

johnbumgarner commented 2 years ago

I use langdetect to classify the language of a website when the site does not have a lang attribute in its HTML. Occasionally langdetect misclassifies a website written in Chinese. For example, this website:

https://news.sina.com.cn/c/xl/2022-01-23/doc-ikyamrmz6973062.shtml

is classified by langdetect as Korean rather than Chinese.
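Roughly, the workflow looks like this (a minimal sketch; the page_language helper and the use of requests/BeautifulSoup are illustrative, not part of langdetect):

import langdetect
import requests
from bs4 import BeautifulSoup

def page_language(url):
    # Hypothetical helper: prefer the page's declared lang attribute,
    # and fall back to langdetect only when it is missing.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    html_tag = soup.find('html')
    if html_tag and html_tag.get('lang'):
        return html_tag['lang']
    return langdetect.detect(soup.get_text(separator=' ', strip=True))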

This is the title of the article -- 相约北京 习近平邀世界“共同见证”_手机新浪网 (roughly: "Meeting in Beijing: Xi Jinping invites the world to 'witness together'" -- Mobile Sina)

lang_code = langdetect.detect('相约北京 习近平邀世界“共同见证”_手机新浪网')
print(lang_code)
ko

Why does langdetect classify the language of this website as Korean rather than Chinese?
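For reference, langdetect is non-deterministic by default, which makes results like this hard to reproduce. A minimal sketch for inspecting the candidate probabilities, using langdetect's documented DetectorFactory.seed and detect_langs:

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # fix the RNG so repeated runs give the same result
print(detect_langs('相约北京 习近平邀世界“共同见证”_手机新浪网'))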

johnbumgarner commented 2 years ago

I see that this is a known issue with langdetect -- https://github.com/Mimino666/langdetect/issues?q=Chinese

Why has this issue not been resolved after 7 years?

myfingerhurt commented 1 year ago

Not even remotely close.

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
# (roughly: "You can use the open-source Python library Requests to send
# MP3 audio files via a Telegram Bot")
detect(text)        # => 'ca'
detect_langs(text)  # => [ca:0.7142840022485835, en:0.14285863189692477, vi:0.1428560781690179]

I found a partial workaround via ChatGPT; before using it, you have to fix the ko profile.

import jieba  # Chinese word segmentation, so detect() sees word-sized chunks
from langdetect import detect, LangDetectException

CANDIDATE_LANGS = ('zh-cn', 'en', 'fr', 'ja', 'ko', 'ru', 'es')

def detect_mixed_language(text):
    # Vote per segmented word: jieba keeps CJK runs together and splits
    # the Latin-script parts, so detect() gets word-sized inputs.
    lang_count = dict.fromkeys(CANDIDATE_LANGS, 0)
    for word in jieba.cut(text):
        try:
            lang = detect(word)
        except LangDetectException:
            # Raised for segments with no detectable features
            # (whitespace, punctuation, digits); just skip them.
            continue
        if lang in lang_count:
            lang_count[lang] += 1
    # Collapses the original if/elif ladder: return the candidate with the
    # most votes; on exact ties max() prefers the earlier candidate.
    return max(lang_count, key=lang_count.get)
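
Applied to the sentence above, this should come back as Chinese:

text = "你可以使用开源的 Python库 Requests,通过Telegram Bot发送MP3音频文件"
print(detect_mixed_language(text))  # expected: zh-cn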
Dobatymo commented 1 year ago

I suggest using pycld2 instead. It has some issues as well, but in my opinion none as grave as langdetect's.
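For reference, pycld2's detect() returns a reliability flag, the number of bytes analysed, and a ranked tuple of (name, code, percent, score) candidates; a minimal sketch on the title from this issue:

import pycld2 as cld2

is_reliable, bytes_found, details = cld2.detect('相约北京 习近平邀世界“共同见证”_手机新浪网')
print(is_reliable, details[0])  # top candidate: (languageName, languageCode, percent, score)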

myfingerhurt commented 1 year ago

Thank you @Dobatymo. I actually tried pycld2, but I got stuck resolving its dependencies, so I came back to this one.