akihikodaki / cld3-ruby

cld3-ruby is an interface of Compact Language Detector v3 (CLD3) for Ruby.
Apache License 2.0

Bad identification for short input #44

Closed. mariozaizar closed this issue 7 months ago

mariozaizar commented 7 months ago

As in https://github.com/google/cld3/issues/31, I have a few examples where this gem (and/or cld3) performs poorly on short texts:

Another example from v3.6.0:

CLD3::NNetLanguageIdentifier.new(0, 1000).find_language("User ID FA1324102A6E2C72 How to add my name on leader board?")
=> #<struct CLD3::NNetLanguageIdentifier::Result language=:ja, probability=0.7837570905685425, reliable?=true, proportion=1.0, byte_ranges=[]>

CLD3::NNetLanguageIdentifier.new(0, 1000).find_language("AAAA AA A0000000A0A0AAA AAA AA AAA AA AAAA AA AAAAAA")
=> #<struct CLD3::NNetLanguageIdentifier::Result language=:ja, probability=0.7837570905685425, reliable?=true, proportion=1.0, byte_ranges=[]>

How in the world is that text detected as JA? 🤯

akihikodaki commented 7 months ago

Unfortunately, the capability of this gem is limited to what https://github.com/google/cld3 provides. You have to find an alternative if it is not satisfactory.
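
For readers who want to stay with this gem, one possible caller-side workaround is to filter the input and the result before trusting them. The sketch below is not part of cld3-ruby or CLD3: the module name `GuardedDetection`, the helper `strip_id_like_tokens`, and both thresholds are hypothetical and would need tuning. It only assumes a detector object that responds to `find_language` and returns `nil` or a result with `language`, `probability` and `reliable?`, as in the output shown above.

```ruby
# Hypothetical caller-side guard; NOT part of cld3-ruby or CLD3.
# `detector` is any object responding to find_language(text) and returning
# nil or a result with #language, #probability and #reliable?
# (for example CLD3::NNetLanguageIdentifier.new(0, 1000)).
module GuardedDetection
  MIN_LETTERS = 20        # assumed threshold, tune for your data
  MIN_PROBABILITY = 0.9   # assumed threshold, tune for your data

  module_function

  # Drop whitespace-separated tokens that contain a digit
  # (user IDs, hashes, order codes and similar noise).
  def strip_id_like_tokens(text)
    text.split.reject { |token| token.match?(/\d/) }.join(" ")
  end

  # Returns a language symbol, or nil when the input is too short
  # or the detector's answer is not confident enough.
  def detect(detector, text)
    cleaned = strip_id_like_tokens(text)
    # Too few letters left: do not even ask the detector.
    return nil if cleaned.scan(/\p{L}/).size < MIN_LETTERS

    result = detector.find_language(cleaned)
    return nil if result.nil?
    return nil unless result.reliable? && result.probability >= MIN_PROBABILITY

    result.language
  end
end
```

With the gem loaded, this could be called as `GuardedDetection.detect(CLD3::NNetLanguageIdentifier.new(0, 1000), text)`. Whether it helps for the examples above depends on what CLD3 returns for the cleaned text; the guard only turns low-confidence or very short inputs into `nil` instead of a misleading language code.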