brianmario / charlock_holmes

Character encoding detection, brought to you by ICU
MIT License

Small Strings #160

Open etm opened 3 years ago

etm commented 3 years ago

CharlockHolmes::EncodingDetector.detect_all("Timeout: 2")

results in

{:type=>:text, :encoding=>"IBM424_ltr", :ruby_encoding=>"binary", :confidence=>27, :language=>"he"},
{:type=>:text, :encoding=>"UTF-8", :ruby_encoding=>"UTF-8", :confidence=>15}, ....

In general, it seems to try too hard on small strings: it often ranks esoteric (wrong) encodings above obvious ones, as with the 27 vs. 15 confidence scores above. Is it possible to tweak this? Is this behavior intended?
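One possible caller-side workaround, sketched here as an assumption rather than anything the library provides: when the top guess has low confidence, prefer UTF-8 if the bytes are valid UTF-8. The helper name `pick_encoding` and the threshold of 50 are hypothetical; only the hash shape is taken from the `detect_all` output above.

```ruby
# Hypothetical helper, NOT part of charlock_holmes.
# `candidates` is an array of hashes shaped like the detect_all output above.
# `min_confidence` (50) is an arbitrary, assumed cutoff; tune it for your data.
def pick_encoding(str, candidates, min_confidence: 50)
  best = candidates.max_by { |c| c[:confidence].to_i }
  return best if best && best[:confidence].to_i >= min_confidence

  # Low-confidence guess: fall back to UTF-8 when the bytes are valid UTF-8.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  if utf8.valid_encoding?
    return candidates.find { |c| c[:encoding] == "UTF-8" } ||
           { type: :text, encoding: "UTF-8", ruby_encoding: "UTF-8", confidence: 0 }
  end

  best
end

# Candidates copied from the output reported above (elided entries omitted).
candidates = [
  { type: :text, encoding: "IBM424_ltr", ruby_encoding: "binary", confidence: 27, language: "he" },
  { type: :text, encoding: "UTF-8", ruby_encoding: "UTF-8", confidence: 15 }
]

picked = pick_encoding("Timeout: 2", candidates)
puts picked[:encoding]
```

In real use the `candidates` array would come from `CharlockHolmes::EncodingDetector.detect_all(str)`. This only papers over the ranking on the caller's side; it does not answer whether the detector's behavior on short input is intended.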