BYVoid / uchardet

An encoding detector library ported from Mozilla

lower case german umlauts in utf-8 are detected incorrectly #41

Closed klemens-u closed 6 years ago

klemens-u commented 6 years ago

Test in shell:

echo -n ä | uchardet -> TIS-620

echo -n ö | uchardet -> TIS-620

echo -n ü | uchardet -> ISO-8859-7

Upper case works OK: Ä, Ö, Ü, and also ß.

System: Ubuntu 16.04

Jehan commented 6 years ago

As written in the README, uchardet has moved. This has not been the official repository for at least 2 years now. Uchardet is now a Freedesktop project. Please open reports there: https://gitlab.freedesktop.org/uchardet/uchardet/issues

That being said, uchardet works statistically. It is basically impossible to detect a "language" or a charset from a single character, which could be just about anything. At the binary level, a single non-ASCII character is only a few bytes, and those same bytes are valid text in many different encodings, so the result can only be random (if it returned the right encoding, that would be the strange part!). So yes, uchardet needs a sentence, or at the very least several words, to have enough data to guess the right encoding.
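To illustrate the point, here is a minimal Python sketch. It is not how uchardet works internally; it only checks which of a few hand-picked encodings can decode the bytes without error. The candidate list and the helper name `plausible_encodings` are assumptions made for this example.

```python
# Candidate encodings are an illustrative choice, not uchardet's internal list.
CANDIDATES = ["utf-8", "tis-620", "iso8859-7", "latin-1"]


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidates under which `data` decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    for ch in "äöü":
        raw = ch.encode("utf-8")  # two bytes per character, e.g. c3 a4 for ä
        print(ch, raw.hex(), plausible_encodings(raw))
        # Show how the very same bytes read under each candidate encoding.
        for name in plausible_encodings(raw):
            print("   ", name, "->", raw.decode(name))
```

Every candidate accepts the two bytes, so validity alone cannot single out UTF-8. A statistical detector has to rank the candidates by how typical the byte sequences are for each one, and two bytes give it almost nothing to rank with.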

So I will close this report. Feel free to open a new one on the Freedesktop GitLab if relevant, bearing in mind that uchardet will never be able to detect the encoding of a single character (and technically no system ever will, except by chance).