Closed by yurivict 9 years ago
Hi,
Character set detection cannot be 100% accurate. uchardet returns the most likely character set, and the longer the input text is, the more accurate the result will be.
UTF-8 is pretty deterministic: a byte sequence is either valid UTF-8 or it is not, and in this case it is not. With other encodings, e.g. 8-bit encodings, it is often hard to determine which one is in use. For some texts, several encodings would qualify.
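To illustrate the "either valid UTF-8 or not" point, here is a minimal Python sketch of a strict validity check. It is not how uchardet works internally; the function name `is_valid_utf8` and the sample bytes are made up for illustration and are not taken from the file in this report.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if `data` decodes as strict UTF-8."""
    try:
        # errors="strict" raises on any malformed byte sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples, not the bytes from the reported file.
print(is_valid_utf8("héllo".encode("utf-8")))  # well-formed multi-byte sequence
print(is_valid_utf8(b"caf\xe9 au lait"))       # lone 0xE9 lead byte, no continuation bytes
```

A detector could run a check like this first and only consider UTF-8 as a candidate when it passes.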
uchardet could just as well return a list of possible answers in general; that would be more accurate.
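A rough Python sketch of what "a list of possible answers" could look like, assuming the simplest possible criterion: keep every encoding that decodes the bytes without error. This is only an illustration, not uchardet's algorithm or API; `candidate_encodings`, the encoding list, and the sample bytes are all hypothetical.

```python
def candidate_encodings(data: bytes,
                        encodings=("utf-8", "iso-8859-1", "cp1251", "koi8_r")):
    """Return every encoding from `encodings` that decodes `data` without error."""
    candidates = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        candidates.append(name)
    return candidates


# Hypothetical sample: a Russian word encoded as KOI8-R.
sample = "Привет".encode("koi8_r")
print(candidate_encodings(sample))
```

For this sample, UTF-8 is ruled out by its strict structure, while several 8-bit encodings accept the same bytes. That ambiguity is why a real detector has to rank candidates by likelihood, for example from character frequencies.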
uchardet is not a stand-alone project. It is only a wrapper around Mozilla's universalchardet module, so I cannot fix this as a package maintainer.
The file has these bytes:
Please note that it has 3 non-ASCII areas:
However, uchardet determines that it is UTF-8:
FreeBSD file(1) identifies this file as:
I am not sure what it should detect for this text, but it certainly isn't UTF-8.