BYVoid / uchardet

An encoding detector library ported from Mozilla

uchardet wrongly determines the text as UTF-8 #2

yurivict closed this issue 9 years ago

yurivict commented 9 years ago

The file has these bytes:

00000000  78 78 78 e2 80 99 78 78  78 0a 63 68 61 72 20 27  |xxx...xxx.char '|
00000010  e2 27 20 28 69 6e 0a 4d  69 6c 6f c5 a1 5f 46 6f  |.' (in.Milo.._Fo|
00000020  72 6d 61 6e 0a                                    |rman.|

Please note that it has 3 non-ASCII areas:

1. e2 80 99: U+2019 RIGHT SINGLE QUOTATION MARK
2. e2: could start a 3-byte UTF-8 sequence, but the following bytes 27 20 are not valid continuation bytes, so this is not a valid UTF-8 sequence
3. c5 a1: U+0161 LATIN SMALL LETTER S WITH CARON
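
The three areas above can be checked by hand. The following is a minimal Python sketch, not part of uchardet; the helper name `is_continuation` is made up for illustration. It relies only on the UTF-8 rule that every byte after a lead byte must have the bit pattern 10xxxxxx.

```python
def is_continuation(byte: int) -> bool:
    """Return True if the byte has the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


# The three non-ASCII areas from the hex dump.
area1 = bytes.fromhex("e2 80 99")
area2 = bytes.fromhex("e2 27 20")
area3 = bytes.fromhex("c5 a1")

# Areas 1 and 3 decode to the code points named in the list.
print(area1.decode("utf-8") == "\u2019")
print(area3.decode("utf-8") == "\u0161")

# Area 2: e2 is a 3-byte lead byte, but neither 27 nor 20 is a continuation byte.
print([is_continuation(b) for b in area2[1:]])
```

The lead byte e2 is 1110 0010 in binary, which announces two continuation bytes. Both 27 and 20 start with a 0 bit, so they are plain ASCII and the sequence is broken.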

However, uchardet determines that it is UTF-8:

$ uchardet < xxx 
UTF-8

FreeBSD file(1) determines this file as:

$ file xxx 
xxx: C source, Non-ISO extended-ASCII text

I am not sure what encoding it should report for this text, but it certainly isn't valid UTF-8.

BYVoid commented 9 years ago

Hi,

Character set detection cannot be 100% accurate. What uchardet does is return the most likely character set. The longer the input text is, the more accurate the result will be.

yurivict commented 9 years ago

UTF-8 is pretty deterministic: a byte sequence is either valid UTF-8 or it is not, and in this case it is not. With other encodings, e.g. 8-bit ones, it is often hard to determine which one it is, and for some texts several encodings would qualify.
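
To illustrate the point about determinism, here is a small Python sketch that runs a strict UTF-8 decode over the bytes from the hex dump. It uses Python's built-in decoder rather than uchardet, and the function name `first_invalid_utf8_offset` is hypothetical.

```python
from typing import Optional

# The 37 bytes of the file, copied from the hex dump in the report.
DUMP = bytes.fromhex(
    "78 78 78 e2 80 99 78 78 78 0a 63 68 61 72 20 27"
    "e2 27 20 28 69 6e 0a 4d 69 6c 6f c5 a1 5f 46 6f"
    "72 6d 61 6e 0a"
)


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8, or None if valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


offset = first_invalid_utf8_offset(DUMP)
if offset is None:
    print("valid UTF-8")
else:
    print(f"invalid UTF-8 at offset 0x{offset:02x}")
```

The strict decoder rejects the file at the lone e2 byte that opens the second line of the dump, so a yes/no validity check is enough to rule out UTF-8 here.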

In general, uchardet could just as well return a list of possible answers; that would be more accurate.
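
As a rough sketch of what a list of possible answers could look like, the Python snippet below tries a few candidate codecs and keeps those that decode without error. This is not uchardet's API or algorithm; the function name `plausible_encodings` and the candidate list are assumptions chosen for illustration.

```python
from typing import Iterable, List


def plausible_encodings(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate codecs that decode the data without an error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this codec is ruled out
        accepted.append(name)
    return accepted


# The last two non-ASCII areas of the file: "Milo?_Forman" and the quoted e2.
sample = bytes.fromhex("4d 69 6c 6f c5 a1 5f 46 6f 72 6d 61 6e 0a 27 e2 27")
print(plausible_encodings(sample, ["utf-8", "cp1252", "latin-1"]))
```

Decoding without error only means a codec is not ruled out. Several 8-bit encodings accept these bytes, which is why a ranked list would need statistical scoring on top of this filter.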

BYVoid commented 9 years ago

uchardet is not a stand-alone project. It is only a wrapper around Mozilla's universalchardet module, so as a package maintainer I cannot fix this.