aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)
GNU Lesser General Public License v2.1

EUC-JP wrongly detected for a file that contains a German umlaut #29

Open bpasero opened 7 years ago

bpasero commented 7 years ago

The following file is detected as EUC-JP even though it is not. This seems to be caused by a single ü inside the file.

File: QuietLight.tmTheme.txt
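
A minimal sketch of why a lone ü could mislead a multi-byte prober, assuming the file is UTF-8 encoded (the thread does not say so explicitly). In UTF-8, ü is stored as the two bytes 0xC3 0xBC, and both fall in the 0xA1 to 0xFE range that EUC-JP and EUC-KR use for their main two-byte characters, so the pair is also a structurally valid EUC sequence. The helper name isEucTwoBytePair is hypothetical and is not part of jschardet.

```javascript
// Hypothetical helper, not part of jschardet: checks whether two bytes form a
// structurally valid two-byte EUC sequence (both bytes in 0xA1..0xFE).
// It ignores the 0x8E/0x8F prefixed forms that EUC-JP also defines.
function isEucTwoBytePair(lead, trail) {
  const inRange = (b) => b >= 0xa1 && b <= 0xfe;
  return inRange(lead) && inRange(trail);
}

// Assumption: the file is UTF-8, so "ü" (U+00FC) is stored as 0xC3 0xBC.
const umlautBytes = Array.from(new TextEncoder().encode("\u00fc"));
const validAsEuc = isEucTwoBytePair(umlautBytes[0], umlautBytes[1]);

console.log(umlautBytes.map((b) => b.toString(16)), validAsEuc);
```

By contrast, a windows-1252 encoded ü is the single byte 0xFC; followed by an ASCII byte it would fail this check, which is why the UTF-8 assumption matters here.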

aadsm commented 7 years ago

Yeah, this is really tricky. In most cases encoding detection cannot be exact; it relies on heuristics, which is why it will never be 100% reliable. Also, the shorter the text, the worse the result, because there is not enough data to analyze statistically, as you can see here: https://github.com/aadsm/jschardet/issues/30
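
To illustrate how little evidence a short or mostly-ASCII text provides, here is a sketch of the confidence formula used by the UTF-8 prober in python-chardet, as best recalled from its source. Treat the constants as an assumption and check the version you actually use. The function name utf8Confidence is hypothetical.

```javascript
// Sketch of chardet's UTF-8 prober confidence, recalled from the Python
// source; the constants are an assumption, not verified against jschardet.
const ONE_CHAR_PROB = 0.5;

function utf8Confidence(numMultiByteChars) {
  const unlike = 0.99;
  if (numMultiByteChars < 6) {
    // Each valid multi-byte character halves the remaining doubt.
    return 1 - unlike * Math.pow(ONE_CHAR_PROB, numMultiByteChars);
  }
  return unlike;
}

// Print the confidence for a few counts of multi-byte characters.
for (const n of [0, 1, 3, 6]) {
  console.log(n, utf8Confidence(n).toFixed(3));
}
```

Under this formula a text with exactly one multi-byte character scores about 0.505 for UTF-8, barely above a coin flip, while a prober that only validates byte ranges can already report a much higher value.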

jschardet.detect returns the encoding with the best confidence, but you can set jschardet.Constants._debug = true; to see the confidence of all the other encodings. Can you check which other encodings it detects?
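
The debugging step could look like the sketch below. It uses only the two things named in this comment, jschardet.detect and jschardet.Constants._debug. The wrapper detectWithDebug is a hypothetical helper, and the library is passed in as an argument so the snippet makes no assumption about how jschardet is loaded in your project.

```javascript
// Hypothetical wrapper: turns on jschardet's debug output for one call.
// `lib` is expected to be the jschardet module, e.g. require("jschardet").
function detectWithDebug(lib, input) {
  const previous = lib.Constants._debug;
  lib.Constants._debug = true; // log the confidence of every prober
  try {
    return lib.detect(input); // best guess as { encoding, confidence }
  } finally {
    lib.Constants._debug = previous; // restore the original setting
  }
}
```

With the real library this would be called as detectWithDebug(require("jschardet"), fileContents), where fileContents is whatever input you already pass to detect.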

bpasero commented 7 years ago

@aadsm here is the output:

EUC-TW prober hit error at byte 207

UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active

UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active

EUC-JP confidence 0.99

windows-1251 confidence = 0.01
KOI8-R confidence = 0.01
ISO-8859-5 confidence = 0
MacCyrillic confidence = 0.01
IBM866 confidence = 0.01
IBM855 confidence = 0.01
ISO-8859-7 confidence = 0
windows-1253 confidence = 0
ISO-8859-5 confidence = 0
windows-1251 confidence = 0.01
ISO-8859-2 confidence = 0.8511313029424628
windows-1250 confidence = 0.8511313029424628
TIS-620 confidence = 0
windows-1255 confidence = 0
windows-1255 confidence = 0.01
windows-1255 confidence = 0.01

windows-1251 confidence = 0.01
KOI8-R confidence = 0.01
ISO-8859-5 confidence = 0
MacCyrillic confidence = 0.01
IBM866 confidence = 0.01
IBM855 confidence = 0.01
ISO-8859-7 confidence = 0
windows-1253 confidence = 0
ISO-8859-5 confidence = 0
windows-1251 confidence = 0.01
ISO-8859-2 confidence = 0.8511313029424628
windows-1250 confidence = 0.8511313029424628
TIS-620 confidence = 0
windows-1255 confidence = 0
windows-1255 confidence = 0.01
windows-1255 confidence = 0.01

ISO-8859-2 confidence 0.8511313029424628
windows-1252 confidence 0.95

UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active

{ encoding: 'EUC-JP', confidence: 0.99 }

workflo commented 7 years ago

AFAIK Visual Studio Code uses jschardet, and some of us users experience the very same problem in VS Code: https://github.com/Microsoft/vscode/issues/4891

bpasero commented 7 years ago

I was reporting this on behalf of VS Code.