Open bpasero opened 7 years ago
Yeah, this is really tricky. In most cases an encoding cannot be determined exactly, so detection relies on heuristics, which is why it will never be 100% reliable. The smaller the text, the worse the result, because there is not enough data to analyze statistically, as you can see here: https://github.com/aadsm/jschardet/issues/30
jschardet.detect returns the encoding with the best confidence, but you can set jschardet.Constants._debug = true; to see the confidence of all the other encodings. Can you check which other encodings it detects?
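A minimal sketch of the debugging step described above. It assumes the flag is reachable as Constants._debug, exactly as stated in the comment. The helper name detectWithDebug and the injected detector parameter are hypothetical, used here so the snippet does not hard-code how the module is loaded.

```javascript
// Hypothetical helper: run one detection with debug output switched on.
// `detector` is assumed to expose `detect()` and `Constants._debug`,
// as the comment above describes for jschardet.
function detectWithDebug(detector, input) {
  const previous = detector.Constants._debug;
  detector.Constants._debug = true; // ask the probers to log their confidences
  try {
    return detector.detect(input); // still returns only the best guess
  } finally {
    detector.Constants._debug = previous; // restore the earlier setting
  }
}
```

With the real module this would presumably be called as detectWithDebug(require('jschardet'), fs.readFileSync(path)). Whether detect accepts a Buffer directly or needs a binary string may depend on the jschardet version.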
@aadsm here is the output:
EUC-TW prober hit error at byte 207
UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active
UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active
EUC-JP confidence 0.99
windows-1251 confidence = 0.01
KOI8-R confidence = 0.01
ISO-8859-5 confidence = 0
MacCyrillic confidence = 0.01
IBM866 confidence = 0.01
IBM855 confidence = 0.01
ISO-8859-7 confidence = 0
windows-1253 confidence = 0
ISO-8859-5 confidence = 0
windows-1251 confidence = 0.01
ISO-8859-2 confidence = 0.8511313029424628
windows-1250 confidence = 0.8511313029424628
TIS-620 confidence = 0
windows-1255 confidence = 0
windows-1255 confidence = 0.01
windows-1255 confidence = 0.01
windows-1251 confidence = 0.01
KOI8-R confidence = 0.01
ISO-8859-5 confidence = 0
MacCyrillic confidence = 0.01
IBM866 confidence = 0.01
IBM855 confidence = 0.01
ISO-8859-7 confidence = 0
windows-1253 confidence = 0
ISO-8859-5 confidence = 0
windows-1251 confidence = 0.01
ISO-8859-2 confidence = 0.8511313029424628
windows-1250 confidence = 0.8511313029424628
TIS-620 confidence = 0
windows-1255 confidence = 0
windows-1255 confidence = 0.01
windows-1255 confidence = 0.01
ISO-8859-2 confidence 0.8511313029424628
windows-1252 confidence 0.95
UTF-8 confidence = 0.505
SHIFT_JIS confidence = 0.01
EUC-JP confidence = 0.99
GB2312 confidence = 0
EUC-KR confidence = 0.99
Big5 confidence = 0
EUC-TW not active
{ encoding: 'EUC-JP', confidence: 0.99 }
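To make the output above easier to read, here is a small sketch that ranks the three lines printed without an equals sign, which appear to be the per-group winners (this reading of the log is an assumption). The rankCandidates helper is hypothetical and is not jschardet's actual selection code. It only illustrates how narrow the gap between the winner and the runner-up is.

```javascript
// Confidence values copied from the debug output above.
const candidates = [
  { encoding: 'EUC-JP', confidence: 0.99 },
  { encoding: 'ISO-8859-2', confidence: 0.8511313029424628 },
  { encoding: 'windows-1252', confidence: 0.95 },
];

// Hypothetical helper: sort a copy of the list, highest confidence first.
function rankCandidates(list) {
  return [...list].sort((a, b) => b.confidence - a.confidence);
}

const ranked = rankCandidates(candidates);
// Gap between the best and second-best guess; a small gap suggests an ambiguous result.
const margin = ranked[0].confidence - ranked[1].confidence;
console.log(ranked[0].encoding, ranked[1].encoding, margin.toFixed(2));
```

A caller could treat a small margin as a signal to fall back to a default encoding or to ask the user, but any such threshold would be a policy choice outside the library.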
AFAIK, Visual Studio Code uses jschardet, and some of us users are experiencing the very same problem in VS Code: https://github.com/Microsoft/vscode/issues/4891
I was reporting this on behalf of VS Code.
The following file is detected as EUC-JP even though it is not. This seems to be caused by a single ü inside that file.
File: QuietLight.tmTheme.txt
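One possible explanation for why a single ü can mislead the detector, sketched under the assumption that the file stores it as UTF-8 (the report does not say which encoding the file actually uses). The two range checks below are simplified structural tests written for illustration, not code taken from jschardet.

```javascript
// Structural checks only: they test byte ranges, not whether a character is assigned.
function isUtf8TwoByte(lead, trail) {
  // 2-byte UTF-8: lead 0xC2-0xDF, continuation 0x80-0xBF
  return lead >= 0xc2 && lead <= 0xdf && trail >= 0x80 && trail <= 0xbf;
}

function isEucJpTwoByte(lead, trail) {
  // EUC-JP 2-byte (JIS X 0208) form: both bytes in 0xA1-0xFE
  return lead >= 0xa1 && lead <= 0xfe && trail >= 0xa1 && trail <= 0xfe;
}

// Assumption: "ü" (U+00FC) is stored as UTF-8, i.e. the bytes 0xC3 0xBC.
const [lead, trail] = Buffer.from('\u00fc', 'utf8');
const utf8Ok = isUtf8TwoByte(lead, trail);   // true
const eucJpOk = isEucJpTwoByte(lead, trail); // true
console.log({ utf8Ok, eucJpOk });
```

Because the same two bytes are well-formed in both encodings, a file that is otherwise pure ASCII gives a statistical detector very little evidence to separate them.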