CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

File detected as Windows-1250, but is UTF-8 #108

Open tobbi opened 4 years ago

tobbi commented 4 years ago

I'm using UTF.Unknown 2.3.0. The following file is detected as Windows-1250, but it is UTF-8:

csv_test_correct_GZ.zip

rstm-sf commented 4 years ago

Hello, @tobbi !

Thank you for the report.

Could you attach a text file instead? Why did you choose zip? Is the zip archive what you pass to the detector as input?

tobbi commented 4 years ago

Sorry, my bad. It was originally a CSV file, and GitHub wouldn't accept that. Here's the file with the extension changed to .txt:

csv_test_correct_GZ.txt

rstm-sf commented 4 years ago

Thanks for clarifying.

At first glance, I think the result is normal. Why? The detection algorithm is statistical, so the more varied the input data, the more accurate the final result. Details can be found in the article "A composite approach to language/encoding detection".

But we should try to improve the result :)
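To illustrate why a statistical detector can confuse the two encodings, here is a minimal Python sketch. It is not UTF.Unknown code, and the sample string and the `decodes_as` helper are hypothetical. It only shows that bytes which are valid UTF-8 can also be a legal windows-1250 byte sequence, so a single-byte prober has no hard reason to reject them and has to fall back on character statistics.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without errors."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample text; it is not taken from the attached file.
sample = "Größe café"
utf8_bytes = sample.encode("utf-8")

# The UTF-8 bytes decode cleanly as UTF-8, and also as windows-1250
# (Python calls that codec "cp1250"), although the latter yields mojibake.
utf8_ok = decodes_as(utf8_bytes, "utf-8")
cp1250_ok = decodes_as(utf8_bytes, "cp1250")
mojibake = utf8_bytes.decode("cp1250")

# The reverse does not hold: a lone windows-1250 byte for a non-ASCII
# letter is not a valid UTF-8 sequence.
cp1250_bytes = "ž".encode("cp1250")
reverse_ok = decodes_as(cp1250_bytes, "utf-8")

print(utf8_ok, cp1250_ok, reverse_ok)
print(repr(mojibake))
```

Since both decodings succeed for the UTF-8 input, the choice between them comes down to how plausible each prober finds the resulting text, which is where short or uniform input can tip the result the wrong way.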


Status Logs:

SBCS: Detected windows-1250 with confidence of 0.7738685
Get confidence:
-- new match found: confidence 0.01, index 0, charset windows-1251.
-- new match found: confidence 0.18598664, index 6, charset iso-8859-7.
-- new match found: confidence 0.7133932, index 15, charset iso-8859-1.
-- new match found: confidence 0.71340704, index 18, charset iso-8859-1.
-- new match found: confidence 0.76677626, index 23, charset iso-8859-1.
-- new match found: confidence 0.7738685, index 86, charset windows-1250.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.01: [windows-1251] SBCS: 0.01 [windows-1251]
SBCS 0.01: [koi8-r] SBCS: 0.01 [koi8-r]
SBCS 0: [iso-8859-5] SBCS: 0.00 [iso-8859-5]
SBCS 0.01: [x-mac-cyrillic] SBCS: 0.01 [x-mac-cyrillic]
SBCS 0.01: [ibm866] SBCS: 0.01 [ibm866]
SBCS 0.01: [ibm855] SBCS: 0.01 [ibm855]
SBCS 0.18598664: [iso-8859-7] SBCS: 0.1859866 [iso-8859-7]
SBCS 0.18598664: [windows-1253] SBCS: 0.1859866 [windows-1253]
SBCS 0: [iso-8859-5] SBCS: 0.00 [iso-8859-5]
SBCS 0.01: [windows-1251] SBCS: 0.01 [windows-1251]
SBCS 0: [windows-1255] HEB: 0 - 0 [Logical-Visual score]
SBCS 0: [windows-1255] SBCS: 0.00 [windows-1255]
SBCS 0: [windows-1255] SBCS: 0.00 [windows-1255]
SBCS 0.09991017: [tis-620] SBCS: 0.09991017 [tis-620]
SBCS 0.09991017: [iso-8859-11] SBCS: 0.09991017 [iso-8859-11]
SBCS 0.7133932: [iso-8859-1] SBCS: 0.7133932 [iso-8859-1]
SBCS 0.6674997: [iso-8859-15] SBCS: 0.6674997 [iso-8859-15]
SBCS 0.7133932: [windows-1252] SBCS: 0.7133932 [windows-1252]
SBCS 0.71340704: [iso-8859-1] SBCS: 0.713407 [iso-8859-1]
SBCS 0.67082536: [iso-8859-15] SBCS: 0.6708254 [iso-8859-15]
SBCS 0.71340704: [windows-1252] SBCS: 0.713407 [windows-1252]
SBCS 0.6861101: [iso-8859-2] SBCS: 0.6861101 [iso-8859-2]
SBCS 0.6861101: [windows-1250] SBCS: 0.6861101 [windows-1250]
SBCS 0.76677626: [iso-8859-1] SBCS: 0.7667763 [iso-8859-1]
SBCS 0.76677626: [windows-1252] SBCS: 0.7667763 [windows-1252]
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.717128: [iso-8859-9] SBCS: 0.717128 [iso-8859-9]
SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256] SBCS: 0.00 [windows-1256]
SBCS 0.40016073: [viscii] SBCS: 0.4001607 [viscii]
SBCS 0.44124976: [windows-1258] SBCS: 0.4412498 [windows-1258]
SBCS 0.71854687: [iso-8859-15] SBCS: 0.7185469 [iso-8859-15]
SBCS 0.7641578: [iso-8859-1] SBCS: 0.7641578 [iso-8859-1]
SBCS 0.7641578: [windows-1252] SBCS: 0.7641578 [windows-1252]
SBCS 0.71640146: [iso-8859-13] SBCS: 0.7164015 [iso-8859-13]
SBCS 0.6377162: [iso-8859-10] SBCS: 0.6377162 [iso-8859-10]
SBCS 0.6736411: [iso-8859-4] SBCS: 0.6736411 [iso-8859-4]
SBCS 0.71818155: [iso-8859-13] SBCS: 0.7181816 [iso-8859-13]
SBCS 0.6363546: [iso-8859-10] SBCS: 0.6363546 [iso-8859-10]
SBCS 0.6753149: [iso-8859-4] SBCS: 0.6753149 [iso-8859-4]
SBCS 0.666065: [iso-8859-1] SBCS: 0.666065 [iso-8859-1]
SBCS 0.666065: [iso-8859-9] SBCS: 0.666065 [iso-8859-9]
SBCS 0.62630904: [iso-8859-15] SBCS: 0.626309 [iso-8859-15]
SBCS 0.666065: [windows-1252] SBCS: 0.666065 [windows-1252]
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.6366351: [windows-1250] SBCS: 0.6366351 [windows-1250]
SBCS 0.6366351: [iso-8859-2] SBCS: 0.6366351 [iso-8859-2]
SBCS 0.72143143: [x-mac-ce] SBCS: 0.7214314 [x-mac-ce]
SBCS 0.72143143: [ibm852] SBCS: 0.7214314 [ibm852]
SBCS 0.6434225: [windows-1250] SBCS: 0.6434225 [windows-1250]
SBCS 0.64008415: [iso-8859-2] SBCS: 0.6400841 [iso-8859-2]
SBCS 0.7291228: [x-mac-ce] SBCS: 0.7291228 [x-mac-ce]
SBCS 0.7253399: [ibm852] SBCS: 0.7253399 [ibm852]
SBCS 0.58494663: [windows-1250] SBCS: 0.5849466 [windows-1250]
SBCS 0.5881849: [iso-8859-2] SBCS: 0.5881849 [iso-8859-2]
SBCS 0.61615247: [iso-8859-13] SBCS: 0.6161525 [iso-8859-13]
SBCS 0.58494663: [iso-8859-16] SBCS: 0.5849466 [iso-8859-16]
SBCS 0.66285837: [x-mac-ce] SBCS: 0.6628584 [x-mac-ce]
SBCS 0.65958494: [ibm852] SBCS: 0.6595849 [ibm852]
SBCS 0.7628341: [iso-8859-1] SBCS: 0.7628341 [iso-8859-1]
SBCS 0.71730226: [iso-8859-4] SBCS: 0.7173023 [iso-8859-4]
SBCS 0.71730226: [iso-8859-9] SBCS: 0.7173023 [iso-8859-9]
SBCS 0.7628341: [iso-8859-13] SBCS: 0.7628341 [iso-8859-13]
SBCS 0.71730226: [iso-8859-15] SBCS: 0.7173023 [iso-8859-15]
SBCS 0.7628341: [windows-1252] SBCS: 0.7628341 [windows-1252]
SBCS 0.76252055: [iso-8859-1] SBCS: 0.7625206 [iso-8859-1]
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.76252055: [iso-8859-9] SBCS: 0.7625206 [iso-8859-9]
SBCS 0.71700746: [iso-8859-15] SBCS: 0.7170075 [iso-8859-15]
SBCS 0.76252055: [windows-1252] SBCS: 0.7625206 [windows-1252]
SBCS 0.6695262: [windows-1250] SBCS: 0.6695262 [windows-1250]
SBCS 0.6695262: [iso-8859-2] SBCS: 0.6695262 [iso-8859-2]
SBCS 0.7052443: [iso-8859-13] SBCS: 0.7052443 [iso-8859-13]
SBCS 0.6695262: [iso-8859-16] SBCS: 0.6695262 [iso-8859-16]
SBCS 0.7587035: [x-mac-ce] SBCS: 0.7587035 [x-mac-ce]
SBCS 0.7587035: [ibm852] SBCS: 0.7587035 [ibm852]
SBCS 0.76380235: [windows-1252] SBCS: 0.7638023 [windows-1252]
SBCS 0.76380235: [windows-1257] SBCS: 0.7638023 [windows-1257]
SBCS 0.71821266: [iso-8859-4] SBCS: 0.7182127 [iso-8859-4]
SBCS 0.76380235: [iso-8859-13] SBCS: 0.7638023 [iso-8859-13]
SBCS 0.71821266: [iso-8859-15] SBCS: 0.7182127 [iso-8859-15]
SBCS 0.6575037: [iso-8859-1] SBCS: 0.6575037 [iso-8859-1]
SBCS 0.6575037: [iso-8859-9] SBCS: 0.6575037 [iso-8859-9]
SBCS 0.61825883: [iso-8859-15] SBCS: 0.6182588 [iso-8859-15]
SBCS 0.6575037: [windows-1252] SBCS: 0.6575037 [windows-1252]
SBCS 0.7738685: [windows-1250] SBCS: 0.7738685 [windows-1250]
SBCS 0.7738685: [iso-8859-2] SBCS: 0.7738685 [iso-8859-2]
SBCS 0.7738685: [iso-8859-16] SBCS: 0.7738685 [iso-8859-16]
SBCS 0.75962406: [ibm852] SBCS: 0.7596241 [ibm852]
SBCS 0.66994256: [windows-1250] SBCS: 0.6699426 [windows-1250]
SBCS 0.66994256: [iso-8859-2] SBCS: 0.6699426 [iso-8859-2]
SBCS 0.66994256: [iso-8859-16] SBCS: 0.6699426 [iso-8859-16]
SBCS 0.75917524: [x-mac-ce] SBCS: 0.7591752 [x-mac-ce]
SBCS 0.75917524: [ibm852] SBCS: 0.7591752 [ibm852]
SBCS 0.76376295: [iso-8859-1] SBCS: 0.763763 [iso-8859-1]
SBCS 0.7181756: [iso-8859-4] SBCS: 0.7181756 [iso-8859-4]
SBCS 0.76376295: [iso-8859-9] SBCS: 0.763763 [iso-8859-9]
SBCS 0.7181756: [iso-8859-15] SBCS: 0.7181756 [iso-8859-15]
SBCS 0.76376295: [windows-1252] SBCS: 0.763763 [windows-1252]
SBCS Group found best match [windows-1250] confidence 0.7738685.
MBCS: Detected utf-8 with confidence of 0.7525
Get confidence:
-- new match found: confidence 0.7525, index 0, charset utf-8.
Get confidence done.
MBCS Group Prober --------begin status
MBCS 0.7525: [utf-8]
MBCS 0.01: [shift-jis]
MBCS 0.01: [euc-jp]
MBCS 0.01: [gb18030]
MBCS 0.01: [euc-kr]
MBCS 0.01: [cp949]
MBCS 0.01: [big5]
MBCS inactive: euc-tw (i.e. confidence is too low).
MBCS Group found best match [utf-8] confidence 0.7525.
Latin1Prober: Detected windows-1252 with confidence of 0.43269232
Latin1Prober: 0.43269232 [windows-1252]
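
The three group results in the logs above can be compared directly. The following Python sketch is only an illustration and assumes that the final answer is simply the candidate with the highest confidence, which is what the logged values suggest; it is not the library's actual selection code, and the `candidates` dictionary is a hypothetical structure filled in by hand from the log lines.

```python
# Best match reported by each prober group, copied from the status logs.
# The dictionary layout is an assumption made for this illustration.
candidates = {
    "windows-1250": 0.7738685,   # SBCS group
    "utf-8": 0.7525,             # MBCS group
    "windows-1252": 0.43269232,  # Latin1Prober
}

# Assumed selection rule: take the candidate with the highest confidence.
best = max(candidates, key=candidates.get)

# Rank all candidates to see how close the runner-up is.
ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
margin = ranked[0][1] - ranked[1][1]

print(f"best match: {best}")
print(f"runner-up: {ranked[1][0]}, behind by {margin:.4f}")
```

Under that assumption, windows-1250 wins over utf-8 by only about 0.02, so a small change in the input statistics could flip the outcome.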