CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

Wrong identification for windows-1252 #42

Open msdobrescu opened 5 years ago

msdobrescu commented 5 years ago

Hello, I am trying to identify the encoding of a file that should be windows-1252, but the detector finds a better match for windows-1255. File: my.txt. It contains, for instance, the byte C5, which is Å in windows-1252, yet the file is identified as windows-1255, which does not contain that character at all.
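To make the point about byte C5 concrete, here is a minimal Python sketch. It uses Python's built-in `cp1252` and `cp1255` codecs as stand-ins for windows-1252 and windows-1255, not this library, and the helper `can_encode` is a hypothetical name introduced only for illustration.

```python
import unicodedata

# The single byte discussed above.
raw = bytes([0xC5])

# Decode the same byte under both code pages.
as_1252 = raw.decode("cp1252")
as_1255 = raw.decode("cp1255")

print(repr(as_1252), unicodedata.name(as_1252))
print(repr(as_1255), unicodedata.name(as_1255))


def can_encode(text, codec):
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The character Å is representable in cp1252 but not in cp1255.
print(can_encode("Å", "cp1252"), can_encode("Å", "cp1255"))
```

The byte itself is valid in both code pages, so a detector cannot reject windows-1255 on validity alone; it has to rely on how plausible the resulting text is.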

msdobrescu commented 5 years ago

It seems that uchardet has improved detection, so it should be imported into this project.

304NotModified commented 5 years ago

Do you mean the Mozilla Universal Charset Detector? Too bad this project is a refactor of a port, so re-porting could be a lot of work.

msdobrescu commented 5 years ago

Possibly, but it's almost unusable in my case. And there is no port of uchardet, which is now freedesktop's. Look here: https://gitlab.freedesktop.org/uchardet/uchardet. It would be worth adding more to it; too bad it is a bit too hardcoded.

msdobrescu commented 5 years ago

Would you accept some new languages ported from the uchardet project?

rstm-sf commented 4 years ago

Now, with v2.3.0:

Detected encoding iso-8859-1 with confidence 0.47388184.

From Status Log:

SBCS 0.47388184: [iso-8859-1] SBCS: 0.4738818 [iso-8859-1]

SBCS 0.47388184: [iso-8859-4] SBCS: 0.4738818 [iso-8859-4]

SBCS 0.47388184: [iso-8859-9] SBCS: 0.4738818 [iso-8859-9]

SBCS 0.47388184: [iso-8859-13] SBCS: 0.4738818 [iso-8859-13]

SBCS 0.47388184: [iso-8859-15] SBCS: 0.4738818 [iso-8859-15]

SBCS 0.47388184: [windows-1252] SBCS: 0.4738818 [windows-1252]
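The six lines above all carry the same confidence, so the reported iso-8859-1 is one of several tied candidates. As a rough sketch of how such ties can be spotted in a status log, the Python below groups entries by confidence. The regular expression and the function name `top_candidates` are assumptions made for this illustration and are not part of the library; the sample text is copied from the log in this comment.

```python
import re
from collections import defaultdict

# A few entries copied from the status log in this comment.
LOG = """\
SBCS 0.3023013: [ibm852] SBCS: 0.3023013 [ibm852]
SBCS 0.47388184: [iso-8859-1] SBCS: 0.4738818 [iso-8859-1]
SBCS 0.47388184: [iso-8859-4] SBCS: 0.4738818 [iso-8859-4]
SBCS 0.47388184: [iso-8859-9] SBCS: 0.4738818 [iso-8859-9]
SBCS 0.47388184: [iso-8859-13] SBCS: 0.4738818 [iso-8859-13]
SBCS 0.47388184: [iso-8859-15] SBCS: 0.4738818 [iso-8859-15]
SBCS 0.47388184: [windows-1252] SBCS: 0.4738818 [windows-1252]
"""

# Matches the leading "SBCS <confidence>: [<charset>]" part of each entry.
ENTRY = re.compile(r"SBCS (\d+(?:\.\d+)?): \[([^\]]+)\]")


def top_candidates(log_text):
    """Return (best confidence, charsets sharing it, in log order)."""
    by_conf = defaultdict(list)
    for conf, charset in ENTRY.findall(log_text):
        by_conf[float(conf)].append(charset)
    best = max(by_conf)
    return best, by_conf[best]


best, tied = top_candidates(LOG)
print(best, tied)
```

With a tie like this, which name gets reported presumably depends on the order in which the probers run; that is an inference from the log, not something confirmed here.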

Status Log
Get confidence:
-- new match found: confidence 0.020249203, index 0, charset windows-1251.
-- new match found: confidence 0.026152553, index 6, charset iso-8859-7.
-- new match found: confidence 0.04902641, index 11, charset windows-1255.
-- new match found: confidence 0.050912045, index 12, charset windows-1255.
-- new match found: confidence 0.093243085, index 15, charset iso-8859-1.
-- new match found: confidence 0.09324489, index 18, charset iso-8859-1.
-- new match found: confidence 0.14311144, index 21, charset iso-8859-2.
-- new match found: confidence 0.1763441, index 32, charset iso-8859-15.
-- new match found: confidence 0.24882, index 45, charset iso-8859-3.
-- new match found: confidence 0.3023013, index 59, charset ibm852.
-- new match found: confidence 0.47388184, index 60, charset iso-8859-1.
Get confidence done.
SBCS Group Prober --------begin status
SBCS 0.020249203: [windows-1251] SBCS: 0.0202492 [windows-1251]
SBCS 0.014343185: [koi8-r] SBCS: 0.01434319 [koi8-r]
SBCS 0: [iso-8859-5] SBCS: 0.00 [iso-8859-5]
SBCS 0.020249203: [x-mac-cyrillic] SBCS: 0.0202492 [x-mac-cyrillic]
SBCS 0: [ibm866] SBCS: 0.00 [ibm866]
SBCS 0.00659974: [ibm855] SBCS: 0.00659974 [ibm855]
SBCS 0.026152553: [iso-8859-7] SBCS: 0.02615255 [iso-8859-7]
SBCS 0.026152553: [windows-1253] SBCS: 0.02615255 [windows-1253]
SBCS 0: [iso-8859-5] SBCS: 0.00 [iso-8859-5]
SBCS 0.0031166344: [windows-1251] SBCS: 0.003116634 [windows-1251]
SBCS 0: [windows-1255] HEB: 0 - 0 [Logical-Visual score]
SBCS 0.04902641: [windows-1255] SBCS: 0.04902641 [windows-1255]
SBCS 0.050912045: [windows-1255] SBCS: 0.05091204 [windows-1255]
SBCS 0.013214358: [tis-620] SBCS: 0.01321436 [tis-620]
SBCS 0.013214358: [iso-8859-11] SBCS: 0.01321436 [iso-8859-11]
SBCS 0.093243085: [iso-8859-1] SBCS: 0.09324308 [iso-8859-1]
SBCS 0.093243085: [iso-8859-15] SBCS: 0.09324308 [iso-8859-15]
SBCS 0.093243085: [windows-1252] SBCS: 0.09324308 [windows-1252]
SBCS 0.09324489: [iso-8859-1] SBCS: 0.09324489 [iso-8859-1]
SBCS 0.09324489: [iso-8859-15] SBCS: 0.09324489 [iso-8859-15]
SBCS 0.09324489: [windows-1252] SBCS: 0.09324489 [windows-1252]
SBCS 0.14311144: [iso-8859-2] SBCS: 0.1431114 [iso-8859-2]
SBCS 0.14311144: [windows-1250] SBCS: 0.1431114 [windows-1250]
SBCS 0.12198714: [iso-8859-1] SBCS: 0.1219871 [iso-8859-1]
SBCS 0.12198714: [windows-1252] SBCS: 0.1219871 [windows-1252]
SBCS 0.09350189: [iso-8859-3] SBCS: 0.09350189 [iso-8859-3]
SBCS 0.14065312: [iso-8859-3] SBCS: 0.1406531 [iso-8859-3]
SBCS 0.14065312: [iso-8859-9] SBCS: 0.1406531 [iso-8859-9]
SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0: [windows-1256] SBCS: 0.00 [windows-1256]
SBCS 0.084189065: [viscii] SBCS: 0.08418906 [viscii]
SBCS 0.057199046: [windows-1258] SBCS: 0.05719905 [windows-1258]
SBCS 0.1763441: [iso-8859-15] SBCS: 0.1763441 [iso-8859-15]
SBCS 0.1763441: [iso-8859-1] SBCS: 0.1763441 [iso-8859-1]
SBCS 0.1763441: [windows-1252] SBCS: 0.1763441 [windows-1252]
SBCS 0.09554723: [iso-8859-13] SBCS: 0.09554723 [iso-8859-13]
SBCS 0.09554723: [iso-8859-10] SBCS: 0.09554723 [iso-8859-10]
SBCS 0.09554723: [iso-8859-4] SBCS: 0.09554723 [iso-8859-4]
SBCS 0.09578463: [iso-8859-13] SBCS: 0.09578463 [iso-8859-13]
SBCS 0.09578463: [iso-8859-10] SBCS: 0.09578463 [iso-8859-10]
SBCS 0.09578463: [iso-8859-4] SBCS: 0.09578463 [iso-8859-4]
SBCS 0.09340608: [iso-8859-1] SBCS: 0.09340608 [iso-8859-1]
SBCS 0.09340608: [iso-8859-9] SBCS: 0.09340608 [iso-8859-9]
SBCS 0.09340608: [iso-8859-15] SBCS: 0.09340608 [iso-8859-15]
SBCS 0.09340608: [windows-1252] SBCS: 0.09340608 [windows-1252]
SBCS 0.24882: [iso-8859-3] SBCS: 0.24882 [iso-8859-3]
SBCS 0.095001444: [windows-1250] SBCS: 0.09500144 [windows-1250]
SBCS 0.095001444: [iso-8859-2] SBCS: 0.09500144 [iso-8859-2]
SBCS 0.13669409: [x-mac-ce] SBCS: 0.1366941 [x-mac-ce]
SBCS 0.1854423: [ibm852] SBCS: 0.1854423 [ibm852]
SBCS 0.081335865: [windows-1250] SBCS: 0.08133586 [windows-1250]
SBCS 0.081335865: [iso-8859-2] SBCS: 0.08133586 [iso-8859-2]
SBCS 0.13743466: [x-mac-ce] SBCS: 0.1374347 [x-mac-ce]
SBCS 0.1760888: [ibm852] SBCS: 0.1760888 [ibm852]
SBCS 0.13817607: [windows-1250] SBCS: 0.1381761 [windows-1250]
SBCS 0.13817607: [iso-8859-2] SBCS: 0.1381761 [iso-8859-2]
SBCS 0.13817607: [iso-8859-13] SBCS: 0.1381761 [iso-8859-13]
SBCS 0.12337148: [iso-8859-16] SBCS: 0.1233715 [iso-8859-16]
SBCS 0.21631232: [x-mac-ce] SBCS: 0.2163123 [x-mac-ce]
SBCS 0.3023013: [ibm852] SBCS: 0.3023013 [ibm852]
SBCS 0.47388184: [iso-8859-1] SBCS: 0.4738818 [iso-8859-1]
SBCS 0.47388184: [iso-8859-4] SBCS: 0.4738818 [iso-8859-4]
SBCS 0.47388184: [iso-8859-9] SBCS: 0.4738818 [iso-8859-9]
SBCS 0.47388184: [iso-8859-13] SBCS: 0.4738818 [iso-8859-13]
SBCS 0.47388184: [iso-8859-15] SBCS: 0.4738818 [iso-8859-15]
SBCS 0.47388184: [windows-1252] SBCS: 0.4738818 [windows-1252]
SBCS 0.13686267: [iso-8859-1] SBCS: 0.1368627 [iso-8859-1]
SBCS 0.13686267: [iso-8859-3] SBCS: 0.1368627 [iso-8859-3]
SBCS 0.13686267: [iso-8859-9] SBCS: 0.1368627 [iso-8859-9]
SBCS 0.13686267: [iso-8859-15] SBCS: 0.1368627 [iso-8859-15]
SBCS 0.13686267: [windows-1252] SBCS: 0.1368627 [windows-1252]
SBCS 0.08758995: [windows-1250] SBCS: 0.08758995 [windows-1250]
SBCS 0.08758995: [iso-8859-2] SBCS: 0.08758995 [iso-8859-2]
SBCS 0.08758995: [iso-8859-13] SBCS: 0.08758995 [iso-8859-13]
SBCS 0.08798097: [iso-8859-16] SBCS: 0.08798097 [iso-8859-16]
SBCS 0.12607843: [x-mac-ce] SBCS: 0.1260784 [x-mac-ce]
SBCS 0.16955028: [ibm852] SBCS: 0.1695503 [ibm852]
SBCS 0.37495747: [windows-1252] SBCS: 0.3749575 [windows-1252]
SBCS 0.37495747: [windows-1257] SBCS: 0.3749575 [windows-1257]
SBCS 0.37495747: [iso-8859-4] SBCS: 0.3749575 [iso-8859-4]
SBCS 0.37495747: [iso-8859-13] SBCS: 0.3749575 [iso-8859-13]
SBCS 0.37495747: [iso-8859-15] SBCS: 0.3749575 [iso-8859-15]
SBCS 0.093210384: [iso-8859-1] SBCS: 0.09321038 [iso-8859-1]
SBCS 0.093210384: [iso-8859-9] SBCS: 0.09321038 [iso-8859-9]
SBCS 0.093210384: [iso-8859-15] SBCS: 0.09321038 [iso-8859-15]
SBCS 0.093210384: [windows-1252] SBCS: 0.09321038 [windows-1252]
SBCS 0.09317723: [windows-1250] SBCS: 0.09317723 [windows-1250]
SBCS 0.09317723: [iso-8859-2] SBCS: 0.09317723 [iso-8859-2]
SBCS 0.09317723: [iso-8859-16] SBCS: 0.09317723 [iso-8859-16]
SBCS 0.18036576: [ibm852] SBCS: 0.1803658 [ibm852]
SBCS 0.09312218: [windows-1250] SBCS: 0.09312218 [windows-1250]
SBCS 0.09312218: [iso-8859-2] SBCS: 0.09312218 [iso-8859-2]
SBCS 0.09312218: [iso-8859-16] SBCS: 0.09312218 [iso-8859-16]
SBCS 0.13316554: [x-mac-ce] SBCS: 0.1331655 [x-mac-ce]
SBCS 0.18025918: [ibm852] SBCS: 0.1802592 [ibm852]
SBCS 0.23395953: [iso-8859-1] SBCS: 0.2339595 [iso-8859-1]
SBCS 0.23395953: [iso-8859-4] SBCS: 0.2339595 [iso-8859-4]
SBCS 0.23395953: [iso-8859-9] SBCS: 0.2339595 [iso-8859-9]
SBCS 0.23395953: [iso-8859-15] SBCS: 0.2339595 [iso-8859-15]
SBCS 0.23395953: [windows-1252] SBCS: 0.2339595 [windows-1252]
SBCS Group found best match [iso-8859-1] confidence 0.47388184.

This is consistent with the Finnish model: https://github.com/CharsetDetector/UTF-unknown/blob/d52af8ddb903a4b72d18a564473069e323818b82/src/Core/Probers/SBCSGroupProber.cs#L197-L203

Now the problem is the same as in #77