Open msdobrescu opened 5 years ago
It seems uchardet has improved detection, so it should be ported into this project.
Do you mean the Mozilla Universal Charset Detector? Too bad this project is a refactor of a port; re-porting could be a lot of work.
Possibly, but detection is almost unusable in my case. There is no port of uchardet, which is now hosted by freedesktop. Look here: https://gitlab.freedesktop.org/uchardet/uchardet. It would be worth adding more to it; too bad it is a bit too hardcoded.
Would you accept some new languages ported from the uchardet project?
Now, with v2.3.0:
Detected encoding iso-8859-1 with confidence 0.47388184.
From Status Log:
SBCS 0.47388184: [iso-8859-1] SBCS: 0.4738818 [iso-8859-1]
SBCS 0.47388184: [iso-8859-4] SBCS: 0.4738818 [iso-8859-4]
SBCS 0.47388184: [iso-8859-9] SBCS: 0.4738818 [iso-8859-9]
SBCS 0.47388184: [iso-8859-13] SBCS: 0.4738818 [iso-8859-13]
SBCS 0.47388184: [iso-8859-15] SBCS: 0.4738818 [iso-8859-15]
SBCS 0.47388184: [windows-1252] SBCS: 0.4738818 [windows-1252]
This is consistent with the Finnish model: https://github.com/CharsetDetector/UTF-unknown/blob/d52af8ddb903a4b72d18a564473069e323818b82/src/Core/Probers/SBCSGroupProber.cs#L197-L203
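To make the tie in the status log easier to see, here is a minimal Python sketch that groups the reported candidates by confidence. The log text is copied from the output above; the regular expression and the `group_by_confidence` helper are hypothetical illustrations, not part of UTF-unknown.

```python
import re
from collections import defaultdict

# Status log lines copied verbatim from the report above.
STATUS_LOG = """\
SBCS 0.47388184: [iso-8859-1] SBCS: 0.4738818 [iso-8859-1]
SBCS 0.47388184: [iso-8859-4] SBCS: 0.4738818 [iso-8859-4]
SBCS 0.47388184: [iso-8859-9] SBCS: 0.4738818 [iso-8859-9]
SBCS 0.47388184: [iso-8859-13] SBCS: 0.4738818 [iso-8859-13]
SBCS 0.47388184: [iso-8859-15] SBCS: 0.4738818 [iso-8859-15]
SBCS 0.47388184: [windows-1252] SBCS: 0.4738818 [windows-1252]
"""

# Assumed line shape: "SBCS <confidence>: [<encoding>] ..."
LINE_RE = re.compile(r"^SBCS\s+([0-9.]+):\s+\[([^\]]+)\]")


def group_by_confidence(log: str) -> dict:
    """Map each confidence value to the encodings reported with it."""
    groups = defaultdict(list)
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            groups[float(match.group(1))].append(match.group(2))
    return dict(groups)


groups = group_by_confidence(STATUS_LOG)
for confidence, encodings in groups.items():
    print(confidence, encodings)
```

All six candidates land in a single group, so the confidence value alone does not favor any one of them here.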
Now the problem is the same as in #77
Hello, I am trying to identify the encoding of a file (my.txt) that should be windows-1252, but the detector finds a better match for windows-1255. The file contains, for instance, the byte C5, which is Å in windows-1252, yet it is identified as windows-1255, which does not contain that character at all.
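The claim about byte C5 can be checked independently of the detector. The following is a small sketch using only Python's built-in `cp1252` and `cp1255` codecs; the `encodable` helper is a hypothetical name introduced for illustration.

```python
# The single byte mentioned in the report.
RAW = b"\xc5"

# Decode the same byte under both code pages; errors="replace" keeps the
# sketch from raising if a code page leaves the byte undefined.
as_1252 = RAW.decode("cp1252", errors="replace")
as_1255 = RAW.decode("cp1255", errors="replace")


def encodable(ch: str, codec: str) -> bool:
    """Return True if the character exists in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(repr(as_1252), repr(as_1255))
print(encodable("Å", "cp1252"), encodable("Å", "cp1255"))
```

The byte decodes to Å only under windows-1252, and Å cannot be encoded in windows-1255 at all, which matches the observation that windows-1255 is an implausible result for this file.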