Closed — sxilderik closed this issue 6 years ago
Actually, I realize that guessing the charset used in an array of bytes is a different task when you know the original string than when you don't. Or at least it's easier to be unhappy about the result!
For example, "àéè".getBytes("ISO-8859-1")
gives [0xE0, 0xE9, 0xE8] (that is, [-32, -23, -24] as signed Java bytes)
This array of bytes can very well be interpreted as Hebrew, giving "איט".
new String("àéè".getBytes("ISO-8859-1"), "Windows-1255");
(java.lang.String) איט
This very short array of bytes can be successfully decoded with a wide variety of Charsets; you have no way of knowing which one was used in the first place.
So you pick one. I don't know why Hebrew or Thai are picked first, but they are legitimate candidates.
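The ambiguity is easy to demonstrate with the JDK alone: the same three bytes decode without error under several single-byte charsets, each giving a different, perfectly "valid" string. A minimal sketch (TIS-620 is guarded because not every JRE ships it):

```java
import java.nio.charset.Charset;

public class DecodeAmbiguity {
    public static void main(String[] args) {
        // "àéè" encoded in ISO-8859-1
        byte[] bytes = {(byte) 0xE0, (byte) 0xE9, (byte) 0xE8};
        for (String name : new String[]{"ISO-8859-1", "windows-1255", "windows-1251", "TIS-620"}) {
            if (Charset.isSupported(name)) {
                // Every one of these charsets maps all three byte values,
                // so each decoding "succeeds" with a different string.
                System.out.println(name + " -> " + new String(bytes, Charset.forName(name)));
            }
        }
    }
}
```

Nothing in the bytes themselves distinguishes the Latin reading from the Hebrew or Cyrillic one; only statistics over longer input can.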
An improvement might be not to pick just one, but to return all Charsets that successfully pass the decoding check. Picking one would then be left to the caller, not the callee...
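That caller-side filtering can be sketched with the JDK alone; the helper name `plausibleCharsets` and the candidate list are hypothetical, not part of juniversalchardet's API. A strict CharsetDecoder rejects any charset in which the bytes are malformed or unmappable:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class CandidateCharsets {
    // Hypothetical helper: returns every candidate charset that decodes the bytes without error.
    static List<Charset> plausibleCharsets(byte[] bytes, List<Charset> candidates) {
        List<Charset> result = new ArrayList<>();
        for (Charset cs : candidates) {
            CharsetDecoder dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                dec.decode(ByteBuffer.wrap(bytes)); // throws if the bytes are invalid in cs
                result.add(cs);
            } catch (CharacterCodingException e) {
                // bytes are not valid in this charset; skip it
            }
        }
        return result;
    }
}
```

For the three bytes above, UTF-8 would be rejected (0xE0 starts a three-byte sequence that 0xE9 cannot continue), while every single-byte charset that maps all 256 values would pass, which is exactly why the list stays long for short input.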
Your code is working as intended; it's me who put too much hope in it and did not realize that what I hoped for was in fact impossible.
Hi
Latin-1 (windows-1252 / ISO-8859-1) is detected by statistical analysis, so your input confuses the detector: it contains too many accented characters. A simpler, more realistic example such as "Château" works fine.
In short, the analyser thinks the data cannot be Latin because it has too many accented chars, and because of that it gives windows-1252 the minimum weight, 0.0. So your suggestion of returning a list of possible detected charsets would not work :(
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.junit.Assert;
import org.junit.Test;
import org.mozilla.universalchardet.UniversalDetector;

// Test case for https://github.com/albfernandez/juniversalchardet/issues/22
// With fewer accented characters, detection improves
@Test
public void testDecodeBytesBetterStats() {
    final String string = "Château";
    Charset s;
    byte[] bytes;
    bytes = string.getBytes(StandardCharsets.UTF_8);
    s = this.guessCharset(bytes);
    Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
    bytes = string.getBytes(StandardCharsets.ISO_8859_1);
    s = this.guessCharset(bytes);
    Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
}

private Charset guessCharset(final byte[] bytes) {
    final UniversalDetector detector = new UniversalDetector();
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // Note: getDetectedCharset() returns null when nothing was detected,
    // which would make Charset.forName() throw.
    return Charset.forName(detector.getDetectedCharset());
}
Also, detecting the encoding of short data is harder than detecting it in a large file, so it is more error-prone.
Hello, I did a simple basic test, and I'm not very happy with the result. What am I doing wrong?
In pom.xml:
In source code:
Basically, I encode a String into an array of bytes with a given Charset. I then use UniversalDetector to guess the charset used. I'm lenient: I don't expect the exact Charset, but at least I expect a Charset that can successfully encode and decode the initial string, giving back that string! It fails this simple test, as
"àéèÇ"
encoded in ISO-8859-1 is guessed as Hebrew (Windows-1255), and "aeaCàéèÇ"
as Thai (TIS-620), neither of those Charsets even containing those accented chars!
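The lenient round-trip expectation can be checked without the detector at all. A sketch using only the JDK (guarded in case windows-1255 is not installed in the JRE) shows that the guessed Hebrew charset cannot even encode the accented characters, so the round trip necessarily fails:

```java
import java.nio.charset.Charset;

public class RoundTripCheck {
    public static void main(String[] args) {
        String original = "àéèÇ";
        if (!Charset.isSupported("windows-1255")) {
            System.out.println("windows-1255 not installed in this JRE");
            return;
        }
        Charset guessed = Charset.forName("windows-1255"); // what the detector reported
        // false: Hebrew code pages contain no Latin accented letters
        System.out.println(guessed.newEncoder().canEncode(original));
        // Unmappable characters become '?', so decoding the bytes back
        // cannot recover the original string
        String roundTrip = new String(original.getBytes(guessed), guessed);
        System.out.println(original.equals(roundTrip));
    }
}
```

This is exactly the "lenient" check described above: even if one accepts any charset the detector proposes, a charset that cannot represent the input's characters can never pass it.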