albfernandez / juniversalchardet

Originally exported from code.google.com/p/juniversalchardet

ISO-8859-1 detected incorrectly (Hebrew or Thai)! #22

Closed: sxilderik closed this issue 6 years ago

sxilderik commented 6 years ago

Hello. I did a simple, basic test, and I'm not very happy with the result. What am I doing wrong?

In pom.xml:

    <dependencies>
        <!-- https://mvnrepository.com/artifact/com.github.albfernandez/juniversalchardet -->
        <dependency>
            <groupId>com.github.albfernandez</groupId>
            <artifactId>juniversalchardet</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>

In source code:

Basically, I encode a String into an array of bytes with a given Charset, then use UniversalDetector to guess the charset that was used. I'm lenient: I don't expect the exact Charset, but at least I expect a Charset that can successfully encode and decode the initial string and give that string back! It fails this simple test, as "àéèÇ" encoded in ISO-8859-1 is guessed as Hebrew (windows-1255), and "aeaCàéèÇ" as Thai (TIS-620), and neither of those Charsets even contains those accented characters!

    @Test
    public void test_decodeBytes() {

        final String string = "aeaCàêäÇ";
        Charset s;
        byte[] bytes;

        // Note: the Charset-based overloads of getBytes() and new String()
        // never throw UnsupportedEncodingException, so no try/catch is needed.
        bytes = string.getBytes(StandardCharsets.ISO_8859_1);
        s = this.guessCharset(bytes); // detected charset = TIS-620, Thai charset ???!!!
        Assert.assertEquals(string, new String(string.getBytes(s), s)); // FAILS of course!

        bytes = string.getBytes(StandardCharsets.UTF_8);
        s = this.guessCharset(bytes);
        Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
    }

    private Charset guessCharset(final byte[] bytes) {

        final UniversalDetector detector = new UniversalDetector();
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        // Note: getDetectedCharset() returns null when nothing was detected,
        // which would make Charset.forName() throw.
        return Charset.forName(detector.getDetectedCharset());
    }
sxilderik commented 6 years ago

Actually, I realize that guessing the charset used in an array of bytes is a different task when you know the original string than when you don't. Or at least it's easier to be unhappy about the result!

For example, "àéè".getBytes("ISO-8859-1") gives the bytes 0xE0, 0xE9, 0xE8. This array of bytes can very well be interpreted as Hebrew, giving "איט".

new String("àéè".getBytes("ISO-8859-1"), "Windows-1255");
     (java.lang.String) איט

This very short array of bytes can be successfully decoded using a wide variety of Charsets; you have no way of knowing which one was used in the first place.

So you pick one. I don't know why Hebrew or Thai are picked first, but they are legitimate choices.

An improvement might be to not pick just one, but to return all Charsets that successfully pass the decoding check, something like the sketch below. The task of picking one would be left to the caller, not the callee...
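
A minimal sketch of that idea, assuming a plain round-trip check over the JVM's installed charsets (the plausibleCharsets helper is hypothetical and not part of juniversalchardet):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical helper: return every installed Charset that can
    // round-trip the given bytes, instead of a single guess. This only
    // filters out charsets that reject the byte sequence outright; it
    // cannot tell which of the survivors was the original encoding.
    private static List<Charset> plausibleCharsets(final byte[] bytes) {
        final List<Charset> candidates = new ArrayList<>();
        for (final Charset cs : Charset.availableCharsets().values()) {
            if (!cs.canEncode()) {
                continue; // skip decode-only charsets
            }
            try {
                final CharsetDecoder decoder = cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
                final String decoded = decoder.decode(ByteBuffer.wrap(bytes)).toString();
                if (Arrays.equals(decoded.getBytes(cs), bytes)) {
                    candidates.add(cs); // bytes survive a decode/encode round trip
                }
            } catch (final CharacterCodingException e) {
                // bytes are not valid in this charset; not a candidate
            }
        }
        return candidates;
    }

For a 3-byte input like the example above, this list would still be long, which is exactly the ambiguity described here.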

Your code is working as intended; it's me who put too much hope in it and did not realize that what I hoped for was in fact impossible.

albfernandez commented 6 years ago

Hi

Latin-1 (windows-1252 / ISO-8859-1) is detected by statistical analysis, so your input confuses it: there are too many accented characters. A simpler, more realistic example such as "Château" works fine.

In short, the analyser thinks the data cannot be Latin because it has too many accented characters, and because of that it gives windows-1252 the minimum weight of 0.0. So your suggestion of returning a list of possible detected charsets would not work :(

albfernandez commented 6 years ago
    // Test case for https://github.com/albfernandez/juniversalchardet/issues/22
    // With fewer accented characters, detection improves
    @Test
    public void testDecodeBytesBetterStats() {

        final String string = "Château";
        Charset s;
        byte[] bytes;

        bytes = string.getBytes(StandardCharsets.UTF_8);
        s = this.guessCharset(bytes);
        Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS

        bytes = string.getBytes(StandardCharsets.ISO_8859_1);
        s = this.guessCharset(bytes);
        Assert.assertEquals(string, new String(string.getBytes(s), s)); // SUCCESS
    }

    private Charset guessCharset(final byte[] bytes) {
        final UniversalDetector detector = new UniversalDetector();
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        return Charset.forName(detector.getDetectedCharset());
    }
albfernandez commented 6 years ago

Also, detecting the encoding of short data is harder than detecting it in a large file, so it is more error-prone.
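
To illustrate, a minimal sketch using the same UniversalDetector API as above; the French sample sentence is an assumption for illustration, and longer, more typical text gives the statistical analysers more evidence to work with:

    // Illustrative only: a longer, more ordinary ISO-8859-1 text sample.
    byte[] longData = ("Le château se trouve à côté de la rivière, "
            + "près d'une forêt très agréable en été.")
            .getBytes(StandardCharsets.ISO_8859_1);

    UniversalDetector detector = new UniversalDetector();
    detector.handleData(longData, 0, longData.length);
    detector.dataEnd();
    // With more typical text, a Latin charset such as WINDOWS-1252 is a
    // plausible result; the exact output depends on the detector's statistics.
    System.out.println(detector.getDetectedCharset());
    detector.reset(); // reset() prepares the detector for reuse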