albfernandez / juniversalchardet

Originally exported from code.google.com/p/juniversalchardet

Always added 'ZWNBSP' symbol at the beginning of UTF-16BE encoded file. #41

Closed TheBoringDev closed 2 years ago

TheBoringDev commented 3 years ago

First, I have a JSON file whose encoding is UTF-16BE.

Next, I use the sample code to read the JSON file's content as a byte array, detect its encoding, and convert the bytes back to the original string using the detected encoding.

public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];
        java.io.InputStream fis = java.nio.file.Files.newInputStream(java.nio.file.Paths.get("UTF16BE.json"));
        UniversalDetector detector = new UniversalDetector();

        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }

        detector.dataEnd();

        String encoding = detector.getDetectedCharset();       

        detector.reset();

        Charset charset = Charset.forName(encoding);
        String out = new String(buf, charset);
        System.out.printf(out);
    }

But the output string has a weird "ZWNBSP" symbol added at the beginning, and I have no idea what it is. I expected it not to be there.

I can replicate this issue with UTF-16BE and UTF-16LE, but not with ANSI or UTF-8.

TheBoringDev commented 3 years ago

I consulted Professor Google, and ZWNBSP is the Zero Width No-Break Space Unicode character (U+FEFF). It seems this character is always added at the beginning of the encoded Unicode string (?)
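
For readers following along, here is a minimal sketch of where the extra character can come from on the Java side. It assumes a six-byte sample standing in for the file (the BOM followed by "{}"); the class name BomDemo and the sample bytes are made up for illustration and are not part of juniversalchardet. Java's explicit UTF-16BE charset decodes a leading FE FF as the character U+FEFF, while the generic UTF-16 charset uses it to pick the byte order and consumes it.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Illustrative sample: BOM (FE FF) followed by "{}" in UTF-16BE.
    static final byte[] DATA = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x7B, 0x00, 0x7D};

    // Explicit big-endian charset: the BOM is decoded as the character U+FEFF.
    static String decodeExplicit() {
        return new String(DATA, StandardCharsets.UTF_16BE);
    }

    // Generic UTF-16 charset: the BOM selects the byte order and is consumed.
    static String decodeGeneric() {
        return new String(DATA, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String explicit = decodeExplicit();
        String generic = decodeGeneric();
        // Three characters: U+FEFF, '{', '}'
        System.out.println(explicit.length());
        System.out.println(explicit.charAt(0) == '\uFEFF');
        // Two characters: '{', '}'
        System.out.println(generic.length());
    }
}
```

So the character is not added by the decoder; it is already in the file as its first two bytes, and the explicit big-endian charset simply keeps it.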

albfernandez commented 3 years ago

All UTF-16 files may start with a byte order mark: the bytes 0xFE 0xFF ([-2, -1] as signed Java bytes) or 0xFF 0xFE ([-1, -2]). That is what your editor is showing as ZWNBSP, just as it shows the unused buffer bytes at the end as 'NUL'. In this specific case you can remove the first two bytes and the nulls at the end.
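
A minimal sketch of that manual cleanup, using only the JDK. The helper name decodeWithoutBom is hypothetical, and the sample bytes stand in for the file. It assumes the whole file fits in a single read, so that buf still holds the start of the file and nread is the number of valid bytes. Decoding only the first nread bytes avoids the trailing NULs, and dropping a leading U+FEFF removes the BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomStrip {
    // Hypothetical helper: decode only the first `length` bytes of `buf`
    // and drop a leading U+FEFF (the decoded BOM) if one is present.
    static String decodeWithoutBom(byte[] buf, int length, Charset charset) {
        String text = new String(buf, 0, length, charset);
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Simulate one read into a 4096-byte buffer: BOM + "{}" in UTF-16BE.
        byte[] buf = new byte[4096];
        byte[] file = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x7B, 0x00, 0x7D};
        System.arraycopy(file, 0, buf, 0, file.length);
        int nread = file.length;

        String out = decodeWithoutBom(buf, nread, StandardCharsets.UTF_16BE);
        System.out.println(out);
    }
}
```

For files larger than the buffer, the bytes would need to be accumulated across reads (or the file read in full) before decoding, since each read overwrites the previous contents of buf.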

You can also use org.mozilla.universalchardet.ReaderFactory.createBufferedReader(file). It detects the charset and creates a reader that skips the BOM bytes if they are present in UTF files.