arabindamoni / wikixmlj

Automatically exported from code.google.com/p/wikixmlj

unicode characters #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Parse the XML file.
2. Observe that the Unicode characters are decoded incorrectly.

I'm not sure how this wasn't noticed as a serious error before.
I also don't know what I'm doing differently that makes the fix necessary,
or why the parser would work for others without it. Strange.

For me, passing "UTF8" as the encoding to each InputStreamReader fixed
everything: the Unicode characters are now read in correctly.

    // Builds the InputSource for the dump, decoding every variant as UTF-8.
    protected InputSource getInputSource() throws Exception
    {
        BufferedReader br = null;

        // Without an explicit charset, InputStreamReader uses the platform
        // default encoding. "UTF8" is Java's legacy alias for UTF-8.
        if(wikiXMLFile.endsWith(".gz")) {
            br = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(wikiXMLFile)), "UTF8"));
        } else if(wikiXMLFile.endsWith(".bz2")) {
            FileInputStream fis = new FileInputStream(wikiXMLFile);
            // Skip the leading "B", "Z" bytes written by the command-line tools.
            byte[] ignoreBytes = new byte[2];
            fis.read(ignoreBytes);
            br = new BufferedReader(new InputStreamReader(
                    new CBZip2InputStream(fis), "UTF8"));
        } else {
            // Plain, uncompressed XML file.
            br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(wikiXMLFile), "UTF8"));
        }

        return new InputSource(br);
    }
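
A likely explanation for why this works for others: when no charset is given, InputStreamReader decodes with the platform default encoding, so the problem would only show up on machines whose default is not UTF-8. The sketch below illustrates that effect in isolation. It is not code from wikixmlj: the class name CharsetDemo, the helper readLine, and the sample string are made up for illustration, and ISO-8859-1 is only a stand-in for a non-UTF-8 platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: decodes the bytes with the given charset
    // and returns the first line of text.
    static String readLine(byte[] bytes, Charset charset) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "Zürich", written with an escape so the source file encoding does not matter.
        String original = "Z\u00fcrich";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips cleanly.
        String good = readLine(utf8Bytes, StandardCharsets.UTF_8);

        // Decoding the same bytes as ISO-8859-1 (standing in for a non-UTF-8
        // platform default) turns the two-byte sequence for "ü" into two characters.
        String bad = readLine(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println(original.equals(good)); // true
        System.out.println(original.equals(bad));  // false
        System.out.println(bad.length());          // 7, one more than the original 6
    }
}
```

The "platform default" line shows what an InputStreamReader without an explicit charset would use on a given machine. If it already reports UTF-8, the unpatched parser would presumably appear to work there.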

Original issue reported on code.google.com by ianupri...@gmail.com on 1 Apr 2010 at 4:37