[Sitemaps] Parser assumes file encoding is the same as the one used by default on JVM

amir-jakoby / crawler-commons

Automatically exported from code.google.com/p/crawler-commons

0 stars 0 forks source link

[Sitemaps] Parser assumes file encoding is the same as the one used by default on JVM #63

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago

SiteMapParser creates Readers from the byte[] content by assuming that the file 
encoding is the same as the one use by default on the JVM. 

We should use Tika's charset detector (or ICU4J directly) to detect the charset 
used prior to creating the Readers.

Will attach a test class shortly.

Original issue reported on code.google.com by digitalpebble on 19 Jan 2015 at 4:51

GoogleCodeExporter commented 8 years ago

Using Tika's AutoDetectReader would be an elegant way of doing it.

Original comment by digitalpebble on 19 Jan 2015 at 5:00

GoogleCodeExporter commented 8 years ago

We should use [https://code.google.com/p/forbidden-apis/] to detect such cases. 
Will open a new issue for that.

Original comment by digitalpebble on 23 Jan 2015 at 9:05

GoogleCodeExporter commented 8 years ago

Note to self : for the processXml(URL sitemapUrl, byte[] xmlContent) method we 
should do :

        BOMInputStream bomIs = new BOMInputStream(new ByteArrayInputStream(xmlContent));
        InputSource is = new InputSource();
        is.setByteStream(bomIs);
        // no knowledge of encoding -> let SAX guess if from the XML header and assume it is correct
        return processXml(sitemapUrl, is);

as passing a stream will let SAX detect the encoding from the XML header. Might 
not always be correct but at least it will be better than constructing a String 
with the default Locale.

For processText() we will need to let the user specify a charset (e.g. the one 
returned by the server) and/or detect the correct encoding.

Original comment by digitalpebble on 23 Jan 2015 at 3:16

GoogleCodeExporter commented 8 years ago

In theory [http://www.sitemaps.org/protocol.html#escaping] the files should be 
in UTF-8 with the content URL-escaped. Maybe we should simply ignore such cases 
even if they are found in practice.

Note : it would still be better not to rely on the default charset and let SAX 
detect the charset specified in the header.

Original comment by digitalpebble on 23 Jan 2015 at 4:31

GoogleCodeExporter commented 8 years ago

I've modified the code to use UTF-8 as the charset, when processing the byte 
array. It's a trivial change, but I don't know of an easy way to test this 
(force default charset to be explicitly not UTF-8), so that we could then test 
the results of parsing something with text that's not just 7-bit ascii. Any 
ideas?

Original comment by kkrugler...@transpac.com on 26 Jan 2015 at 1:34

GoogleCodeExporter commented 8 years ago

Trivial patch as per above

Original comment by kkrugler...@transpac.com on 26 Jan 2015 at 1:35

Attachments:

crawler-commons-63.patch

GoogleCodeExporter commented 8 years ago

Using StandardCharsets.UTF_8 is a nicer approach : it avoids a lookup based on 
the String representation of the charset but also does not throw a 
UnsupportedEncodingException, which makes the code a bit nicer.

For the XML files, what about using is.setByteStream(bomIs) instead? It would 
use the charset defined in the XML header, which is better than forcing to 
UTF-8 (even though that's what it should be). For text, let's just assume UTF-8 
indeed.

Tests : Could have documents in ISO-1 with incompatible symbols.

Original comment by digitalpebble on 26 Jan 2015 at 6:14

GoogleCodeExporter commented 8 years ago

I think StandardCharsets is Java 1.7, but our target build is 1.6. So we'd need 
to make the decision to bail on Java 1.6 compatibility first.

Re using the setByteStream call - seems fine, only oddity is that then a text 
file is forced to be UTF-8, but an XML file could be something else.

Re tests - my question was how to validate that even if the locale had a 
default charset of say CP-1252, a properly encoded UTF-8 text file would be 
parsed properly. Having someone on Windows validate that the test fails w/o the 
fix is a start, but it's not a proper test.

Original comment by kkrugler...@transpac.com on 26 Jan 2015 at 6:41