Open GoogleCodeExporter opened 9 years ago
Using Tika's AutoDetectReader would be an elegant way of doing it.
Original comment by digitalpebble
on 19 Jan 2015 at 5:00
We should use [https://code.google.com/p/forbidden-apis/] to detect such cases.
Will open a new issue for that.
Original comment by digitalpebble
on 23 Jan 2015 at 9:05
Note to self : for the processXml(URL sitemapUrl, byte[] xmlContent) method we
should do :
BOMInputStream bomIs = new BOMInputStream(new ByteArrayInputStream(xmlContent));
InputSource is = new InputSource();
is.setByteStream(bomIs);
// no knowledge of encoding -> let SAX guess if from the XML header and assume it is correct
return processXml(sitemapUrl, is);
as passing a stream will let SAX detect the encoding from the XML header. Might
not always be correct but at least it will be better than constructing a String
with the default Locale.
For processText() we will need to let the user specify a charset (e.g. the one
returned by the server) and/or detect the correct encoding.
Original comment by digitalpebble
on 23 Jan 2015 at 3:16
In theory [http://www.sitemaps.org/protocol.html#escaping] the files should be
in UTF-8 with the content URL-escaped. Maybe we should simply ignore such cases
even if they are found in practice.
Note : it would still be better not to rely on the default charset and let SAX
detect the charset specified in the header.
Original comment by digitalpebble
on 23 Jan 2015 at 4:31
I've modified the code to use UTF-8 as the charset, when processing the byte
array. It's a trivial change, but I don't know of an easy way to test this
(force default charset to be explicitly not UTF-8), so that we could then test
the results of parsing something with text that's not just 7-bit ascii. Any
ideas?
Original comment by kkrugler...@transpac.com
on 26 Jan 2015 at 1:34
Trivial patch as per above
Original comment by kkrugler...@transpac.com
on 26 Jan 2015 at 1:35
Attachments:
Using StandardCharsets.UTF_8 is a nicer approach : it avoids a lookup based on
the String representation of the charset but also does not throw a
UnsupportedEncodingException, which makes the code a bit nicer.
For the XML files, what about using is.setByteStream(bomIs) instead? It would
use the charset defined in the XML header, which is better than forcing to
UTF-8 (even though that's what it should be). For text, let's just assume UTF-8
indeed.
Tests : Could have documents in ISO-1 with incompatible symbols.
Original comment by digitalpebble
on 26 Jan 2015 at 6:14
I think StandardCharsets is Java 1.7, but our target build is 1.6. So we'd need
to make the decision to bail on Java 1.6 compatibility first.
Re using the setByteStream call - seems fine, only oddity is that then a text
file is forced to be UTF-8, but an XML file could be something else.
Re tests - my question was how to validate that even if the locale had a
default charset of say CP-1252, a properly encoded UTF-8 text file would be
parsed properly. Having someone on Windows validate that the test fails w/o the
fix is a start, but it's not a proper test.
Original comment by kkrugler...@transpac.com
on 26 Jan 2015 at 6:41
Original issue reported on code.google.com by
digitalpebble
on 19 Jan 2015 at 4:51