Charset autodetection fail

Spikhalskiy commented 9 years ago

Hello,

JsonFactory#createParser(InputStream in) has the comment:

Note: no encoding argument is taken since it can always be auto-detected as suggested by JSON RFC.

Looks like it doesn't work and now we have no ability to explicitly setup encoding.

Below I explain demo for stream with ISO-8859-1 content when our application encoding is UTF-8 (common for java).

Description: Hex dump - hex dump of valid json string payload encoded in ISO-8859-1, contains ã symbol, which has different encoding in UTF-8 and ISO-8859-1

String hexDump = <hex dump >;
byte[] bytes = BaseEncoding.base16().decode(hexDump.toUpperCase()); //load hex dump to byte array
System.out.println("Try to read print it with UTF-8" + new String(bytes)); //we see ? - something can't be decoded
System.out.println("Better (ISO-8859-1): " + new String(bytes, Charset.forName("ISO-8859-1"))); //do it with right encoding and will get valid json dump in console

//trying to create json parser
JsonFactory jf = new JsonFactory();
JsonParser parser = jf.createParser(new ByteArrayInputStream(bytes));
System.out.println(parser.getClass().getSimpleName());  //it will be UTF8StreamJsonParser, which already seems to be incorrect
 while (parser.nextToken() != null) {}

As a result we will get

com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5f
 at [Source: java.io.ByteArrayInputStream@373bc99b; line: 1, column: 764]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3470)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3477)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipUtf8_3(UTF8StreamJsonParser.java:3353)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipString(UTF8StreamJsonParser.java:2547)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:690)

What's going on? We have sequence "ã_" in the stream, which is "E35F" in ISO-8859-1 If try to read this as UTF-8 - "E3" will be 1110 0011, which means that it's one 16-bytes char instead of 2 chars by 8 bytes, which is already incorrect + there is no char with E35F code in utf8 - exception.

cowtowncoder commented 9 years ago

JSON does NOT allow use of ISO-8859-1 as per specification; only UTF-8, UTF-16 and UTF-32 are supported. As a result, anything that does not look like UTF-16 or UTF-32 is taken to be UTF-8, with resulting decoding and error detection.

If you want to use non-conforming content like that (despite it not being legal JSON, strictly speaking), you will need to construct a Reader explicitly.

ypriverol commented 9 years ago

@cowtowncoder Can you put a simple example How to extend or Construct this Reader. I'm Developing also a json client with the data in ISO-8859-1 and I would like to parse it.

Spikhalskiy commented 9 years ago

@ypriverol

JsonFactory jsonFactory = new JsonFactory();
JsonParser parser = jsonFactory.createParser(new InputStreamReader(inStream, charset));

ypriverol commented 9 years ago

@Thanks I will try to test this example, I think you should include a simple test in the library with it, because some windows-based servers provide JSON in ISO- and the current implementation of the JSON parser do not work. I will let you know if works

ypriverol commented 9 years ago

@Spikhalskiy have you try to use this inside a RestTemplate?

cowtowncoder commented 9 years ago

Not sure if this is relevant here, but Jackson unit test suite tries to ensure that both Reader and InputStream backends are equally well tested, as code used differs in many places. Tests do not use ISO-8859-1 since this is technically invalid encoding (at least according to the original spec), but there is no real reason why use of UTF-8 backed reader would have meaningful differences.

Spikhalskiy commented 9 years ago

@ypriverol I don't see any reasons why if it works for me in general, it shouldn't works in RestTemplate, but technically no. @cowtowncoder Very strange that you repudiate a necessity of supporting ISO-8859-1 by Jackson. People use Jackson not for specification and not to check if this JSON is valid or not, but for real life and it's common usage in project which HTTP and ISO-8859-1 as a standard encoding. So, Jackson should have support for it and related tests, but in any case Reader will convert ISO-8859-1 to UTF-16 and it's not a part of Jackson, so not too much for testing.

cowtowncoder commented 9 years ago

@Spikhalskiy You did not seem to understand what I am saying wrt tests. Actual usage as it must be done with Jackson is via Reader, and Reader access is tested. To test it specifically with Latin-1 would only test JDK InputStreamReaders decoding capabilities. I trust that to work just fine.

But I also think people really, really, REALLY should not use ISO-8859-1 for anything. I would go as far as to saying that only an idiot would consciously choose it as encoding at this time and day. I know there are idiots designing and building systems all over the place, but it is wrong to encourage bad practices. If I was younger and more idealistic I might even try to explicitly make usage more difficult.

If support for officially non-supported encoding (wrt JSON) is desired, there is always XML which supports a wide variety of encodings using EBCDIC.

Spikhalskiy commented 9 years ago

@cowtowncoder I agree, but we have already many servers and projects responding with ISO-8859-1 and it's still a standard, so in our case - not we define charset, we should work with what we have. Not sure if library if it's not a full-stack framework should impose and propagate any architecture decisions and "religious" views in any case.

ypriverol commented 9 years ago

@cowtowncoder @Spikhalskiy I understand your point of not supporting ISO-8859-1 in the Jackson but my idea is to provide a couple of example cases about How to convert a full ISO-8859-1 json output to something like UTF-* I have been looking for the solution in many threads and nothing works and as this is a big true HTTP and ISO-8859-1 as a standard encoding. So, Jackson should have support for it and related tests, but in any case Reader will convert ISO-8859-1 to UTF-16 and it's not a part of Jackson, so not too much for testing.

cowtowncoder commented 9 years ago

@Spikhalskiy there are external libraries available that can do heuristic detection (and dynamic application) of trying to guess actual encoding, so if you have no encoding information available you will probably want to use that. This:

http://stackoverflow.com/questions/9181530/auto-detect-character-encoding-in-java

has reasonable discussion. Odd if there is no automatic Reader wrapping over InputStreamReader to automate the process.

@ypriverol I would be fully open for ORs for improved READMEs and/or javadocs

Spikhalskiy commented 9 years ago

@cowtowncoder Thanks for link! But my current task fully solved by code that I provided above. I know the charset of stream, so there is no need to autodetect it. I just stuck in javadoc about autodetection which actually didn't work for me, reader with precise encoding solved my issue. I think it will be good to improve javadoc with list of encodings supported for autodetection, it's absolutely not obvious at least for me from current javadoc that ISO-8859-1 is not supported. I will prepare this minor PR maybe.

ypriverol commented 9 years ago

@cowtowncoder More than happy to provide you my solution, basically is download the InsputStream convert it to UTF and them pass the result to ResTTemplate and everything works perfectly. Yo can add this clearly to the javadoc and README for other users to find the solution quickly. What do you think?

cowtowncoder commented 9 years ago

@Spikhalskiy ah ok. Sorry for misreading and taking thread into wrong directions. Updates to javadocs make perfect sense.

@ypriverol javadoc makes sense; README too if there's appropriate place for that (if not, Wiki).

FasterXML / jackson-core

Charset autodetection fail #222