Closed Spikhalskiy closed 9 years ago
JSON does NOT allow use of ISO-8859-1 per the specification; only UTF-8, UTF-16 and UTF-32 are supported. As a result, anything that does not look like UTF-16 or UTF-32 is taken to be UTF-8, with decoding and error detection done on that basis.
If you want to use non-conforming content like that (despite it not being legal JSON, strictly speaking), you will need to construct a Reader explicitly.
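The detection scheme described above can be sketched in plain Java. This is an illustrative sketch of the RFC 4627-style heuristic (inferring the encoding from the pattern of zero bytes among the first four bytes), not Jackson's actual implementation; it shows why ISO-8859-1 input is indistinguishable from UTF-8 at this stage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch of RFC 4627-style JSON encoding detection: JSON text starts with
// ASCII characters, so zero bytes among the first four reveal UTF-16/UTF-32.
// ISO-8859-1 contains no zero bytes, so it falls through to the UTF-8 default.
public class JsonEncodingSniffer {
    public static Charset sniff(byte[] first4) {
        if (first4.length >= 4) {
            if (first4[0] == 0 && first4[1] == 0 && first4[2] == 0) return Charset.forName("UTF-32BE");
            if (first4[1] == 0 && first4[2] == 0 && first4[3] == 0) return Charset.forName("UTF-32LE");
        }
        if (first4.length >= 2) {
            if (first4[0] == 0) return StandardCharsets.UTF_16BE;
            if (first4[1] == 0) return StandardCharsets.UTF_16LE;
        }
        return StandardCharsets.UTF_8; // anything else is assumed to be UTF-8
    }

    public static void main(String[] args) {
        byte[] latin1 = "{\"a\":\"ã\"}".getBytes(StandardCharsets.ISO_8859_1);
        // no zero bytes, so ISO-8859-1 input is misidentified as UTF-8:
        System.out.println(sniff(latin1)); // UTF-8
    }
}
```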
@cowtowncoder Can you post a simple example of how to construct this Reader? I'm also developing a JSON client that receives data in ISO-8859-1, and I would like to parse it.
@ypriverol
JsonFactory jsonFactory = new JsonFactory();
JsonParser parser = jsonFactory.createParser(new InputStreamReader(inStream, charset));
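The two lines above rely on the JDK doing the decoding before Jackson ever sees the data. As a self-contained illustration of that step (plain JDK only, no Jackson dependency; the JSON payload and class name are made up for the example), here is the same InputStreamReader wrapping applied to an ISO-8859-1 byte stream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// InputStreamReader converts the ISO-8859-1 bytes to chars, so by the time
// a parser consumes the Reader, the encoding problem is already solved.
public class Latin1ReaderDemo {
    public static String readAll(byte[] bytes) throws Exception {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
            return sb.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1Json = "{\"name\":\"João\"}".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAll(latin1Json)); // {"name":"João"}
    }
}
```

Passing such a Reader to `jsonFactory.createParser(Reader)` bypasses Jackson's byte-level encoding detection entirely.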
Thanks, I will try this example. I think you should include a simple test for it in the library, because some Windows-based servers provide JSON in ISO-8859-1 and the current implementation of the JSON parser does not handle it. I will let you know if it works.
@Spikhalskiy have you tried using this inside a RestTemplate?
Not sure if this is relevant here, but the Jackson unit test suite tries to ensure that both the Reader and InputStream backends are equally well tested, as the code used differs in many places.
Tests do not use ISO-8859-1, since it is technically an invalid encoding (at least according to the original spec), but there is no real reason why use of a UTF-8-backed reader would behave meaningfully differently.
@ypriverol I don't see any reason why, if it works for me in general, it shouldn't work in a RestTemplate, but I haven't actually tried it there. @cowtowncoder Very strange that you dismiss the need for Jackson to support ISO-8859-1. People use Jackson not for the specification, and not to check whether their JSON is valid, but for real life, and it's common in projects for HTTP to carry ISO-8859-1 as the standard encoding. So Jackson should support it and have related tests; though in any case a Reader converts ISO-8859-1 to UTF-16 chars outside of Jackson, so there is not too much to test.
@Spikhalskiy You did not seem to understand what I am saying wrt tests. Actual usage, as it must be done with Jackson, is via a Reader, and Reader access is tested. Testing specifically with Latin-1 would only test the JDK InputStreamReader's decoding capabilities, and I trust those to work just fine.
But I also think people really, really, REALLY should not use ISO-8859-1 for anything. I would go as far as saying that only an idiot would consciously choose it as an encoding in this day and age. I know there are idiots designing and building systems all over the place, but it is wrong to encourage bad practices. If I were younger and more idealistic I might even try to explicitly make its usage more difficult.
If support for an officially unsupported encoding (wrt JSON) is desired, there is always XML, which supports a wide variety of encodings, including EBCDIC.
@cowtowncoder I agree, but we already have many servers and projects responding with ISO-8859-1, and it's still a standard, so in our case we do not define the charset; we have to work with what we get. I'm also not sure a library, as opposed to a full-stack framework, should impose and propagate architecture decisions and "religious" views in any case.
@cowtowncoder @Spikhalskiy I understand your point about not supporting ISO-8859-1 in Jackson, but my idea is to provide a couple of example cases of how to convert a full ISO-8859-1 JSON output to something like UTF-*. I have been looking for a solution in many threads, nothing works, and this is a big issue in practice.
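One minimal way to do that conversion (a sketch, assuming the payload's charset is known to be ISO-8859-1; the class and method names are made up for the example) is to decode with the known charset and re-encode as UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Re-encode an ISO-8859-1 payload as UTF-8 before handing it to any
// UTF-8-only consumer: decode with the known charset, re-encode as UTF-8.
public class Latin1ToUtf8 {
    public static byte[] toUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] in = "{\"v\":\"ã\"}".getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = toUtf8(in);
        // ã is one byte (0xE3) in ISO-8859-1 but two bytes (0xC3 0xA3) in UTF-8
        System.out.println(in.length + " -> " + out.length); // 9 -> 10
    }
}
```

Note that this buffers the whole payload as a String; for large streams, wrapping the InputStream in an InputStreamReader with the known charset (as shown earlier in the thread) avoids the intermediate copy.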
@Spikhalskiy there are external libraries available that can do heuristic detection (and dynamic application) of the actual encoding, so if you have no encoding information available you will probably want to use one. This:
http://stackoverflow.com/questions/9181530/auto-detect-character-encoding-in-java
has a reasonable discussion. Odd that there is no automatic Reader wrapper over InputStreamReader to automate the process.
@ypriverol I would be fully open to PRs for improved READMEs and/or javadocs.
@cowtowncoder Thanks for the link! But my current task is fully solved by the code I provided above. I know the charset of the stream, so there is no need to autodetect it. I just got stuck on the javadoc about autodetection, which didn't actually work for me; a reader with the precise encoding solved my issue. I think it would be good to improve the javadoc with a list of the encodings supported for autodetection; it's not at all obvious, at least to me, from the current javadoc that ISO-8859-1 is not supported. Maybe I will prepare this minor PR.
@cowtowncoder More than happy to provide my solution: basically, read the InputStream, convert it to UTF-8, and then pass the result to RestTemplate, and everything works perfectly. You could add this to the javadoc and README so other users find the solution quickly. What do you think?
@Spikhalskiy ah ok. Sorry for misreading and taking the thread in the wrong direction. Updates to javadocs make perfect sense.
@ypriverol javadoc makes sense; README too, if there's an appropriate place for it (if not, the Wiki).
Hello,
JsonFactory#createParser(InputStream in) has the comment:
Looks like it doesn't work, and we have no way to explicitly set the encoding.
Below is a demo for a stream with ISO-8859-1 content while our application encoding is UTF-8 (the common default for Java).
Description: Hex dump - a hex dump of a valid JSON string payload encoded in ISO-8859-1; it contains the ã symbol, which is encoded differently in UTF-8 and ISO-8859-1.
As a result we will get
What's going on? We have the sequence "ã_" in the stream, which is "E3 5F" in ISO-8859-1. If we try to read this as UTF-8, "E3" is 1110 0011 in binary, which marks the lead byte of a three-byte UTF-8 sequence rather than two single-byte chars; the following byte, 5F (0101 1111), is not a valid 10xxxxxx continuation byte, so decoding fails with an exception.
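This failure can be reproduced with the JDK's strict decoder alone (a plain-JDK sketch; the class name is made up for the example), confirming that the exception comes from UTF-8 decoding rules, not from any JSON-level logic:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// 0xE3 announces a three-byte UTF-8 sequence, but 0x5F (0101 1111) is not a
// valid 10xxxxxx continuation byte, so strict UTF-8 decoding must fail.
public class Utf8MisdecodeDemo {
    public static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed input, e.g. ISO-8859-1 bytes read as UTF-8
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "ã_".getBytes(StandardCharsets.ISO_8859_1); // 0xE3 0x5F
        System.out.println(decodesAsUtf8(latin1));                            // false
        System.out.println(decodesAsUtf8("ã_".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```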