eclipse-archived / unide

Set UTF-8 as string encoding format for all PPMP message types #23

Closed muelsen closed 6 years ago

muelsen commented 6 years ago

To avoid incompatibilities where the receiver of a PPMP message assumes a different character encoding than the sender used, we should set UTF-8 as the standard encoding for PPMP.

fpatz commented 6 years ago

Shouldn't we simply rely on the existing JSON standards? According to https://www.ietf.org/rfc/rfc4627.txt, JSON is fine with any UTF encoding. Most people probably use UTF-8 anyway, but to me, enforcing the encoding at the schema level on top of JSON itself feels like the wrong layer.

ameinhardt commented 6 years ago

@muelsen: did you mean UTF-8 in particular, or just some UTF encoding? RFC 4627 requires a Unicode encoding and favors UTF-8. Could JSON even be encoded in CESU-8? Maybe we can ignore that; I have never seen it in practice.

muelsen commented 6 years ago

Hi @ameinhardt, @fpatz, currently we are expecting UTF-8 in our interface. How can I determine whether it is UTF-8 or -16 or... when I receive a message?

ameinhardt commented 6 years ago

@muelsen: have you checked RFC 4627, section 3? The first two characters of a JSON text are always ASCII, so at most the first 4 bytes would tell you the encoding. Jackson should be able to handle that for you.
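A minimal sketch of what ameinhardt suggests (an illustrative helper, not part of unide): Jackson's ObjectMapper can read the raw bytes directly and works out the UTF encoding itself, so the receiving side never has to pick a charset by hand.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;

    public class PpmpDecodeSketch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Parse raw message bytes without knowing the charset up front;
        // Jackson inspects the leading bytes and detects UTF-8/16/32 on its own.
        static JsonNode parse(byte[] rawMessage) throws IOException {
            return MAPPER.readTree(rawMessage);
        }
    }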

muelsen commented 6 years ago

@ameinhardt I checked that, but what does it look like? Are these characters placed in front of the JSON object? Do you have an example of that?

ameinhardt commented 6 years ago

Because of the JSON grammar, the first two characters are part of the JSON text itself. Roughly:

    JSON-text = object / array
    value     = false / null / true / object / array / number / string
    object    = { \s string ...
    array     = [ \s value ...

so the first two characters would always be a bracket plus something in the ASCII range.

bgusach commented 6 years ago

@muelsen if we stick to RFC 4627, the first two chars must be ASCII (in our case {", but beware of whitespace), so you can test the first 4 bytes of the stream and see whether it is UTF-8, UTF-16 or UTF-32 (a sketch of that check follows after this comment).

As a side note, there are unfortunately many JSON specifications, and according to some of them the rule that the first two characters are ASCII does not hold (e.g. "ñaaa" is valid JSON according to RFC 7158). But since our payload has an object as its root node, we won't run into that issue.

I would suggest that we define in the spec which JSON standard we are using, and optionally mention which encodings that standard allows.

The first chapter of this article is an interesting read: http://seriot.ch/parsing_json.php
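A minimal sketch of the byte test described above (a hypothetical helper, not part of the spec): it applies the pattern table from RFC 4627, section 3, to the first four bytes, assuming the text starts with two ASCII characters, which holds for PPMP because the root is an object.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class JsonEncodingSniffer {

        // Guess the UTF encoding from the first four bytes of a JSON document,
        // following the null-byte patterns listed in RFC 4627, section 3.
        static Charset sniff(byte[] b) {
            if (b.length < 4) {
                return StandardCharsets.UTF_8; // too short to tell, fall back to the default
            }
            if (b[0] == 0 && b[1] == 0 && b[2] == 0) return Charset.forName("UTF-32BE"); // 00 00 00 xx
            if (b[1] == 0 && b[2] == 0 && b[3] == 0) return Charset.forName("UTF-32LE"); // xx 00 00 00
            if (b[0] == 0 && b[2] == 0)              return StandardCharsets.UTF_16BE;   // 00 xx 00 xx
            if (b[1] == 0 && b[3] == 0)              return StandardCharsets.UTF_16LE;   // xx 00 xx 00
            return StandardCharsets.UTF_8;                                               // xx xx xx xx
        }
    }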

bgusach commented 6 years ago

BTW @muelsen, I think you shouldn't care about these details unless you are building your own JSON parser...

ameinhardt commented 6 years ago

I think we should stick to the most recent JSON definition. RFC 8259 obsoletes the other RFCs above. It's true that it no longer has the JSON-text = object / array restriction and allows JSON-text to be any value. Nevertheless, all PPMP documents are objects, so the first 4 bytes should still identify the UTF encoding. One issue could be the BOM. I suggest we also handle that according to RFC 8259:

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

In short: PPMP is JSON (as currently defined in RFC 8259). For best interoperability, UTF-8 without a BOM is preferred.
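To illustrate the lenient option the RFC allows (a hypothetical reader-side helper, not something the spec requires), a parser could drop a leading UTF-8 BOM before handing the bytes on:

    import java.util.Arrays;

    public class BomStripper {

        // RFC 8259 forbids senders from adding a BOM, but lets parsers ignore one.
        // Strip a leading UTF-8 BOM (EF BB BF) if it is present.
        static byte[] stripUtf8Bom(byte[] raw) {
            if (raw.length >= 3
                    && (raw[0] & 0xFF) == 0xEF
                    && (raw[1] & 0xFF) == 0xBB
                    && (raw[2] & 0xFF) == 0xBF) {
                return Arrays.copyOfRange(raw, 3, raw.length);
            }
            return raw;
        }
    }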

bgusach commented 6 years ago

@ameinhardt,

[ameinhardt] I think we should stick to the most recent JSON definition

Right, but to be precise: to the most recent RFC JSON spec, since there are at least 3 non-RFC JSON specifications.

[ameinhardt] For best interoperability, UTF-8 without a BOM is preferred.

It is not just preferred, it is actually mandatory. Maybe this is a German-English lost-in-translation issue, but MUST NOT means it is not allowed to happen. JSON encoded as UTF-8 with a BOM is broken JSON; if some parsers want to accept it anyway, that is up to them.

Then... actually, RFC 8259 is pretty strict about the encoding. From the RFC:

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.

In short: we state that we stick to RFC 8259, which means UTF-8 only. No UTF-8 with a BOM, no UTF-16, no UTF-32.
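On the sending side this is easy to honor; a minimal sketch (illustrative only, not part of unide) that serializes a payload with Jackson and makes the mandatory UTF-8 encoding explicit:

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.nio.charset.StandardCharsets;

    public class PpmpEncodeSketch {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Emit the message as UTF-8 without a BOM, as RFC 8259 requires for JSON
        // exchanged between open systems. Jackson already writes UTF-8 by default;
        // spelling out the charset here just documents the intent.
        static byte[] toWire(Object payload) throws JsonProcessingException {
            return MAPPER.writeValueAsString(payload).getBytes(StandardCharsets.UTF_8);
        }
    }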

ameinhardt commented 6 years ago

Even better, thanks for the clarification @bgusach! I'll put that into the FAQ.