create better separation of concerns for unicode encoding and json formatting

chrisspre commented 1 year ago

The existing option "escapeNonAscii" seems to be a quick solution for some corner cases but blurs the difference between encoding unicode text into UTF-8 and formatting JSON content.

The JSON standard requires UTF-8 https://datatracker.ietf.org/doc/html/rfc8259#section-8.1 and requires only quotation mark, reverse solidus, and the control characters to be escaped https://datatracker.ietf.org/doc/html/rfc8259. JSON libraries that implement the standard should not need to have any more characters escaped other than the ones required.

So, the configuration escapeNonAscii is not strictly necessary, but rather allows non conformant clients to handle JSON request/response bodies. We need to make a conscious decision if we need to maintain such configuration and estimate the value and cost of maintaining it. (Especially considering the work to allow other (custom) JSON serializers.

Secondly escapeNonAscii is very limited and allows only two choices: escape non, escape a fixed set of characters (code points) It is worth investigating if a more flexible configuration that gives clients more control over the set of escape characters would be more maintainable.

habbes commented 1 year ago

ODL supports UTF-8 and UTF-16. Regarding the escaping, there have been some interesting discussions around what should be escaped by default, how much flexibility to offer users and how to go about it:

Some of the concerns do relate to security.

My personal view is that we should not proceed with offering more escaping options. I think we should let people use built-in encoders like the JavaScriptEncoder. If we would like to offer extensibility beyond what JavaScriptEncoder allows, then maybe it would make more to propose changes or contributions. Our current string escaping logic is less efficient, and it might not be worthwhile to try to improve it if there's already something built-in that works.

chrisspre commented 1 year ago

ad 1)

RFC 8259 Section 8.1 JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8

ad 2) my point is that we should either not support any option (since any standard conformant client is covered) OR support more options that actually help a decent amount of non-conformant clients and not just one.

corranrogue9 commented 1 year ago

Breaking change because the ask is explicitly about remove the configurability. Probably an investigation about if we truly want to use the .NET implementation will help us understand answers to other questions.

corranrogue9 commented 1 year ago

@habbes it seems you and @chrisspre might have differing ideas about this, please work with each other during the investigation.

corranrogue9 commented 1 year ago

Our dependency on the .NET implementation has causes us issues in the past with routing in WebApi. John might have more context.

OData / odata.net

create better separation of concerns for unicode encoding and json formatting #2474