dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.35k stars 4.74k forks source link

[API Proposal]: UnicodeJsonEncoder #87153

Open davidmatson opened 1 year ago

davidmatson commented 1 year ago

Background and motivation

There's no built-in implementation that allows characters from all languages to be kept readable, and unnecessary escaping to be avoided when the caller knows recipients parse JSON correctly.

For additional context, see:

42847

86800

87138

API Proposal

namespace System.Text.Encodings.Web
{
    internal sealed class UnicodeJsonEncoder : JavaScriptEncoder
    {
        internal static readonly UnicodeJsonEncoder Singleton = new UnicodeJsonEncoder();

        private readonly bool _preferHexEscape;
        private readonly bool _preferUppercase;

        public UnicodeJsonEncoder()
            : this(preferHexEscape: false, preferUppercase: false)
        {
        }

        public UnicodeJsonEncoder(bool preferHexEscape, bool preferUppercase)
        {
            _preferHexEscape = preferHexEscape;
            _preferUppercase = preferUppercase;
        }

        // Implementations of base class members.
    }
}
namespace System.Text.Encodings.Web
{
    public abstract class JavaScriptEncoder : TextEncoder
    {
        // Existing members

        public static JavaScriptEncoder Unicode => UnicodeJsonEncoder.Singleton;
    }
}

PR #87147 has additional implementation details.

API Usage

// some typed variable with the JSON object to serialize, called "data"
string json = JsonSerializer.Serialize(data new JsonSerializerOptions {  Encoder = JavaScriptEncoder.Unicode });

Or, to force hex escapes (\uxxxx) rather than two-character escapes (for example, \"):

// some typed variable with the JSON object to serialize, called "data"
string json = JsonSerializer.Serialize(data new JsonSerializerOptions {  Encoder = new UnicodeJsonEncoder(preferHexEscape: true, preferUppercase; false) }); // or other values for those bools

Alternative Designs

No response

Risks

Similar to UnsafeRelaxedJsonEncoder, but see #87138.

Callers need to ensure two things:

  1. If this JSON output is embedded inside another language (HTML, SQL, C#, etc.), the text is correctly escaped according to that language's requirements.
  2. The recipient and intermediaries follow the JSON spec correctly (can handle any Unicode characters that are unescaped like they do when such characters are escaped).
ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation There's no built-in implementation that allows characters from all languages to be kept readable, and unnecessary escaping to be avoided when the caller knows recipients parse JSON correctly. For additional context, see: #42847 #86800 #87138 ### API Proposal ```csharp namespace System.Text.Encodings.Web { internal sealed class UnicodeJsonEncoder : JavaScriptEncoder { internal static readonly UnicodeJsonEncoder Singleton = new UnicodeJsonEncoder(); private readonly bool _preferHexEscape; private readonly bool _preferUppercase; public UnicodeJsonEncoder() : this(preferHexEscape: false, preferUppercase: false) { } public UnicodeJsonEncoder(bool preferHexEscape, bool preferUppercase) { _preferHexEscape = preferHexEscape; _preferUppercase = preferUppercase; } // Implementations of base class members. } } ``` ```csharp namespace System.Text.Encodings.Web { public abstract class JavaScriptEncoder : TextEncoder { // Existing members public static JavaScriptEncoder Unicode => UnicodeJsonEncoder.Singleton; } } ``` PR #87147 has additional implementation details. ### API Usage ```csharp // some typed variable with the JSON object to serialize, called "data" string json = JsonSerializer.Serialize(data new JsonSerializerOptions { Encoder = JavaScriptEncoder.Unicode }); ``` Or, to force hex escapes (\uxxxx) rather than two-character escapes (for example, \"): ```csharp // some typed variable with the JSON object to serialize, called "data" string json = JsonSerializer.Serialize(data new JsonSerializerOptions { Encoder = new UnicodeJsonEncoder(preferHexEscape: true, preferUppercase; false) }); // or other values for those bools ``` ### Alternative Designs _No response_ ### Risks Similar to UnsafeRelaxedJsonEncoder, but see ##87138. Callers need to ensure two things: 1. If this JSON output is embedded inside another language (HTML, SQL, C#, etc.), the text is correctly escaped according to that language's requirements. 2. The recipient and intermediaries follow the JSON spec correctly (can handle any Unicode characters that are unescaped like they do when such characters are escaped).
Author: davidmatson
Assignees: -
Labels: `api-suggestion`, `area-System.Text.Json`
Milestone: -
eiriktsarpalis commented 1 year ago

Thanks. Because we have too many open issues describing variations of the same problem, I'm going to keep this one open and close #42847, #87138, #86810 and #86805. Feel free to incorporate any context from the other issues that you feel is missing so that it can be reviewed holistically.

rjgotten commented 1 year ago

This API proposal is for a minimal encoder, stating it leaves it up to the caller to further escape content correctly for embedding in whatever other container language they need.

However, there's a problem with that for embedding in HTML inside 'script islets' - i.e. inside <script type="application/json"> tags. Anything goes inside such tags, except a closing </script> tag. (Including all possible permutations involving white-space etc.)

The way to escape HTML content is to use entity encoding, e.g. to escape < use either the named &lt; or the numeric &#60; escape. However, this will specifically not work inside <script> tags.

E.g. while

<script type="application/json">
{ "foo" : "&lt;/script>" }
</script>

is suitably enough escaped, when the content is read back via the DOM (e.g. via .innerText; .text or .textContent ) it will still verbatim produce the string { "foo" : "&lt;/script>" } with the encoded character remaining.

It's decidedly non-trivial to decide what entity-encoding signifies an encoded parameter that needs decoding - and what doesn't. Maybe your JSON contains a series of resource string translations for a technical editing application that talks about how to represent HTML entities and was meant to contain an entity-encoded example?

So to get this right, your encoder has to encode all occurences of entity-like sequences and then you have to decode them again when attempting to read this stuff back.

Would be better if the new API proposal would be extended to allow specifying additional characters that should be encoded with \u encoded sequences. In that case, callers that know the JSON is destined to be placed inside such HTML script islets could add < to the to-be-encoded set of characters and have them be encoded directly within the JSON itself - and then that would suffice.

davidmatson commented 1 year ago

@rjgotten - that's a very interesting case.

From looking at the API surface, I believe the same question applies to the existing JavaScriptEncoder.UnsafeRelaxedJsonEscaping.

string data = "</script>";
string json = JsonSerializer.Serialize(data, new JsonSerializerOptions {
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping });
Console.WriteLine(json);

produces:

"</script>"

Would be better if the new API proposal would be extended to allow specifying additional characters that should be encoded with \u encoded sequences. In that case, callers that know the JSON is destined to be placed inside such HTML script islets could add < to the to-be-encoded set of characters and have them be encoded directly within the JSON itself - and then that would suffice.

I think that's an interesting option to consider. I'd tend to leave that functionality out of this API for simplicity. As far as I can tell, JavaScriptEncoder.UnsafeRelaxedJsonEscaping also does not escape these characters and cannot be customized to escape them (without subclassing) - I'd tend to do the same here.

osexpert commented 1 month ago

Inspired by this problem, I made https://github.com/osexpert/ExtremeJsonEncoders. Inspired by @rjgotten I also added extraAsciiEscapeChars option to MinimalJsonEncoder (limiting it to ascii made it easier/faster, can get away with a 128 bool[] map). The code is a ripoff/fork from code in dotnet runtime.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-text-encodings-web See info in area-owners.md if you want to be subscribed.

gregsdennis commented 1 month ago

@eiriktsarpalis why transfer this to encodings.web? It seems useful outside of the context of web. I think it'd be nice to have in STJ (where it would be used) rather than a different package (requiring an additional reference).

eiriktsarpalis commented 1 month ago

It's just where all the encoding implementations are located.