dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.99k stars 4.66k forks source link

[API Proposal]: Allow custom JavaScriptEncoder that allows HTML-sensitive characters #70419

Open habbes opened 2 years ago

habbes commented 2 years ago

Background and motivation

The OData team is adopting Utf8JsonWriter to improve its JSON serialization performance. Currently it uses a custom-built JsonWriter. To minimize breaking changes and friction to our users, we would like the new writer to be compatible with the existing output as far as the serialized output is concerned. One incompatibility that has emerged is how the two writers handle string escaping:

The OData writer by default escapes control chars (< 0x20), non-ASCII chars (> 0x7F) and characters like ", \, \n, \b, \f, \r, \t.

None of the built-in JavaScriptEncoder implementing matching escaping rules.

The JavaScriptEncoder.Default escapes all the characters the OData writer escapes, but it also escapes HTML-sensitive characters like < and > which OData does not. It also escapes double quote using \u0022 where OData escapes it using a backslash: \".

The JavaScriptEncoder.UnsafeRelaxedJsonEscaping does not escape HTML-sensitive characters, but it also does not escape non-ASCII characters (> 0x7f).

I tried to create a custom TextEncoderSettings object to explicitly allow the characters that I do not want to be escaped. I explicitly allowed characters like < and passed it to JavaScriptEncoder.Create(settings). But the HTML characters were still escaped. It seems like JavaScriptEncoder.Create is hardwired to forbid HTML characters even when the user explicitly allows them.

JavaScriptEncoder.Create calls the constructor of the internal DefaultJavaScriptEncoder(TextEncoderSettings settings, bool allowMinimalJsonEscaping). This creates an OptimizedInboxTextEncoder with the option to forbid HTML characters depending on whether allowMinimalJsonEscaping is set to true or false. This allowMinimalJsonEscaping is set to false when creating an encoder with custom settings. And there does not seem to be any option for the user to enable it.

It would be great if the user had the option to set allowMinimalJsonEscaping to true when calling JavaScriptEncoder.Create, or any alternative that allows the bypassing the HTML escaping.

API Proposal

namespace System.Text.Encodings.Web

public abstract class JavaScriptEncoder : TextEncoder
{
    public static JavaScriptEncoder Create(TextEncoderSettings settings, bool allowMinimalJsonEscaping);
}

API Usage

// Fancy the value
var settings = new TextEncoderSettings();
settings.AllowRange(UnicodeRanges.BasicLatin);
settings.AllowCharacter(...);

var encoder = JavaScriptEncoder.Create(settings, true);

var writer = new Utf8JsonWriter(output, new JsonWriterSettings { Encoder = encoder });

writer.WriterStringValue("A is < B"); // writes "A is < B" instead of "A is \u003C B"

Alternative Designs

Alternatively, you can change the behaviour of JavaScriptEncoder.Create(TextEncoderSettings) such that it does not forbid HTML-sensitive characters. But this would be a breaking change.

Risks

Allow HTML-sensitive characters presents the same risks as using the existing JavaScriptEncoder.UnsafeRelaxedJsonEscaping. Those risks are outlined in these docs. Our use-case is sending a JSON response when application/json; charset = utf-8 header is set.

AspNetCore also uses JavaScriptEncoder.UnsafeRelaxedJsonEscaping for JSON serialization by default.

The new API would essentially be JavaScriptEncoder.UnsafeRelaxedJsonEscaping with a bit more control to escape additional characters.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-text-encodings-web See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation The OData team is adopting `Utf8JsonWriter` to improve its JSON serialization performance. Currently it uses a custom-built `JsonWriter`. To minimize breaking changes and friction to our users, we would like the new writer to be compatible with the existing output as far as the serialized output is concerned. One incompatibility that has emerged is how the two writers handle string escaping: The OData writer by default escapes control chars (< `0x20`), non-ASCII chars (> `0x7F`) and characters like `", \, \n, \b, \f, \r, \t`. None of the built-in `JavaScriptEncoder` implementing matching escaping rules. The `JavaScriptEncoder.Default` escapes all the characters the OData writer escapes, but it also escapes HTML-sensitive characters like `<` and `>` which OData does not. It also escapes double quote using `\u0022` where OData escapes it using a backslash: `\"`. The `JavaScriptEncoder.UnsafeRelaxedJsonEscaping` does not escape HTML-sensitive characters, but it also does not escape non-ASCII characters (> `0x7f`). I tried to create a custom `TextEncoderSettings` object to explicitly allow the characters that I do not want to be escaped. I explicitly allowed characters like `<` and passed it to `JavaScriptEncoder.Create(settings)`. But the HTML characters were still escaped. It seems like `JavaScriptEncoder.Create` is hardwired to forbid HTML characters even when the user explicitly allows them. `JavaScriptEncoder.Create` calls the constructor of the internal [`DefaultJavaScriptEncoder(TextEncoderSettings settings, bool allowMinimalJsonEscaping)`](https://github.com/dotnet/runtime/blob/12a6db44b3f3d69f37f7bf0eb83664e192a0f3e4/src/libraries/System.Text.Encodings.Web/src/System/Text/Encodings/Web/DefaultJavaScriptEncoder.cs#L21). This creates an `OptimizedInboxTextEncoder` with the option to forbid HTML characters depending on whether `allowMinimalJsonEscaping` is set to `true` or `false`. [This `allowMinimalJsonEscaping` is set to false](https://github.com/dotnet/runtime/blob/12a6db44b3f3d69f37f7bf0eb83664e192a0f3e4/src/libraries/System.Text.Encodings.Web/src/System/Text/Encodings/Web/DefaultJavaScriptEncoder.cs#L17) when creating an encoder with custom settings. And there does not seem to be any option for the user to enable it. It would be great if the user had the option to set `allowMinimalJsonEscaping` to `true` when calling `JavaScriptEncoder.Create`, or any alternative that allows the bypassing the HTML escaping. ### API Proposal ```csharp namespace System.Text.Encodings.Web public abstract class JavaScriptEncoder : TextEncoder { public static JavaScriptEncoder Create(TextEncoderSettings settings, bool allowMinimalJsonEscaping); } ``` ### API Usage ```csharp // Fancy the value var settings = new TextEncoderSettings(); settings.AllowRange(UnicodeRanges.BasicLatin); settings.AllowCharacter(...); var encoder = JavaScriptEncoder.Create(settings, true); var writer = new Utf8JsonWriter(output, new JsonWriterSettings { Encoder = encoder }); writer.WriterStringValue("A is < B"); // writes "A is < B" instead of "A is \u003C B" ``` ### Alternative Designs Alternatively, you can change the behaviour of `JavaScriptEncoder.Create(TextEncoderSettings)` such that it does not forbid HTML-sensitive characters. But this would be a breaking change. ### Risks Allow HTML-sensitive characters presents the same risks as using the existing `JavaScriptEncoder.UnsafeRelaxedJsonEscaping`. Those risks are [outlined in these docs](https://docs.microsoft.com/en-us/dotnet/standard/serialization/system-text-json-character-encoding#serialize-all-characters). Our use-case is sending a JSON response when `application/json; charset = utf-8` header is set. AspNetCore also [uses `JavaScriptEncoder.UnsafeRelaxedJsonEscaping`](https://github.com/dotnet/aspnetcore/blob/86e28e7b267579e2999e06b63eeef27e4836384a/src/Mvc/Mvc.Core/src/Formatters/SystemTextJsonOutputFormatter.cs#L40) for JSON serialization by default. The new API would essentially be `JavaScriptEncoder.UnsafeRelaxedJsonEscaping` with a bit more control to escape additional characters.
Author: habbes
Assignees: -
Labels: `api-suggestion`, `area-System.Text.Encodings.Web`, `untriaged`
Milestone: -
habbes commented 2 years ago

If this feature request is approved, I would be happy to contribute the changes (with proper guidance).

danmoseley commented 2 years ago

@dotnet/area-system-text-encodings-web it looks like triage missed this. could we share whether this is something we'd consider? @GrabYourPitchforks may have an opinion here.

GrabYourPitchforks commented 2 years ago

Largely a duplicate of https://github.com/dotnet/runtime/issues/54193.

But the HTML characters were still escaped. It seems like JavaScriptEncoder.Create is hardwired to forbid HTML characters even when the user explicitly allows them.

That is correct. The issue above (see also https://github.com/dotnet/runtime/issues/1564#issuecomment-504780719) goes into this in more detail. That's also why we have the "unsafe relaxed" encoder. We provide two switches: give me the safest thing, or give me something which closely aligns with the spec. We're not terribly interested in providing "mix-and-match" style functionality out of the box since these go down the path of very edge case scenarios.

In OP's example, the reason they want this functionality is that they want an absolute guarantee of \<some output sequence> given \<some input sequence>. But the encoders have never guaranteed output stability for any given input. We change the logic every release to follow best practice and to account for updates to the Unicode spec. Even in .NET Framework - which has extremely strict compat requirements! - we changed how characters were encoded every so often. The API OP wants us to expose would not address their issue of "we want this exact sequence of bytes to be emitted given this input." That's simply a guarantee we intentionally do not fulfill.

If you want to emit a specific sequence of bytes, you can subclass the encoder and override the appropriate members. The issue I linked above contains sample code showing how to do this. The difficulty given the current design is that many of the workhorse methods require you to write unsafe code.

I think it would be sensible to introduce a "simple" interface which allows you to take over the entire encoding process, something like:

public interface ITextEncoder
{
    bool WillEncode(Rune value);
    bool TryEncodeValue(Rune value, Span<char> destination, out int charsWritten);
}

(And in fact this is how the encoders were initially designed all those years back, before they were copied in to .NET Core 1.0.)

habbes commented 2 years ago

@GrabYourPitchforks thanks for following up and thanks for the detailed insights as well as the linked issues. Since the new writer we have implemented in our serializer requires users to explicitly opt-in to it, we have settled on making it clear in documentation that the byte-to-byte output and encoding might differ from the default. We'll also let users pass their preferred JavaScriptEncoder to be used. In the extreme cause maybe, they can subclass JavaScriptEncoder and override the escaping behaviour.

The API you have proposed looks good. Is it likely to make it in the standard library in the foreseeable future?

jeffhandley commented 2 years ago

The API you have proposed looks good. Is it likely to make it in the standard library in the foreseeable future?

This isn't in consideration for .NET 7. @habbes Would you like to refine the original proposal here to align with @GrabYourPitchforks' proposal to see if this could get queued up for API Review?

ogjkoch commented 6 months ago

Has there been any movement on this? I'm (unfortunately) working on migrating a large, old .NET Framework ASP.Net project to .NET 8 and we're having interminable problems with not having anything 100% equivalent to Json.Encode from Framework and no way to get there.