Open davidmatson opened 1 year ago
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.
| Author: | davidmatson |
|---|---|
| Assignees: | - |
| Labels: | `api-suggestion`, `area-System.Text.Json` |
| Milestone: | - |
Thanks. Because we have too many open issues describing variations of the same problem, I'm going to keep this one open and close #42847, #87138, #86810 and #86805. Feel free to incorporate any context from the other issues that you feel is missing so that it can be reviewed holistically.
This API proposal is for a minimal encoder, and it states that it is up to the caller to further escape content correctly for embedding in whatever other container language they need.
However, there's a problem with that for embedding in HTML inside 'script islets', i.e. inside `<script type="application/json">` tags. Anything goes inside such tags, except a closing `</script>` tag (including all possible permutations involving whitespace etc.).
The way to escape HTML content is to use entity encoding, e.g. to escape `<` use either the named `&lt;` or the numeric `&#60;` escape. However, this will specifically not work inside `<script>` tags.
E.g. while
<script type="application/json">
{ "foo" : "&lt;/script>" }
</script>
is suitably enough escaped, when the content is read back via the DOM (e.g. via `.innerText`, `.text` or `.textContent`) it will still produce the verbatim string `{ "foo" : "&lt;/script>" }`, with the encoded character remaining.
It's decidedly non-trivial to decide what entity-encoding signifies an encoded parameter that needs decoding - and what doesn't. Maybe your JSON contains a series of resource string translations for a technical editing application that talks about how to represent HTML entities and was meant to contain an entity-encoded example?
So to get this right, your encoder has to encode all occurrences of entity-like sequences, and then you have to decode them again when reading the content back.
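The ambiguity described above can be illustrated with a minimal sketch. It assumes the consumer entity-decodes whatever it reads back, and it uses `System.Net.WebUtility` only as a stand-in for an HTML entity encoder and decoder; the sample payload is hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical payload: JSON whose string value legitimately contains
// the literal text "&lt;" next to a real '<'.
string original = "{\"example\":\"Write &lt; for <\"}";

// Naive approach: entity-encode only '<', then decode everything on read.
// The literal "&lt;" that was already in the data gets decoded as well.
string naiveEncoded = original.Replace("<", "&lt;");
string naiveDecoded = WebUtility.HtmlDecode(naiveEncoded);

// Full approach: encode every significant character (including '&'),
// so that decoding restores the input exactly.
string fullEncoded = WebUtility.HtmlEncode(original);
string fullDecoded = WebUtility.HtmlDecode(fullEncoded);

Console.WriteLine(naiveDecoded == original); // False
Console.WriteLine(fullDecoded == original);  // True
```

The full approach only works if both the writer and the reader agree on the extra encode and decode steps, which is the burden the comment above is pointing at.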
It would be better if the new API proposal were extended to allow specifying additional characters that should be encoded as `\u` escape sequences. In that case, callers who know the JSON is destined to be placed inside such HTML script islets could add `<` to the set of characters to encode and have them encoded directly within the JSON itself, and that would suffice.
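As a rough sketch of the effect being asked for, the same result can be approximated today by post-processing the serialized text. This is not the proposed API; it relies on the assumption that the input is well-formed JSON, where `<` can only occur inside string literals, so replacing it with its `\u` escape leaves the parsed value unchanged.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;

string data = "</script> and <b>";

// Serialize with the relaxed encoder, which leaves '<' unescaped.
var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
};
string relaxed = JsonSerializer.Serialize(data, options);

// Replace every '<' with its JSON \u escape so that no "</script>"
// sequence can appear in the output.
string scriptSafe = relaxed.Replace("<", "\\u003C");

Console.WriteLine(relaxed);
Console.WriteLine(scriptSafe);

// The escaped form still parses back to the original string.
var roundTripped = JsonSerializer.Deserialize<string>(scriptSafe);
Console.WriteLine(roundTripped == data);
```

Because the escaping lives in the JSON itself, the reader needs no extra decoding step: any JSON parser turns `\u003C` back into `<`.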
@rjgotten - that's a very interesting case.
From looking at the API surface, I believe the same question applies to the existing `JavaScriptEncoder.UnsafeRelaxedJsonEscaping`.
string data = "</script>";
string json = JsonSerializer.Serialize(data, new JsonSerializerOptions {
Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping });
Console.WriteLine(json);
produces:
"</script>"
> Would be better if the new API proposal would be extended to allow specifying additional characters that should be encoded with \u encoded sequences. In that case, callers that know the JSON is destined to be placed inside such HTML script islets could add < to the to-be-encoded set of characters and have them be encoded directly within the JSON itself - and then that would suffice.
I think that's an interesting option to consider. I'd tend to leave that functionality out of this API for simplicity. As far as I can tell, JavaScriptEncoder.UnsafeRelaxedJsonEscaping also does not escape these characters and cannot be customized to escape them (without subclassing) - I'd tend to do the same here.
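For comparison, here is a small sketch of an existing alternative: `JavaScriptEncoder.Create(UnicodeRanges.All)`. To the best of my knowledge, encoders built this way keep allowed non-ASCII characters readable but still escape HTML-sensitive characters such as `<`, so they sit between the default encoder and the relaxed one. Note that it escapes more than a minimal encoder would.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

string data = "</script> é";

// Allow every Unicode range. The encoder's built-in block list still applies,
// so HTML-sensitive characters such as '<' are escaped regardless.
var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.Create(UnicodeRanges.All)
};
string json = JsonSerializer.Serialize(data, options);

Console.WriteLine(json);
```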
Inspired by this problem, I made https://github.com/osexpert/ExtremeJsonEncoders. Inspired by @rjgotten, I also added an extraAsciiEscapeChars option to MinimalJsonEncoder (limiting it to ASCII made it easier and faster; it can get away with a 128-entry bool[] map). The code is a ripoff/fork of code in the dotnet runtime.
Tagging subscribers to this area: @dotnet/area-system-text-encodings-web See info in area-owners.md if you want to be subscribed.
@eiriktsarpalis why transfer this to encodings.web? It seems useful outside of the context of web. I think it'd be nice to have in STJ (where it would be used) rather than a different package (requiring an additional reference).
It's just where all the encoding implementations are located.
Background and motivation
There's no built-in implementation that allows characters from all languages to be kept readable, and unnecessary escaping to be avoided when the caller knows recipients parse JSON correctly.
For additional context, see:
#42847
#86800
#87138
API Proposal
PR #87147 has additional implementation details.
API Usage
Or, to force hex escapes (\uxxxx) rather than two-character escapes (for example, \"):
Alternative Designs
No response
Risks
Similar to `JavaScriptEncoder.UnsafeRelaxedJsonEscaping`, but see #87138.
Callers need to ensure two things: