dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.26k stars 4.73k forks source link

[System.Text.Json] No way to not encode Emoji characters in json #54193

Closed RamjotSingh closed 2 years ago

RamjotSingh commented 3 years ago

When serializing json using System.Text.Json, emojis get encoded to \u format. For example. 📈 is serialized to \uD83D\uDCC8

Typically you can stop System.Text.Json from enforcing strong rules using Relaxed encoder. This however is not true for emojis which are always encoded. We have clients who have been using this for years and things have worked but moving to System.Text.Json will make a behavior change which we can't expect clients to adapt to (we have an API interface with millions of developers developing on).

I would expect relaxed encoder to not force encoding emoji or have an option to set the encoder to not encode emojis.

ghost commented 3 years ago

Tagging subscribers to this area: @eiriktsarpalis, @layomia See info in area-owners.md if you want to be subscribed.

Issue Details
When serializing json using System.Text.Json, emojis get encoded to `\u` format. For example. 📈 is serialized to `\uD83D\uDCC8` Typically you can stop System.Text.Json from enforcing strong rules using Relaxed encoder. This however is not true for emojis which are always encoded. We have clients who have been using this for years and things have worked but moving to System.Text.Json will make a behavior change which we can't expect clients to adapt to (we have an API interface with millions of developers developing on). I would expect relaxed encoder to not force encoding emoji or have an option to set the encoder to not encode emojis.
Author: RamjotSingh
Assignees: -
Labels: `area-System.Text.Json`, `untriaged`
Milestone: -
ghost commented 3 years ago

Tagging subscribers to this area: @tarekgh, @eiriktsarpalis, @layomia See info in area-owners.md if you want to be subscribed.

Issue Details
When serializing json using System.Text.Json, emojis get encoded to `\u` format. For example. 📈 is serialized to `\uD83D\uDCC8` Typically you can stop System.Text.Json from enforcing strong rules using Relaxed encoder. This however is not true for emojis which are always encoded. We have clients who have been using this for years and things have worked but moving to System.Text.Json will make a behavior change which we can't expect clients to adapt to (we have an API interface with millions of developers developing on). I would expect relaxed encoder to not force encoding emoji or have an option to set the encoder to not encode emojis.
Author: RamjotSingh
Assignees: -
Labels: `area-System.Text.Encodings.Web`, `untriaged`
Milestone: -
GrabYourPitchforks commented 3 years ago

Related: https://github.com/dotnet/runtime/issues/37364, and this has also come up before in issues like https://github.com/dotnet/runtime/issues/42847.

The tricky part here is to figure out exactly what behavior is appropriate to expose to developers. The title of this issue says that emojis shouldn't be encoded, but the text of the issue says "will make a behavior change which we can't expect clients to adapt to." This strongly implies that this issue isn't really about emoji characters, but about having an output that is byte-for-byte identical to Newtonsoft.

In that case, is the request really "allow all chars except those that expressly don't require escaping per the RFC?" We could certainly ship such an API, but even that has several problems. First, per https://github.com/dotnet/runtime/issues/1564#issuecomment-504780719, even the world's most heavily used web services find such a policy too relaxed and too prone to misuse. Second, because such an encoder would utilize a deny list instead of an allow list, using it in production software would be a violation of Microsoft's SDL policy. (Last I checked, SDL mandated use of allow list-based encoders unless you seek a policy exception.)

So maybe the answer is "well, extend only the unsafe relaxed encoder with support for emoji chars." Which at first glance would work, but this would carry a heavy performance penalty, in terms of both CPU time and memory utilization. For reference, the unsafe relaxed encoder uses an 8 KB lookup table (64 kilobits, one bit for each char in the BMP) to know which chars are valid to go out on the wire unescaped. If we extend this to allow the full range of Unicode chars, this lookup table blows up to 139 KB in size (and increases the size of the DLL correspondingly). Maybe the answer is to use a multi-level lookup table like what CharUnicodeInfo does, but now you're trading off memory footprint for increased CPU time and cache thrashing.

I'm not trying to minimize the concern. Just responding that the problem is much more complex than it first appears. Often the requestor needs more than just "allow emojis", or they're trying to work around some other bug that simple emoji support won't fix, or there's some other constraint that complicates our ability to provide a solution that works in their specific environment.

All of that said, the existing encoder infrastructure is extensible enough to allow nearly any scenario to work. The following sample shows creating a custom encoder which wraps JavaScriptEncoder.Default, but which also allows the specific emoji '😃' (U+1F603) to pass through unescaped. The program prints Hello there! 😃\uD83D\uDE3A to the console. This sample may be sufficient to unblock you while we work with you to investigate whether we should make in-box changes to support this scenario.

using System;
using System.Buffers;
using System.Text;
using System.Text.Encodings.Web;

namespace MyApp
{
    class Program
    {
        static void Main(string[] args)
        {
            MyEncoder encoder = new MyEncoder();

            string input = "Hello there! 😃😺";
            string output = encoder.Encode(input);

            Console.WriteLine(output);
        }
    }

    // Any Encoder must override the four methods shown below.
    unsafe class MyEncoder : JavaScriptEncoder
    {
        public override int MaxOutputCharactersPerInputCharacter => JavaScriptEncoder.Default.MaxOutputCharactersPerInputCharacter;

        public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
        {
            ReadOnlySpan<char> input = new ReadOnlySpan<char>(text, textLength);
            int idx = 0;

            // Enumerate until we're out of data or saw invalid input
            while (Rune.DecodeFromUtf16(input.Slice(idx), out Rune result, out int charsConsumed) == OperationStatus.Done)
            {
                if (WillEncode(result.Value)) { break; } // found a char that needs to be escaped
                idx += charsConsumed;
            }

            if (idx == input.Length) { return -1; } // walked entire input without finding a char which needs escaping
            return idx;
        }

        public override bool WillEncode(int unicodeScalar)
        {
            // Allow U+1F603 SMILING FACE WITH OPEN MOUTH ('😃'),
            // and for all other chars defer to the default escaper.

            if (unicodeScalar == 0x1F603) { return false; } // does not require escaping
            else { return JavaScriptEncoder.Default.WillEncode(unicodeScalar); }
        }

        public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength, out int numberOfCharactersWritten)
        {
            // For anything that needs to be escaped, defer to the default escaper.
            return JavaScriptEncoder.Default.TryEncodeUnicodeScalar(unicodeScalar, buffer, bufferLength, out numberOfCharactersWritten);
        }
    }
}
RamjotSingh commented 3 years ago

@GrabYourPitchforks Fair comment. I suppose the real ask is to give a Newtonsoft.Json compatible encoder.

I brought up Emojis specifically coz it is something I saw break even after using Relaxed encoder. You are right and I am in no way claiming that fixing this is enough.

Again I bring up Newtonsoft.Json because that is the most popular C# Json processor.


Thanks for the piece of code. A follow up question on it. If I was to go around and add support for all emojis, would I need to list down every emoji in that map or is there a quicker version?

GrabYourPitchforks commented 3 years ago

If I was to go around and add support for all emojis, would I need to list down every emoji in that map or is there a quicker version?

The sample above just shows how to override WillEncode, but its implementation is up to you. You could have a giant switch statement, a Hashset<int> that contains all emoji chars, a BitArray lookup, or any other data structure appropriate to your scenario.

GrabYourPitchforks commented 3 years ago

Area owners (@layomia @eiriktsarpalis @tarekgh):

The code sample I pasted above has significant boilerplate. At its basic level, this could be simplified through APIs like the below.

public interface ITextEncoderFilter
{
    bool IsAllowedUnescaped(Rune value);
}

public class JavaScriptEncoder /* and other TextEncoder-derived types */
{
    public static JavaScriptEncoder CreateFromFilter(ITextEncoderFilter filter);
}

This doesn't let you control how characters are escaped. For example, the REVERSE SOLIDUS character \ would be escaped as \u005C instead of \\, since that's how the default implementation escapes it. But it would allow you to control which characters are escaped. Would need to figure out what happens if the filter says "oh, sure, " doesn't need to be escaped!" or if the implementation simply bypasses the filter for those characters. And would need to figure out if there should a simpler ability to provide custom escaping logic for these chars. (That significantly complicates things since there's also the vestigial MaxOutputCharsPerInputChar property.)

RamjotSingh commented 3 years ago

For rest of the stuff we are already using Relaxed encoder so this will be building on top of that. That already takes care of encoding \ as \\. I will lay the code you have above on top of that.

RamjotSingh commented 3 years ago

@GrabYourPitchforks I have written the encoder which accept Emojis (builds on top of UnsafeRelaxedEncoder) https://github.com/RamjotSingh/System.Text.Json.Extensions/blob/main/RamjotSingh.System.Text.Json.Extensions/Encoders/NewtonsoftJsonCompatibleEncoder.cs

I wrote some tests and things look to be working fine. Would you be able to take a look and see if the encoder makes sense?

I had to copy Rune and few more files from the runtime repo since they are available only in Net3+ while I needed to target NetStd2.0

GrabYourPitchforks commented 3 years ago

@RamjotSingh The encoder logic looks fine. If you need a netstandard2.0-compatible version of the Rune sources, you can find one at https://github.com/dotnet/runtime/blob/main/src/libraries/System.Text.Encodings.Web/src/Polyfills/System.Text.Rune.netstandard20.cs. (This is the same file used by the netstandard2.0-compatible encoders package.)

I didn't check the tool which scrapes https://unicode.org/emoji/charts/full-emoji-list.html and generates a list of allowed emoji chars.

RamjotSingh commented 3 years ago

The code for the scraper is at https://github.com/RamjotSingh/EmojiNet/blob/026453f0a643299f7b1152d57e46185fe1380e19/EmojiNet/EmojiSiteProcessor/Program.cs#L45

It's basically downloading the unicode website and pulling all the U+ codes from it and constructing a giant switch statement with all the int values (emojis with combination of U+ values are split and each individual value considered a valid scalar).

Here is the file it generates https://github.com/RamjotSingh/EmojiNet/blob/main/EmojiNet/RamjotSingh.EmojiNet/UnicodeScalarHelper.cs

eiriktsarpalis commented 3 years ago

Moving to future, we could perhaps look into providing an API based on @GrabYourPitchforks's sketch design.

jeffhandley commented 2 years ago

@dotnet/area-system-text-json @GrabYourPitchforks I'm inclined to close this issue and rely on the ability to accomplish this through a custom encoder. I'll go ahead and close it, but please reopen it if you disagree.