Closed: RamjotSingh closed this issue 2 years ago.
Tagging subscribers to this area: @eiriktsarpalis, @layomia See info in area-owners.md if you want to be subscribed.
Author: | RamjotSingh |
---|---|
Assignees: | - |
Labels: | `area-System.Text.Json`, `untriaged` |
Milestone: | - |
Tagging subscribers to this area: @tarekgh, @eiriktsarpalis, @layomia See info in area-owners.md if you want to be subscribed.
Author: | RamjotSingh |
---|---|
Assignees: | - |
Labels: | `area-System.Text.Encodings.Web`, `untriaged` |
Milestone: | - |
Related: https://github.com/dotnet/runtime/issues/37364, and this has also come up before in issues like https://github.com/dotnet/runtime/issues/42847.
The tricky part here is to figure out exactly what behavior is appropriate to expose to developers. The title of this issue says that emojis shouldn't be encoded, but the text of the issue says "will make a behavior change which we can't expect clients to adapt to." This strongly implies that this issue isn't really about emoji characters, but about having an output that is byte-for-byte identical to Newtonsoft.
In that case, is the request really "allow all chars except those that expressly don't require escaping per the RFC?" We could certainly ship such an API, but even that has several problems. First, per https://github.com/dotnet/runtime/issues/1564#issuecomment-504780719, even the world's most heavily used web services find such a policy too relaxed and too prone to misuse. Second, because such an encoder would utilize a deny list instead of an allow list, using it in production software would be a violation of Microsoft's SDL policy. (Last I checked, SDL mandated use of allow list-based encoders unless you seek a policy exception.)
So maybe the answer is "well, extend only the unsafe relaxed encoder with support for emoji chars." Which at first glance would work, but this would carry a heavy performance penalty, in terms of both CPU time and memory utilization. For reference, the unsafe relaxed encoder uses an 8 KB lookup table (64 kilobits, one bit for each char in the BMP) to know which chars are valid to go out on the wire unescaped. If we extend this to allow the full range of Unicode chars, this lookup table blows up to 139 KB in size (and increases the size of the DLL correspondingly). Maybe the answer is to use a multi-level lookup table like what CharUnicodeInfo does, but now you're trading off memory footprint for increased CPU time and cache thrashing.
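For concreteness, the table sizes quoted above fall out of simple arithmetic, one bit per candidate character or scalar:

```csharp
using System;

class LookupTableMath
{
    static void Main()
    {
        // One bit for each char in the Basic Multilingual Plane (U+0000..U+FFFF):
        int bmpChars = 0x10000;                  // 65,536 chars
        Console.WriteLine(bmpChars / 8 / 1024);  // 8 -> the 8 KB (64 kilobit) table

        // One bit for each scalar in the full Unicode range (U+0000..U+10FFFF):
        int allScalars = 0x110000;               // 1,114,112 scalars
        Console.WriteLine(allScalars / 8);       // 139264 bytes, i.e. ~139 KB
    }
}
```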
I'm not trying to minimize the concern. Just responding that the problem is much more complex than it first appears. Often the requestor needs more than just "allow emojis", or they're trying to work around some other bug that simple emoji support won't fix, or there's some other constraint that complicates our ability to provide a solution that works in their specific environment.
All of that said, the existing encoder infrastructure is extensible enough to allow nearly any scenario to work. The following sample creates a custom encoder which wraps `JavaScriptEncoder.Default`, but which also allows the specific emoji '😃' (U+1F603) to pass through unescaped. The program prints `Hello there! 😃\uD83D\uDE3A` to the console. This sample may be sufficient to unblock you while we work with you to investigate whether we should make in-box changes to support this scenario.
```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Encodings.Web;

namespace MyApp
{
    class Program
    {
        static void Main(string[] args)
        {
            MyEncoder encoder = new MyEncoder();
            string input = "Hello there! 😃😺";
            string output = encoder.Encode(input);
            Console.WriteLine(output);
        }
    }

    // Any encoder must override the four members shown below.
    unsafe class MyEncoder : JavaScriptEncoder
    {
        public override int MaxOutputCharactersPerInputCharacter => JavaScriptEncoder.Default.MaxOutputCharactersPerInputCharacter;

        public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
        {
            ReadOnlySpan<char> input = new ReadOnlySpan<char>(text, textLength);
            int idx = 0;

            // Enumerate until we're out of data or saw invalid input.
            while (Rune.DecodeFromUtf16(input.Slice(idx), out Rune result, out int charsConsumed) == OperationStatus.Done)
            {
                if (WillEncode(result.Value)) { break; } // found a char that needs to be escaped
                idx += charsConsumed;
            }

            if (idx == input.Length) { return -1; } // walked entire input without finding a char which needs escaping
            return idx;
        }

        public override bool WillEncode(int unicodeScalar)
        {
            // Allow U+1F603 SMILING FACE WITH OPEN MOUTH ('😃'),
            // and for all other chars defer to the default escaper.
            if (unicodeScalar == 0x1F603) { return false; } // does not require escaping
            else { return JavaScriptEncoder.Default.WillEncode(unicodeScalar); }
        }

        public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength, out int numberOfCharactersWritten)
        {
            // For anything that needs to be escaped, defer to the default escaper.
            return JavaScriptEncoder.Default.TryEncodeUnicodeScalar(unicodeScalar, buffer, bufferLength, out numberOfCharactersWritten);
        }
    }
}
```
@GrabYourPitchforks Fair comment. I suppose the real ask is to provide a Newtonsoft.Json-compatible encoder.
I brought up emojis specifically because they are something I saw break even after using the Relaxed encoder. You are right, and I am in no way claiming that fixing emojis alone is enough.
Again, I bring up Newtonsoft.Json because it is the most popular C# JSON processor.
Thanks for the piece of code. A follow up question on it. If I was to go around and add support for all emojis, would I need to list down every emoji in that map or is there a quicker version?
If I was to go around and add support for all emojis, would I need to list down every emoji in that map or is there a quicker version?
The sample above just shows how to override `WillEncode`, but its implementation is up to you. You could use a giant switch statement, a `HashSet<int>` that contains all emoji chars, a `BitArray` lookup, or any other data structure appropriate to your scenario.
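For example, a `HashSet<int>`-based allow list could replace the single-emoji check in the sample's `WillEncode` override. The helper name and set contents below are illustrative, not a complete emoji list:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Encodings.Web;

// Hypothetical helper; this logic would slot into the WillEncode override
// of a custom encoder like the MyEncoder sample above.
static class EmojiAllowList
{
    // Populate from whatever emoji data source fits your scenario.
    private static readonly HashSet<int> s_allowed = new HashSet<int>
    {
        0x1F603, // 😃 SMILING FACE WITH OPEN MOUTH
        0x1F63A, // 😺 SMILING CAT FACE WITH OPEN MOUTH
        0x1F4C8, // 📈 CHART WITH UPWARDS TREND
    };

    public static bool WillEncode(int unicodeScalar)
    {
        if (s_allowed.Contains(unicodeScalar)) { return false; } // pass through unescaped
        return JavaScriptEncoder.Default.WillEncode(unicodeScalar); // defer otherwise
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(EmojiAllowList.WillEncode(0x1F603)); // False (allowed through)
        Console.WriteLine(EmojiAllowList.WillEncode(0x1F923)); // True (🤣 is not in the set)
    }
}
```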
Area owners (@layomia @eiriktsarpalis @tarekgh):
The code sample I pasted above has significant boilerplate. At a basic level, this could be simplified through APIs like the following.
```csharp
public interface ITextEncoderFilter
{
    bool IsAllowedUnescaped(Rune value);
}

public class JavaScriptEncoder /* and other TextEncoder-derived types */
{
    public static JavaScriptEncoder CreateFromFilter(ITextEncoderFilter filter);
}
```
This doesn't let you control how characters are escaped. For example, the REVERSE SOLIDUS character `\` would be escaped as `\u005C` instead of `\\`, since that's how the default implementation escapes it. But it would allow you to control which characters are escaped. We would need to figure out what happens if the filter says "oh, sure, `"` doesn't need to be escaped!", or whether the implementation should simply bypass the filter for those characters. And we would need to figure out whether there should be a simpler way to provide custom escaping logic for these chars. (That significantly complicates things, since there's also the vestigial `MaxOutputCharactersPerInputCharacter` property.)
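To make the sketch concrete, usage of the proposed API might look like the following. This is hypothetical: neither `ITextEncoderFilter` nor `CreateFromFilter` exists in the box today, and the exact filter semantics are exactly what would need to be designed.

```csharp
// Hypothetical usage of the sketched API above (not shipped, not compilable today).
class EmojiFilter : ITextEncoderFilter
{
    public bool IsAllowedUnescaped(Rune value)
        => value.Value == 0x1F603                               // allow 😃 through unescaped
        || !JavaScriptEncoder.Default.WillEncode(value.Value);  // otherwise match Default

    // JavaScriptEncoder encoder = JavaScriptEncoder.CreateFromFilter(new EmojiFilter());
}
```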
For the rest of the characters we are already using the Relaxed encoder, so this will build on top of that. That already takes care of encoding `\` as `\\`. I will layer the code you have above on top of that.
@GrabYourPitchforks I have written the encoder which accept Emojis (builds on top of UnsafeRelaxedEncoder) https://github.com/RamjotSingh/System.Text.Json.Extensions/blob/main/RamjotSingh.System.Text.Json.Extensions/Encoders/NewtonsoftJsonCompatibleEncoder.cs
I wrote some tests and things look to be working fine. Would you be able to take a look and see if the encoder makes sense?
I had to copy Rune and a few more files from the runtime repo, since they are available only in .NET Core 3.0+ while I needed to target .NET Standard 2.0.
@RamjotSingh The encoder logic looks fine. If you need a netstandard2.0-compatible version of the Rune sources, you can find one at https://github.com/dotnet/runtime/blob/main/src/libraries/System.Text.Encodings.Web/src/Polyfills/System.Text.Rune.netstandard20.cs. (This is the same file used by the netstandard2.0-compatible encoders package.)
I didn't check the tool which scrapes https://unicode.org/emoji/charts/full-emoji-list.html and generates a list of allowed emoji chars.
The code for the scraper is at https://github.com/RamjotSingh/EmojiNet/blob/026453f0a643299f7b1152d57e46185fe1380e19/EmojiNet/EmojiSiteProcessor/Program.cs#L45
It basically downloads the Unicode website and pulls all the U+ codes from it, then constructs a giant switch statement with all the int values (emojis composed of multiple U+ values are split, and each individual value is considered a valid scalar).
Here is the file it generates https://github.com/RamjotSingh/EmojiNet/blob/main/EmojiNet/RamjotSingh.EmojiNet/UnicodeScalarHelper.cs
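Assuming the chart's code-point column uses the `U+1F468 U+200D U+1F469` format, the splitting step the scraper performs can be sketched as follows (the input format and helper names here are assumptions, not the scraper's actual code):

```csharp
using System;
using System.Globalization;
using System.Linq;

class ScalarSplitter
{
    // Splits a multi-scalar emoji sequence into its individual scalar values,
    // mirroring the "each individual value is considered a valid scalar"
    // approach described above.
    static int[] ParseScalars(string codePoints) =>
        codePoints.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                  .Select(cp => int.Parse(cp.Substring(2), NumberStyles.HexNumber))
                  .ToArray();

    static void Main()
    {
        // A joined sequence: man (U+1F468) + zero-width joiner + woman (U+1F469)
        int[] scalars = ParseScalars("U+1F468 U+200D U+1F469");
        Console.WriteLine(string.Join(" ", scalars.Select(s => $"0x{s:X}")));
        // 0x1F468 0x200D 0x1F469
    }
}
```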
Moving forward, we could perhaps look into providing an API based on @GrabYourPitchforks's sketch design.
@dotnet/area-system-text-json @GrabYourPitchforks I'm inclined to close this issue and rely on the ability to accomplish this through a custom encoder. I'll go ahead and close it, but please reopen it if you disagree.
When serializing JSON using System.Text.Json, emojis get encoded to `\u` escape sequences. For example, 📈 is serialized to `\uD83D\uDCC8`.
Typically you can stop System.Text.Json from enforcing strict rules by using the Relaxed encoder. This however is not true for emojis, which are always encoded. We have clients who have been using this for years and things have worked, but moving to System.Text.Json would make a behavior change which we can't expect clients to adapt to (we have an API interface with millions of developers building on it).
I would expect the Relaxed encoder not to force-encode emojis, or to offer an option to configure the encoder not to encode emojis.
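For reference, the reported behavior is straightforward to reproduce. Even `UnsafeRelaxedJsonEscaping` escapes the character, because its allow list only covers the Basic Multilingual Plane and emojis live outside it:

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;

class Repro
{
    static void Main()
    {
        var relaxed = new JsonSerializerOptions
        {
            Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
        };

        // 📈 is U+1F4C8, which UTF-16 represents as the surrogate pair D83D DCC8.
        Console.WriteLine(JsonSerializer.Serialize("📈", relaxed));
        // "\uD83D\uDCC8"
    }
}
```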