
Default JSON escaping is biased against other languages #86805

Closed: davidmatson closed this issue 1 year ago

davidmatson commented 1 year ago

Description

Default JSON escaping doesn't escape English but does escape other languages.

Reproduction Steps

using System.Text;
using System.Text.Encodings.Web;
using System.Text.Json;

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        List<string> data = new List<string>();
        data.Add("English");
        data.Add("日本語");
        data.Add("עברית");
        data.Add("ქართული ენა");
        data.Add("中文");
        data.Add("𑄌𑄋𑄴𑄟𑄳𑄦 𑄃𑄧𑄏𑄛𑄖𑄴");

        string actual = JsonSerializer.Serialize(data);
        Console.WriteLine(actual);
    }
}

Expected behavior

All languages are treated equally.

["English","日本語","עברית","ქართული ენა","中文","𑄌𑄋𑄴𑄟𑄳𑄦 𑄃𑄧𑄏𑄛𑄖𑄴"]

Actual behavior

Some languages are not.

["English","\u65E5\u672C\u8A9E","\u05E2\u05D1\u05E8\u05D9\u05EA","\u10E5\u10D0\u10E0\u10D7\u10E3\u10DA\u10D8 \u10D4\u10DC\u10D0","\u4E2D\u6587","\uD804\uDD0C\uD804\uDD0B\uD804\uDD34\uD804\uDD1F\uD804\uDD33\uD804\uDD26 \uD804\uDD03\uD804\uDD27\uD804\uDD0F\uD804\uDD1B\uD804\uDD16\uD804\uDD34"]

For English speakers, try using the following encoder for a while to get a feel for the significance of this impact on other languages:

using System.Text;
using System.Text.Encodings.Web;
using System.Text.Json;

class MaximalJsonEncoder : JavaScriptEncoder
{
    public override int MaxOutputCharactersPerInputCharacter => 12;

    public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
    {
        if (textLength == 0)
        {
            return -1;
        }

        return 0;
    }

    public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength,
        out int numberOfCharactersWritten)
    {
        Rune rune = new Rune(unicodeScalar);
        const int unicodeEscapeLength = 6; // "\\u0000".Length
        int lengthNeeded = rune.Utf16SequenceLength * unicodeEscapeLength;

        if (bufferLength < lengthNeeded)
        {
            numberOfCharactersWritten = 0;
            return false;
        }

        const int maxCharsPerScalar = 2;
        char* utf16Buffer = stackalloc char[maxCharsPerScalar];
        int utf16Length = rune.EncodeToUtf16(new Span<char>(utf16Buffer, maxCharsPerScalar));
        Span<char> span = new Span<char>(buffer, bufferLength);

        for (int index = 0; index < utf16Length; ++index)
        {
            char toEncode = utf16Buffer[index];
            span[0] = '\\';
            span[1] = 'u';
            span[2] = ToHexDigit((toEncode & 0xf000) >> 12);
            span[3] = ToHexDigit((toEncode & 0xf00) >> 8);
            span[4] = ToHexDigit((toEncode & 0xf0) >> 4);
            span[5] = ToHexDigit(toEncode & 0xf);
            span = span.Slice(unicodeEscapeLength);
        }

        numberOfCharactersWritten = lengthNeeded;
        return true;
    }

    public override bool WillEncode(int unicodeScalar)
    {
        return true;
    }

    static char ToHexDigit(int value)
    {
        if (value > 0xf)
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }

        if (value < 0xa) // 0-9 map to '0'-'9'; 0xa-0xf fall through to 'a'-'f'
        {
            return (char)(value + '0');
        }
        else
        {
            return (char)(value - 0xa + 'a');
        }
    }
}

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        Dictionary<string, string> data = new Dictionary<string, string>();
        data.Add("text", "Hello, World!");

        string actual = JsonSerializer.Serialize(data, new JsonSerializerOptions { Encoder = new MaximalJsonEncoder() });
        Console.WriteLine(actual);
    }
}

Regression?

No response

Known Workarounds

A relaxed encoder helps with some languages but not all:

        string actual = JsonSerializer.Serialize(data, new JsonSerializerOptions { Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping });

A custom encoder is currently needed to handle all languages equally.
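
For reference, below is a minimal sketch of such a custom encoder. It is illustrative only, not production-ready: it escapes just the characters the JSON grammar requires (the quotation mark, the backslash, and control characters), and it does not deal with pitfalls such as unpaired surrogates in the input, which a correct encoder must handle.

class SpecOnlyJsonEncoder : JavaScriptEncoder
{
    // Worst case is one six-character "\uXXXX" escape per input character.
    public override int MaxOutputCharactersPerInputCharacter => 6;

    public override bool WillEncode(int unicodeScalar)
    {
        // Escape only what the JSON grammar requires inside a string.
        return unicodeScalar < 0x20 || unicodeScalar == '"' || unicodeScalar == '\\';
    }

    public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
    {
        for (int index = 0; index < textLength; ++index)
        {
            if (WillEncode(text[index]))
            {
                return index;
            }
        }

        return -1;
    }

    public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength,
        out int numberOfCharactersWritten)
    {
        // Only ever called for the BMP characters matched above, so a single
        // \uXXXX sequence is enough.
        const string hexDigits = "0123456789abcdef";
        const int unicodeEscapeLength = 6;

        if (bufferLength < unicodeEscapeLength)
        {
            numberOfCharactersWritten = 0;
            return false;
        }

        buffer[0] = '\\';
        buffer[1] = 'u';
        buffer[2] = hexDigits[(unicodeScalar >> 12) & 0xf];
        buffer[3] = hexDigits[(unicodeScalar >> 8) & 0xf];
        buffer[4] = hexDigits[(unicodeScalar >> 4) & 0xf];
        buffer[5] = hexDigits[unicodeScalar & 0xf];
        numberOfCharactersWritten = unicodeEscapeLength;
        return true;
    }
}

It plugs in the same way as the workaround above:

        string actual = JsonSerializer.Serialize(data, new JsonSerializerOptions { Encoder = new SpecOnlyJsonEncoder() });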

Configuration

No response

Other information

It could be easy to think that encoding doesn't matter, because it's the same semantic meaning from a programmatic standpoint. But consider the following JSON files:

{"text":"Hello, World!"}

vs.

{"\u0074\u0065\u0078\u0074":"\u0048\u0065\u006c\u006c\u006f\u002c\u0020\u0057\u006f\u0072\u006c\u0064\u0021"}

Even though a parser will treat them as meaning the same thing, these are clearly not equally good JSON files, for two key reasons:

  1. Performance: Each text character is 6x (!) as long as before.
  2. Readability: Unless you have binary-to-Unicode memorized, the second file is unreadable.

.NET should treat other languages equally, and not relegate some to be treated like the second example above.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.

rhuijben commented 1 year ago

I think you should measure the difference in UTF-8 as used on the web, instead of 16-bit or 32-bit Unicode... That would greatly reduce the encoding difference.

And if you happen to use plain ASCII instead of UTF-8, the requested form of not escaping would not be supported at all. My guess would be that this is the reason things work this way right now.

Gnbrkm41 commented 1 year ago

Default JSON escaping doesn't escape English but does escape other languages.

Right, but I think to say "is biased against other languages" because it only leaves English alone is misleading.

ASCII is used as a basis for a lot of different encoding schemes and is therefore much less likely to be misinterpreted even if the encoding scheme is mixed up. Furthermore, its unique characteristic of having the MSB always be zero makes it easy for people to detect that a string consists only of ASCII characters.

Decoding errors due to mismatched encodings are actually something you do observe at times, especially on Windows, which uses a different default encoding based on the selected locale. This is referred to as 'mojibake', literally "broken letters" in Japanese (or here in Korea, '뷁어', literally '뷁 language', because that character occurs quite commonly if you misinterpret SHIFT_JIS-encoded strings, the default encoding for Japanese Windows, as CP949, the default encoding for Korean Windows). You may encounter this problem if you transfer text files encoded in a non-Unicode encoding and then open them on a different machine that uses a different default encoding.

While it is true that UTF-8 is becoming the de-facto standard for encoding anything, making the output string consist only of ASCII helps avoid the problem altogether. It is an unfortunate side-effect of the fact that, for a good while, we did not have a single standard that ruled over all the fragmented encodings for the non-English languages.

The basis for making this aggressive escaping the default behaviour appears to be 'defence in depth' - though I'm not sure whether escaping non-ASCII characters has much to do with that (the only attacks I can find regarding JSON escaping are XSS attacks in HTML). However, for a 'default behaviour', which is very hard to change, I do think it is better to be safe than sorry.

A relaxed encoder helps with some languages but not all

From what I see, most languages should be fine when using JavaScriptEncoder.UnsafeRelaxedJsonEscaping, as it allows all characters within the BMP (0x0000-0xFFFF) range to be written unescaped, except the ones on the blocklist. The problematic ones are those outside the BMP range (which I believe includes some Arabic characters and emoji) - in that case you would indeed need to roll your own JavaScriptEncoder. There is an issue open regarding this (https://github.com/dotnet/runtime/issues/42847); however, I imagine allowing this by default would mean some performance degradation during serialization, as you'd need to start detecting surrogate pairs and handling them appropriately.
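
To illustrate the BMP boundary (the escaped output below matches the actual-behaviour sample in the issue; the unescaped output is what I'd expect per the above):

var relaxed = new JsonSerializerOptions { Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping };

// BMP text passes through unescaped:
Console.WriteLine(JsonSerializer.Serialize("日本語", relaxed)); // "日本語"

// A non-BMP character (from the issue's Chakma sample) is still escaped as a surrogate pair:
Console.WriteLine(JsonSerializer.Serialize("𑄌", relaxed));     // "\uD804\uDD0C"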

I am not suggesting that this is not a problem that needs solving; I do think it would be very nice to leave these unescaped as well for easier debugging without having to juggle the JSON string to an un-escape tool or something. But I think it deserves a separate issue if we really want that (or could revive the issue above as well perhaps).

It could be easy to think that encoding doesn't matter, because it's the same semantic meaning from a programmatic standpoint....

In my opinion, this is what matters the most. JSON is intended for machine-consumption, and the less we make it fragile the better, especially if it is a default behaviour.

... these are clearly not equally good JSON files, for two key reasons:

Performance: Each text character is 6x (!) as long as before.

Right if you count the characters, but not if you consider the actual size of the encoded characters. With UTF-8, a non-ASCII character takes around 2 to 4 'code units' (in other words, bytes), depending on where the character is located within the Unicode table. With CJK characters that would be 3 code units, compared to 6 code units when escaped, so really you're looking at a 2x size difference; for things outside of the BMP it would be 4 code units unescaped and 12 code units escaped (two \uXXXX escapes for the surrogate pair). It would be a 6x difference size-wise only if you were to escape ASCII characters as well, which is technically allowed, but we don't do that by default and no sane person would, so it's pointless to argue about that case.
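
To put rough numbers on that (the byte counts below are what Encoding.UTF8.GetByteCount returns for these exact strings):

using System.Text;

// "日本語" is 3 characters at 3 UTF-8 bytes each:
Console.WriteLine(Encoding.UTF8.GetByteCount("日本語"));               // 9

// The escaped form is 18 ASCII characters at 1 byte each - a 2x difference:
Console.WriteLine(Encoding.UTF8.GetByteCount(@"\u65E5\u672C\u8A9E"));  // 18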

Readability: Unless you have binary-to-Unicode memorized, the second file is unreadable.

Right, but how often would you need to read the raw JSON file, apart from during development / debugging? We could also argue we should enable indentation by default if we were to consider readability. Again, for a 'default' behaviour covering the widest demand, I think it is acceptable to produce escaped strings.

All those points aside, it has been 4 major releases since S.T.Json was first released. It would be a major breaking change to 'unbias' the default serialisation behaviour, as it would silently break the code of everyone depending on the fact that S.T.Json produces an escaped JSON string by default (which is the worst kind of breaking change). This is the same reason why changing *.Parse or other string comparison methods from being culture-dependent by default to being culture-insensitive by default is impossible now.

rjgotten commented 1 year ago

JSON must always be sent as Unicode. This is part of the format's specification under chapter 2 Conformance.

So both of the following are irrelevant:

And if you happen to use plain ASCII instead of UTF-8, the requested form of not escaping would not be supported at all. My guess would be that this is the reason things work this way right now.

ASCII is used as a basis for a lot of different encoding schemes and is therefore much less likely to be misinterpreted even if the encoding scheme is mixed up. Furthermore, its unique characteristic of having the MSB always be zero makes it easy for people to detect that a string consists only of ASCII characters.

Gnbrkm41 commented 1 year ago

JSON must always be sent as Unicode. This is part of the format's specification under chapter 2 Conformance.

Not true, for two reasons:

  1. The JSON specification concerns only the syntax of the format, not how it is stored / transmitted. Direct quote from 2nd Edition of ECMA-404, 'The JSON data interchange syntax', Introduction:

    It is expected that other standards will refer to this one, strictly adhering to the JSON syntax, while imposing semantics interpretation and restrictions on various encoding details. Such standards may require specific behaviours. JSON itself specifies no behaviour.

  2. Even with that, the specification does not specify that the text must be encoded in Unicode. Direct quote from the same JSON specification document, 2. Conformance:

    A conforming JSON text is a sequence of Unicode code points that strictly conforms to the JSON grammar defined by this specification.

A "Unicode code point" means does not mean it has to be encoded in any of the 'Unicode Encoding Forms' (basically, UTF-8 / UTF-16 / UTF-32). Direct quote from the Unicode Glossary:

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type.

While confusing, as the term 'Unicode' is used interchangeably for a multitude of things, what it really means is that only values that have a place in the Unicode codespace are allowed in valid JSON. It does not imply that a code point not assigned to a character is disallowed (the implication of which would be that every addition of characters in a newer Unicode revision makes every single spec-conforming library non-conforming until it adds support for the newer revision! - also see '9. String' of ECMA-404, which mentions that the JSON standard allows unassigned code points within strings), nor does it limit use to only one of the 'Unicode Encoding Forms'. Yes, for practical reasons, the use of UTF-8 / UTF-16 will be the most prevalent; however, this does not stop you from using other encodings, as long as you can map characters encoded in a different encoding to Unicode code points. (I hate the Unicode glossary.)

Gnbrkm41 commented 1 year ago

Had a conversation with @GrabYourPitchforks, who gave some interesting insights on some of the reasons why these decisions were made:

  1. There still are JSON deserialisers out there that do not conform to the JSON spec, for example ones that perform charset sniffing (looking at the whole payload to determine the encoding of the text) or treat line break chars as string terminators. Because we're not living in an ideal world, for clients that you cannot necessarily trust to behave correctly, for typical use cases it is safer to aggressively escape non-ASCII characters "by default" and let the users configure more lax escaping behaviour at their discretion.

If you know that the client you're talking to handles these cases correctly, you can configure the escaping behaviour to not escape any of the characters. For example, ASP.NET can get away with turning off most of the escaping as they can be sure the client is interpreting things correctly by means of Accept request headers and Content-Type response headers.

Apparently many of the "Big Tech" services such as Facebook / Twitter / Google / Instagram do this as well, just in case something bad does happen.

  2. Microsoft's internal security guidelines forbid them from shipping encoders that are based on blocklists, but instead ask them to implement encoders using allow-lists. I imagine this is because if, in the future, things change in a way that new / old characters that did not pose any problem in the past can now be used for exploits, software that still depends on older versions of the library will be vulnerable until it is upgraded to a newer version; so they carefully consider the cases and put characters on the allowlist only if they are certain to be okay.

The implication of this is that if we wanted to write characters outside of the BMP without escaping, this allow-list would need to include the entirety of the supplementary planes (16x the size of the BMP), which would get pretty huge and would lead to a considerable increase in the assembly size of the library, for a feature without enough demand.

davidmatson commented 1 year ago

There still are JSON deserialisers out there that do not conform to the JSON spec, for example ones that perform charset sniffing (looking at the whole payload to determine the encoding of the text) or treat line break chars as string terminators. Because we're not living in an ideal world, for clients that you cannot necessarily trust to behave correctly, for typical use cases it is safer to aggressively escape non-ASCII characters "by default" and let the users configure more lax escaping behaviour at their discretion.

Does this give any reason why languages other than English need to be treated as "unsafe" by default?

Microsoft's internal security guidelines forbid them from shipping encoders that are based on blocklists, but instead ask them to implement encoders using allow-lists.

Is there a reason there has to be a list at all? Can all JSON characters that aren't required by the RFC to be encoded not be encoded?

davidmatson commented 1 year ago

The basis for making this aggressive escaping the default behaviour appears to be 'defence in depth' - though I'm not sure whether escaping non-ASCII characters has much to do with that (the only attacks I can find regarding JSON escaping are XSS attacks in HTML). However, for a 'default behaviour', which is very hard to change, I do think it is better to be safe than sorry.

Yeah, the only reason I've seen is to help with devs who put JSON in HTML and forget to HTML-escape. Saying other languages aren't as first-class as English doesn't seem justified by that scenario.

I am not suggesting that this is not a problem that needs solving; I do think it would be very nice to leave these unescaped as well for easier debugging without having to juggle the JSON string to an un-escape tool or something. But I think it deserves a separate issue if we really want that (or could revive the issue above as well perhaps).

I think that's exactly what this issue is, though - a separate issue to treat other languages as first-class, just as we do English.

Right, but how often would you need to read the raw JSON file, apart from during development / debugging? We could also argue we should enable indentation by default if we were to consider readability. Again, for a 'default' behaviour covering the widest demand, I think it is acceptable to produce escaped strings.

I'd suggest trying to use the above MaximalJsonEncoder to feel the impact. A major reason to use JSON is that it's human readable, unless you use an encoder that's biased against other languages and happen to use one of those languages.

All those points aside, it has been 4 major releases since S.T.Json was first released. It would be a major breaking change to 'unbias' the default serialisation behaviour, as it would silently break the code of everyone depending on the fact that S.T.Json produces an escaped JSON string by default (which is the worst kind of breaking change). This is the same reason why changing *.Parse or other string comparison methods from being culture-dependent by default to being culture-insensitive by default is impossible now.

As mentioned in other issues, folks keep using Newtonsoft rather than switching to this API because of these issues. More significantly, I don't think saying "we've been biased against other languages in the past, and the complaints haven't registered as significant, so it's fine to keep doing it" is the right call here.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-text-encodings-web See info in area-owners.md if you want to be subscribed.

eiriktsarpalis commented 1 year ago

As mentioned in other issues, folks keep using Newtonsoft rather than switching to this API because of these issues.

In order to understand your requirements better, is this issue proposing something different compared to what #42847 is tracking?

Gnbrkm41 commented 1 year ago

Does this give any reason why languages other than English need to be treated as "unsafe" by default?

... More significantly, I don't think saying "we've been biased against other languages in the past, and the complaints haven't registered as significant, so it's fine to keep doing it" is the right call here.

... unless you use an encoder that's biased against other languages and happen to use one of those languages.

Is there a reason there has to be a list at all? Can all JSON characters that aren't required by the RFC to be encoded not be encoded?

Frankly, I'm not sure why you keep bringing 'language bias' into the discussion. It has nothing to do with the language. It is precisely just the encoding: how characters from those other languages are encoded.

Yes, we escape every non-ASCII character. Yes, ASCII only has English letters in it. Does that mean we escape every non-ASCII character because 'we are biased against other languages'? No. It is because having everything encoded in ASCII minimises any chance of anything potentially going wrong. People make all sorts of weird assumptions about characters and encodings, which ends up with misbehaving programs due to their wrong assumptions.

This kind of mishandling happens all the time with non-English characters everywhere, and is not limited to the JSON space. One similar example is path handling: programs that don't work when the path includes any spaces / non-ASCII characters. It's so common that when you get some program written abroad and it doesn't work, one of the suggested solutions, apart from "have you tried turning it off and on again", is "make sure the path doesn't include any Hangul". I have no clue how people manage to get this wrong all the time, yet it happens.

What happens if the JSON-encoded string goes through some generic text pre-processor before the JSON gets transmitted to the client, and that pre-processor doesn't know how to process non-ASCII characters properly and ends up mangling them? The same thing as with path handling. It shouldn't happen, I don't even know how they get it wrong, but because people make wrong assumptions it happens and things end up not working correctly. The wrong assumption doesn't have to be in your code - it could be in someone else's, and that's what we're trying to protect against by escaping by default.

If anything, I would argue that escaping by default protects us from the impact of the "biases" that other people's code has against non-English languages, which result in non-ASCII characters getting mishandled.

Imagine this situation: you want to communicate with an API that takes JSON text as the body. You use a hypothetical default escaping behaviour of not escaping anything, because "what can possibly go wrong", only to find out that somewhere in the client's code they manage to mishandle non-ASCII characters, and your request is now broken. Now you have to either A: make them fix their mistake, or B: escape your JSON so it gets handled even by their flawed system. The opposite (client code having problems handling escaped, all-ASCII text), however, is much less likely to happen, and escaping is therefore the better "default behaviour", even though it means your string is less human-readable. You could argue that we shouldn't need to work around someone else's problem, but that's the trade-off they chose when designing S.T.Json.

A major reason to use JSON is that it's human readable...

I disagree on this one. JSON being a subset of JavaScript has a nice side-effect of being mostly human readable, but when exactly do you actually require it to be human readable, and how much of the total use case of JSON consists of that? I don't have any numbers but I bet the majority of the use case is for data interchange between entities or data storage, not in a direct human-interfacing way, like configuration files that people take a look into. I'm curious why you need JSON to be human-readable in your case?

If you really care about human-readability, JSON lacks a lot of features of a truly human-readable format, like single-quoted strings (to allow literal double quotes within a string), multi-line strings, and comments. For a human-readable format, you might want to use JSON5 or YAML instead, which are actually designed with human-readability in mind.

Gnbrkm41 commented 1 year ago

To summarize, I see two separate requests in this issue:

  1. Let's make S.T.Json not escape any characters (apart from the ones required) by default.
  2. There are characters that still get escaped even with UnsafeRelaxedJsonEscaping.

I disagree with the first request. The key point is this: whatever is set as the default behaviour has a huge impact. We don't necessarily know where the code is going to be used, and a lot of people assume the default behaviour is good for everything without thinking about what can go wrong, only to run into problems / vulnerabilities later due to clients that mishandle the data. And there are lots of people who mishandle text encodings / JSON data.

Therefore, for core framework code like S.T.Json, it is better to aggressively guard against those bad programs by default, so that it keeps working even with those flawed programs. If consumers of the library know what those guards actually mean and are certain that their code / whomever they're interacting with handles JSON data correctly without those flaws, they can remove the guard and get less-escaped JSON that works with their code.

The alternative to the escape-by-default behaviour is to add warnings in the API documentation saying "you should escape if you fall under this category...". But people don't read documentation and will inevitably run into issues. It is better to design APIs in a way that they just work without causing major trouble - the escape-everything-by-default behaviour works fine with any spec-compliant deserializer and also makes non-compliant deserializers work instead of break. The only inconvenience is human readability, but for the 'default behaviour', causing the least amount of issues should be the first goal, not human-readability.

Regarding the second request, I do think it would be nice to solve, because even with all I've said, I myself have used JSON in a human-facing situation a few times (for a config file). However, I'm not too sure how beneficial it would be, given the cost of implementing / maintaining such a feature compared to the benefit, which is really just the nicety of producing more human-friendly JSON. In the original issue about emoji (#42847), the two people who actually had problems with the escaping behaviour turned out to have problems elsewhere that had nothing to do with S.T.Json.

davidmatson commented 1 year ago

As mentioned in other issues, folks keep using Newtonsoft rather than switching to this API because of these issues.

In order to understand your requirements better, is this issue proposing something different compared to what #42847 is tracking?

Yes, that issue proposes having emoji/non-BMP characters be able not to be escaped, whereas this issue is about what is escaped by default.

davidmatson commented 1 year ago

Frankly, I'm not sure why you keep bringing 'language bias' into the discussion. It has nothing to do with the language. It is precisely just the encoding: how characters from those other languages are encoded.

Because this decision negatively impacts other languages, in terms of size-on-disk, size-in-memory, and readability, and we wrote code that goes out of our way to treat such languages differently.

Yes, we escape every non-ASCII character.

Having ASCII as the basis is more understandable than most of the way this is documented (which just talks about some Latin characters being the only ones allowed).

What happens if the JSON-encoded string goes through some generic text pre-processor before the JSON gets transmitted to the client, and that pre-processor doesn't know how to process non-ASCII characters properly and ends up mangling them? The same thing as with path handling. It shouldn't happen, I don't even know how they get it wrong, but because people make wrong assumptions it happens and things end up not working correctly. The wrong assumption doesn't have to be in your code - it could be in someone else's, and that's what we're trying to protect against by escaping by default.

If anything, I would argue that escaping by default protects us from the impact of the "biases" that other people's code has against non-English languages, which result in non-ASCII characters getting mishandled.

Imagine this situation: you want to communicate with an API that takes JSON text as the body. You use a hypothetical default escaping behaviour of not escaping anything, because "what can possibly go wrong", only to find out that somewhere in the client's code they manage to mishandle non-ASCII characters, and your request is now broken. Now you have to either A: make them fix their mistake, or B: escape your JSON so it gets handled even by their flawed system. The opposite (client code having problems handling escaped, all-ASCII text), however, is much less likely to happen, and escaping is therefore the better "default behaviour", even though it means your string is less human-readable. You could argue that we shouldn't need to work around someone else's problem, but that's the trade-off they chose when designing S.T.Json.

In the original issue about emoji (https://github.com/dotnet/runtime/issues/42847), the two people who actually had problems with the escaping behaviour turned out to have problems elsewhere that had nothing to do with S.T.Json.

We can't have it both ways here. If the justification for this approach is, "it shouldn't be needed, but we do it to avoid bugs in other people's code," that reasoning also applies to the emoji bug mentioned as well. The extra escaping also causes problems due to bugs in other people's code (in addition to the performance and readability impact). If we want to avoid such problems, we need to avoid unnecessary escaping.

If you really care about human-readability, JSON lacks a lot of features of a truly human-readable format, like single-quoted strings (to allow literal double quotes within a string), multi-line strings, and comments. For a human-readable format, you might want to use JSON5 or YAML instead, which are actually designed with human-readability in mind.

Let's avoid the "if you really care about" tone.

Choosing JSON for its human readability is fully compatible with choosing JSON over another format that might have additional advantages but also disadvantages.

I disagree on this one. JSON being a subset of JavaScript has a nice side-effect of being mostly human readable, but when exactly do you actually require it to be human readable, and how much of the total use case of JSON consists of that? I don't have any numbers but I bet the majority of the use case is for data interchange between entities or data storage, not in a direct human-interfacing way, like configuration files that people take a look into. I'm curious why you need JSON to be human-readable in your case?

I don't think .NET should be making assumptions about how customers use JSON. I used it for test data from other languages, and I couldn't read the data, making .NET's default unusable for my purpose. But one customer's specific purpose shouldn't be the issue - .NET shouldn't assume customers don't care about human-readability of JSON.

RE: "I disagree on this one. JSON being a subset of JavaScript has a nice side-effect of being mostly human readable,"

Note that the RFC errata indicates JSON isn't fully a subset of JavaScript.

More significantly, key sources say otherwise regarding the human-readability of JSON.

Wikipedia's first sentence in the article about JSON:

JSON (JavaScript Object Notation, pronounced /ˈdʒeɪsən/; also /ˈdʒeɪˌsɒn/) is an open standard file and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays (or other serializable values).

And JSON.org:

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

Note that "easy for humans to read and write" comes before "easy for machines to parse and generate."

Prior art here (Json.NET) maintains this property of "human-readable text" / being "easy for humans to read" in all languages.

davidmatson commented 1 year ago

To summarize, I see two separate requests in this issue:

  1. Let's make S.T.Json not escape any characters (apart from the ones required) by default.
  2. There are characters that still get escaped even with UnsafeRelaxedJsonEscaping.

This issue is specifically about the first item. I filed a separate bug regarding precisely the second item, #86463, and then, when the team repeatedly declined to re-open that issue, a request to provide an implementation that doesn't have this extra escaping, #86800, which the team closed as well.

I disagree with the first request. The key point is this: whatever is set as the default behaviour has a huge impact. We don't necessarily know where the code is going to be used, and a lot of people assume the default behaviour is good for everything without thinking about what can go wrong, only to run into problems / vulnerabilities later due to clients that mishandle the data. And there are lots of people who mishandle text encodings / JSON data.

The performance and human-readability implications are the issue here - because .NET "doesn't know where the code is going to be used," concluding performance and human-readability of non-English characters are unimportant seems unwarranted.

Therefore, for core framework code like S.T.Json, it is better to aggressively guard against those bad programs by default, so that it keeps working even with those flawed programs. If consumers of the library know what those guards actually mean and are certain that their code / whomever they're interacting with handles JSON data correctly without those flaws, they can remove the guard and get less-escaped JSON that works with their code.

As the second issue indicates, there is currently no easy way to do so. And, again, guarding against bad programs goes both ways; as noted above, over-escaping causes bad behavior with some programs as well.

The alternative to the escape-by-default behaviour is to add warnings in the API documentation saying "you should escape if you fall under this category...". But people don't read documentation and will inevitably run into issues. It is better to design APIs in a way that they just work without causing major trouble - the escape-everything-by-default behaviour works fine with any spec-compliant deserializer and also makes non-compliant deserializers work instead of break. The only inconvenience is human readability, but for the 'default behaviour', causing the least amount of issues should be the first goal, not human-readability.

Again, the issue of correctness goes both ways.

Regarding the second request, I do think it would be nice to solve, because even with all I've said, I myself have used JSON in a human-facing situation a few times (for a config file). However, I'm not too sure how beneficial it would be, given the cost of implementing / maintaining such a feature compared to the benefit, which is really just the nicety of producing more human-friendly JSON. In the original issue about emoji (#42847), the two people who actually had problems with the escaping behaviour turned out to have problems elsewhere that had nothing to do with S.T.Json.

We can't both claim these problems "had nothing to do with S.T.Json" and also have us "guard against" "bad programs by default."

As noted above, key descriptions of JSON emphasize human-readability, even over ease of machine generation. Trying out the maximal JSON encoder for a while may help to understand the significance of this issue. I'm hoping that if those making a decision primarily use English, we can work to get the same outcome that would have occurred if a non-English speaker were considering this issue.

eiriktsarpalis commented 1 year ago

I'm still struggling to understand what this issue is proposing compared to https://github.com/dotnet/runtime/issues/42847. We clearly can't make this the default since, beyond the concerns already voiced in this thread, it would also be a breaking change. So if this is about exposing a new JavaScriptEncoder implementation that doesn't encode surrogates or characters from the global block list, that is tracked by https://github.com/dotnet/runtime/issues/42847.

davidmatson commented 1 year ago

I'm still struggling to understand what this issue is proposing compared to #42847. We clearly can't make this the default since, beyond the concerns already voiced in this thread, it would also be a breaking change. So if this is about exposing a new JavaScriptEncoder implementation that doesn't encode surrogates or characters from the global block list, that is tracked by #42847.

This issue is about changing the default, given the negative impact on other languages.

Gnbrkm41 commented 1 year ago

Having ASCII as the basis is more understandable than most of the way this is documented (which just talks about some Latin characters being the only ones allowed).

It is documented as escaping all non-ASCII characters by default:

By default, the serializer escapes all non-ASCII characters.

Let's avoid the "if you really care about" tone. Choosing JSON for its human readability is fully compatible with choosing JSON over another format that might have additional advantages but also disadvantages.

I think we're just having different opinions here - and I can totally understand where you are coming from. I'll stop arguing about this point since I think this would lead to an endless debate.

I don't think .NET should be making assumptions about how customers use JSON.

While I do agree that we generally shouldn't have to care about someone else's problem, these kinds of decisions can have a wide impact on the consumers of the API, as it's used by millions of people. Security folks at MSFT apparently decided it is an easy-enough mistake to make and important enough to take such drastic measures, I guess (I also would like to hear more about the exact details / reasoning behind such measures, just for curiosity's sake).

Note that the RFC errata indicates JSON isn't fully a subset of JavaScript.

Technically true; I omitted saying that because it's pretty much an exact subset apart from those two characters, later versions of the language were changed to allow them, and I thought it was kind of irrelevant for our discussion.

davidmatson commented 1 year ago

I'm still struggling to understand what this issue is proposing compared to #42847. We clearly can't make this the default since, beyond the concerns already voiced in this thread, it would also be a breaking change. So if this is about exposing a new JavaScriptEncoder implementation that doesn't encode surrogates or characters from the global block list, that is tracked by #42847.

And regarding better options outside the default, #42847 is only a subset and more focused on changing the existing UnsafeRelaxedJsonEscaping implementation. #86800 is about providing a new implementation designed not to do more escaping than necessary - but that issue was closed, and no issue currently tracks the "provide an implementation with no unnecessary escaping" part.

davidmatson commented 1 year ago

While I do agree that we generally shouldn't have to care about someone else's problem, these kinds of decisions can have a wide impact on the consumers of the API, as it's used by millions of people. Security folks at MSFT apparently decided it is an easy-enough mistake to make and important enough to take such drastic measures, I guess (I also would like to hear more about the exact details / reasoning behind such measures, just for curiosity's sake).

Yeah, that's the main point here. I think the defense-in-depth security decision incorrectly chose to penalize other languages, just for a defense-in-depth measure that is widely avoided elsewhere (such as Json.NET and XmlSerializer). I think treating other languages equally is more important in choosing a default than extra defense-in-depth.

Gnbrkm41 commented 1 year ago

.... Apart from all the reasons why / why not it should escape by default, the thing is, I am 99% certain the idea of changing the default is simply not going to happen. .NET has strict guidelines for breaking changes, which include:

Bucket 2: Reasonable Grey Area

Change of behavior that customers would have reasonably depended on. Examples: ...

  • A different behavior is observed after the change for an input ...

These require judgment: how predictable, obvious, consistent was the behavior?

Given that the security folks made the explicit decision to make it escape a lot of characters by default & it is very well documented all around, I think it is very unlikely that this would pass the bar for making breaking changes.

davidmatson commented 1 year ago

I should also correct/clarify one earlier statement - this issue is not about escaping only the required characters by default. The Default implementation chose to escape some characters to provide defense in depth for devs who forget to HTML-escape when placing JSON inside HTML - for example, by escaping < and >. The choice to escape some symbols that are handled specially in other contexts is not the issue here. The only issue here is the choice to escape characters from other human languages by default, treating them as second-class compared to English characters.
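
For example (outputs per the current default behaviour):

// Defense-in-depth escaping of HTML-sensitive symbols - not in dispute here:
Console.WriteLine(JsonSerializer.Serialize("<b>"));    // "\u003Cb\u003E"

// Default escaping of non-English text - the subject of this issue:
Console.WriteLine(JsonSerializer.Serialize("日本語"));  // "\u65E5\u672C\u8A9E"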

davidmatson commented 1 year ago

.... Apart from all the reasons why / why not it should escape by default, the thing is, I am 99% certain the idea of changing the default is simply not going to happen. .NET has strict guidelines for breaking changes, which include:

Bucket 2: Reasonable Grey Area

Change of behavior that customers would have reasonably depended on. Examples: ...

  • A different behavior is observed after the change for an input ...

These require judgment: how predictable, obvious, consistent was the behavior?

Given that the security folks made the explicit decision to make it escape a lot of characters by default & it is very well documented all around, I think it is very unlikely that this would pass the bar for making breaking changes.

How likely would folks have taken a dependency on Greek characters being escaped by default? Is that likelihood more important than the negative default impact on users in other languages?

I understand back-compat is an important consideration. I'm hoping that decision is made after first fully understanding the impact on other languages, and then after providing a new implementation that would be the default if back-compat hadn't required otherwise.

davidmatson commented 1 year ago

Note also that the docs indicate users should not take a dependency on some details here, and should expect changes in every release:

The global block list is an implementation detail that has changed in every release of .NET Core and in .NET 5. Don't take a dependency on a character being a member of (or not being a member of) the global block list.

If we can change some aspects of what characters are escaped in each release, I think we should also consider whether changing Greek, for example, between escaped and non-escaped is the right decision.

Gnbrkm41 commented 1 year ago

How likely would folks have taken a dependency on Greek characters being escaped by default?

It doesn't have to be just Greek / any other language for that matter - if anyone took a dependency on the aggressive escaping behaviour and made assumptions / depended on it, their code will now be broken!

Note also that the docs indicate users should not take a dependency on some details here, and should expect changes in every release:

Right, but that only applies to the global block list, which is totally out of scope for what we are talking about (characters for other languages in general). As the document I quoted above says, the documentation clearly states that the default behaviour is to escape all non-ASCII characters.

just for a defense-in-depth measure that is widely avoided elsewhere (such as Json.NET and XmlSerializer)

I'd like to think differently - XML has a way to declare the charset used to encode the document (<?xml version="1.0" encoding="UTF-8"?>), so the encoding problem is much less likely to happen. JSON.NET started out as a personal project before growing into what it is now; it was first released many years ago, when the JSON format was still relatively new (2006!), so the issues were likely not well known when the default behaviour got established. The built-in escape support got added 6 years after its initial release, so the idea of escaping was out of the question back then. Designing a completely new API from scratch, with a better understanding of what can go wrong, gives us an opportunity to review those potential pitfalls and design around them - as part of a core library depended on by millions of people, we definitely should have knowledge about how the API is going to be used and what can go wrong, so we can prevent people from making mistakes.

eiriktsarpalis commented 1 year ago

Irrespective of what you believe should have been the correct default behavior, is there a specific use case where you are being blocked by the current defaults?

davidmatson commented 1 year ago

Irrespective of what you believe should have been the correct default behavior, is there a specific use case where you are being blocked by the current defaults?

Default behavior is not blocking - one can opt out of the default.

However, getting to what I believe should be the correct default is currently very difficult: 1) even opting in to the other built-in implementation doesn't resolve it (#86463); 2) even if it did, folks will wrongly think it's "unsafe" (#87138); 3) there is no built-in implementation that just follows the specification's requirements (#86800); and 4) writing such an implementation is unnecessarily complicated (#86810), including needing to go back to the spec to encode certain characters or else produce broken JSON output, because the current interface mixes the concern of choosing which optional characters to encode (and how) with the concern of producing valid JSON at all.

Gnbrkm41 commented 1 year ago

I really do think most languages get covered by using UnsafeRelaxedJsonEscaping though, and it is simple enough (taken from the docs):

var options3 = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
};
jsonString = JsonSerializer.Serialize(weatherForecast, options3);

If you think typing JsonSerializerOptions for every occurrence is cumbersome, note that the options object can be stored somewhere (in fact, it's recommended to reuse options instead of creating new ones every time).
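
For example, one way to cache it (the class and field names here are just illustrative):

static class JsonDefaults
{
    // One shared instance: JsonSerializerOptions caches serialization metadata
    // internally, so reuse is also faster than constructing new options per call.
    public static readonly JsonSerializerOptions Relaxed = new JsonSerializerOptions
    {
        Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
    };
}

// ...elsewhere:
jsonString = JsonSerializer.Serialize(weatherForecast, JsonDefaults.Relaxed);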

Putting the discussion about what the correct default behaviour should be aside, is there anything else that prevents you from just using UnsafeRelaxedJsonEscaping, behaviour-wise? Apart from #42847, I can't clearly see why there needs to be a new API surface (as proposed by #86810, #86800, or #86463), or why there needs to be an implementation that doesn't escape anything other than what the spec requires.

davidmatson commented 1 year ago

How likely would folks have taken a dependency on Greek characters being escaped by default?

It doesn't have to be just Greek / any other language for that matter - if anyone took a dependency on the aggressive escaping behaviour and made assumptions / depended on it, their code will now be broken!

My point is about what would happen if we changed specifically Greek, not if we changed things like >, ', " (as hex), and <. To be very clear, this issue is not about changing the aggressive escaping behavior of such symbols.

Note also that the docs indicate users should not take a dependency on some details here, and should expect changes in every release:

Right, but that only applies to the global block list, which is totally out of scope for what we are talking about (characters for other languages in general). As the document I quoted above says, the documentation clearly states that the default behaviour is to escape all non-ASCII characters.

The point is we do expect some default encoding behavior to change with each release. Given the impact on other languages, I think it's worth considering whether this specific default encoding behavior can also change - communicating "stuff in this overall space changes all the time; don't take dependencies" likely makes back-compat a less severe question than usual.

I'd like to think differently - XML has a way to declare which charset a document is encoded with (<?xml version="1.0" encoding="UTF-8"?>), so the encoding problem is much less likely to happen there.

I don't think that understanding is correct. Both the RFC and ECMA specs define JSON as a sequence of Unicode characters/code points; it doesn't need an "encoding" attribute to make it Unicode, since that's part of the spec. (The RFC, unlike ECMA, even goes so far as to require specifically UTF-8 for JSON exchanged between systems.)
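(As an aside, System.Text.Json itself works in UTF-8 natively; for example, it can serialize straight to UTF-8 bytes with no charset declaration anywhere in sight:)

using System.Text.Json;

// The output is UTF-8 end to end; no "encoding" attribute is needed.
byte[] utf8Json = JsonSerializer.SerializeToUtf8Bytes(new[] { "日本語" });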

JSON.NET started out as a personal project before it grew into what it is now; it was first released back when the JSON format was still relatively new (2006!), so the issues were likely not well known when the default behaviour got established. [The built-in escape support got added six years after its initial release](https://github.com/JamesNK/Newtonsoft.Json/commit/e3be3e43dbb94e50d6c539d27872e17727c733ea), so escaping wasn't even in the picture back then. Designing a completely new API from scratch, with a better understanding of what can go wrong, gives us an opportunity to review those pitfalls and design around them - as part of a core library depended on by millions of people, we should know how the API is going to be used and what can go wrong, so we can prevent people from making mistakes.

As noted above, we're trading one set of bugs (in others' code) for another, at the expense of other languages.

davidmatson commented 1 year ago

I really do think most languages get covered by using UnsafeRelaxedJsonEscaping, though, and opting in is simple enough (taken from the docs):

var options3 = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
};
jsonString = JsonSerializer.Serialize(weatherForecast, options3);

If you think typing JsonSerializerOptions for every occurrence is cumbersome, note that the options object can be stored somewhere (in fact, reusing options is recommended over creating new ones every time).

Putting the discussion about what the correct default behaviour should be aside, is there anything else that prevents you from just using UnsafeRelaxedJsonEscaping, behaviour-wise? Apart from #42847, I can't clearly see why there needs to be a new API surface (proposed by #86810, #86800, or #86463), or why there needs to be an implementation that doesn't escape anything other than what the spec requires.

UnsafeRelaxedJsonEscaping doesn't handle all the languages that matter, and simply adding Emoji won't fix that problem. As the above comment notes, folks wrongly think it's unsafe to use that class anyway, and implementing the desired behavior correctly is complicated and risks generating non-JSON output.
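A small repro of the language gap (as I understand the current behavior, anything outside the Basic Multilingual Plane is still escaped even with UnsafeRelaxedJsonEscaping):

using System.Text.Encodings.Web;
using System.Text.Json;

var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
};

// U+1110C (a Chakma letter) is outside the BMP.
Console.WriteLine(JsonSerializer.Serialize("\U0001110C", options));
// Expected to print "\uD804\uDD0C" rather than the character itself.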

davidmatson commented 1 year ago

I really do think most languages get covered by using UnsafeRelaxedJsonEscaping, though,

I believe .NET can do better than covering only most languages, and even then not via the default, especially when we're writing in a data format defined as Unicode. The only reason we're not covering all languages is the extra code we've written.

davidmatson commented 1 year ago

Irrespective of what you believe should have been the correct default behavior, is there a specific use case where you are being blocked by the current defaults?

But this issue is specifically about what the default behavior should be. The other issues cover other questions, such as what non-default implementation is available, and how hard a custom implementation is to write. Let's consider those questions there.

Gnbrkm41 commented 1 year ago

So the decision really falls down to:

  1. We do the minimal amount of work required, and let the users decide whether they need additional countermeasures against those problems.
  2. We go out of our way to prevent people from shooting themselves in the foot. If they wish to opt out of those safety measures, they can do so, but at their own risk (ideally, assessing the risk before doing so).

The approach taken in S.T.Json is 2. Because it is an opt-out system, users get to learn what kinds of risks exist before opting out of escaping, assess whether their situation is affected, and then use the code if it is safe to do so. The reason it has Unsafe in its name is the same reason you need unsafe to use pointers in C#. Using it doesn't inherently make your code unsafe (if it did, it shouldn't be in the BCL) - it is a powerful tool that definitely has its uses, but you must fully understand the implications and pitfalls of choosing it. It is a strong suggestion that you must know what you are doing instead of blindly using it everywhere. If you don't have problems with encodings, fine, use the type! But otherwise you need to be careful with it (clearly the warning is doing its job too well 😄).

What you are suggesting appears to be 1 (not the languages part - the minimal escaping APIs and the like), but it can be hard to even know such problems might exist in the first place without someone telling you, and when the problem does surface, it's going to be too late (and can have serious implications). For personal projects / third-party libraries, leaving such decisions to users might be fine, but I don't think it meets the bar for the BCL.

simply adding Emoji won't fix that problem.

I think that issue should be retitled - the underlying limitation is that we never allow anything outside the BMP unescaped, because the cost of adding the table of allowed characters would be too immense; it's not just because they're emojis. If we fix it for emojis, we ought to fix it for other SMP characters as well.

Whether or not to make that the default behaviour aside, to be crystal clear: it is unclear whether you want an implementation that escapes nothing beyond what the spec requires, or implementations that don't escape the common, reasonable alphabet characters used in many different countries. I believe you want the latter, and I would agree with that, but your other issues and suggestions seem to suggest you also want the former, so I'm confused. I assume most of the "unsafeness" comes from HTML escapes and potentially other invisible / control characters (assuming the encoding is handled properly; correct me if I'm wrong) - would providing a built-in encoder implementation that leaves most (not necessarily English) alphabets and the like alone, but escapes those potentially dangerous special characters, solve your issue?

But then again, the docs for UnsafeRelaxedJsonEscaping explicitly mention that you may get XSS / information-disclosure attacks if you happen to mix up encodings, so I guess it happens frequently enough in the real world to warrant such measures... I really don't know enough about this particular one. Perhaps what we really need is better documentation for the type, mentioning when it is safe to use and such?

Gnbrkm41 commented 1 year ago

Thinking about it while writing: I guess it not being the default, the name having Unsafe, and the docs really do mean just that - don't use it unless you are 100% certain the text encoding is not going to be mixed up and you are not including the output as part of HTML (as in: you'd think your software / stack is going to be fine, but you'd better double-check, without assumptions, that everything handles it correctly without mixing up encodings). I guess the risk seems too unreal / irrelevant for people to consider, and they don't immediately realise it really just means that, so they get spooked when they see the word Unsafe as part of the type name.

davidmatson commented 1 year ago

Thinking about it while writing: I guess it not being the default, the name having Unsafe, and the docs really do mean just that - don't use it unless you are 100% certain the text encoding is not going to be mixed up and you are not including the output as part of HTML (as in: you'd think your software / stack is going to be fine, but you'd better double-check, without assumptions, that everything handles it correctly without mixing up encodings). I guess the risk seems too unreal / irrelevant for people to consider, and they don't immediately realise it really just means that, so they get spooked when they see the word Unsafe as part of the type name.

To clarify, I think that's a question about what we name other implementations, rather than a question of what the default behavior is. I'll leave that discussion for #87138 to avoid making the discussion here, about the default behavior, even longer : )

So the decision really falls down to:

  1. We do the minimal amount of work required, and let the users decide whether they need additional countermeasures against those problems.
  2. We go out of our way to prevent people from shooting themselves in the foot. If they wish to opt out of those safety measures, they can do so, but at their own risk (ideally, assessing the risk before doing so).

The approach taken in S.T.Json is 2. Because it is an opt-out system, users get to learn what kinds of risks exist before opting out of escaping, assess whether their situation is affected, and then use the code if it is safe to do so. The reason it has Unsafe in its name is the same reason you need unsafe to use pointers in C#. Using it doesn't inherently make your code unsafe (if it did, it shouldn't be in the BCL) - it is a powerful tool that definitely has its uses, but you must fully understand the implications and pitfalls of choosing it. It is a strong suggestion that you must know what you are doing instead of blindly using it everywhere. If you don't have problems with encodings, fine, use the type! But otherwise you need to be careful with it (clearly the warning is doing its job too well 😄).

What you are suggesting appears to be 1 (not the languages part - the minimal escaping APIs and the like), but it can be hard to even know such problems might exist in the first place without someone telling you, and when the problem does surface, it's going to be too late (and can have serious implications). For personal projects / third-party libraries, leaving such decisions to users might be fine, but I don't think it meets the bar for the BCL.

I think either option 1 or 2 is compatible with this question, which is just about whether we escape characters in other languages, not about whether we escape symbols. I'm not suggesting that .NET change the default behavior of escaping things like HTML symbols.

simply adding Emoji won't fix that problem.

I think that issue should be retitled - the underlying limitation is that we never allow anything outside the BMP unescaped, because the cost of adding the table of allowed characters would be too immense; it's not just because they're emojis. If we fix it for emojis, we ought to fix it for other SMP characters as well.

Yeah, if we could retitle that issue, say, to match #86800, that would be great!

Whether or not to make that the default behaviour aside, to be crystal clear: it is unclear whether you want an implementation that escapes nothing beyond what the spec requires, or implementations that don't escape the common, reasonable alphabet characters used in many different countries. I believe you want the latter, and I would agree with that, but your other issues and suggestions seem to suggest you also want the former, so I'm confused. I assume most of the "unsafeness" comes from HTML escapes and potentially other invisible / control characters (assuming the encoding is handled properly; correct me if I'm wrong) - would providing a built-in encoder implementation that leaves most (not necessarily English) alphabets and the like alone, but escapes those potentially dangerous special characters, solve your issue?

In this issue: an implementation that doesn't escape the alphabet characters of any language.
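The closest existing approximation I'm aware of (still BMP-only, and still applying the built-in block list for HTML-sensitive characters) is building an encoder over all the named ranges:

using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

// Letters in any BMP script come through unescaped (unless block-listed);
// characters outside the BMP are still escaped - the gap that remains.
var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.Create(UnicodeRanges.All),
};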

I'd also like an inbox implementation that doesn't escape anything beyond what's required (#86800), but that's different from this question of what the default should be.

But then again, the docs for UnsafeRelaxedJsonEscaping explicitly mention that you may get XSS / information-disclosure attacks if you happen to mix up encodings, so I guess it happens frequently enough in the real world to warrant such measures... I really don't know enough about this particular one. Perhaps what we really need is better documentation for the type, mentioning when it is safe to use and such? Thinking about it while writing: I guess it not being the default, the name having Unsafe, and the docs really do mean just that - don't use it unless you are 100% certain the text encoding is not going to be mixed up and you are not including the output as part of HTML (as in: you'd think your software / stack is going to be fine, but you'd better double-check, without assumptions, that everything handles it correctly without mixing up encodings). I guess the risk seems too unreal / irrelevant for people to consider, and they don't immediately realise it really just means that, so they get spooked when they see the word Unsafe as part of the type name.

Yeah, I think that's a question about #87138 rather than about the default behavior here. Agreed that docs are part of the answer.

eiriktsarpalis commented 1 year ago

Closing in favor of https://github.com/dotnet/runtime/issues/87153.