dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.85k stars 4.62k forks source link

non-UTF8 encoded JSON docs to throw NotSupportedException #1572

Open am11 opened 5 years ago

am11 commented 5 years ago

I am in the process of adding System.Text.Json parser to JSONTestSuite: https://github.com/am11/JSONTestSuite/tree/feature/dotnet-system-text-json/parsers/test_dotnet_system_text_json.

Currently, the following four tests are failing:

  1. i_string_UTF-16LE_with_BOM.json
    • System.Text.Json.JsonReaderException: '0xFF' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.

  2. i_string_utf16BE_no_BOM.json
    • System.Text.Json.JsonReaderException: '0x00' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.

  3. i_string_utf16LE_no_BOM.json
    • System.Text.Json.JsonReaderException: '0x00' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 1.

  4. i_structure_500_nested_arrays.json
    • System.Text.Json.JsonReaderException: The maximum configured depth of 64 has been exceeded. Cannot read next JSON array. LineNumber: 0 | BytePositionInLine: 64.

    • AFAICT, this limit is not adjustable by API consumer.

Should NotSupportedException be thrown in case of 1, 2 and 3 with message indicating that non-UTF8 encodings are unsupported?

I found two places, where similar white-listing / checking for encoding is employed:

  1. https://github.com/dotnet/corefx/blob/f4344b0d3b45fc7c8f41e5c909ddcb2752bea0fa/src/System.Private.DataContractSerialization/src/System/Xml/EncodingStreamWrapper.cs#L22
  2. https://github.com/dotnet/corefx/blob/f4344b0d3b45fc7c8f41e5c909ddcb2752bea0fa/src/System.Private.DataContractSerialization/src/System/Runtime/Serialization/Json/JsonEncodingStreamWrapper.cs#L57-L63
ahsonkhan commented 5 years ago

I am in the process of adding System.Text.Json parser to JSONTestSuite

Awesome! cc @layomia

From the underlying Utf8JsonReader's perspective:

Should NotSupportedException be thrown in case of 1, 2 and 3 with message indicating that non-UTF8 encodings are unsupported?

Maybe, but the reader is called Utf8JsonReader, which implies it expected UTF-8 data (it's in the type name). Also, I don't know if encoding detection would be accurate and not hurt the performance of the common path. I think JsonReaderException for invalid JSON makes sense and when I see a NotSupportedException, I tend to think that the scenario could potentially be supported, and might be in the future, but the Utf8JsonReader would not support non-UTF8 encodings.

AFAICT, this limit is not adjustable by API consumer.

JsonReaderOptions lets you set the MaxDepth, that can be passed in to the reader: https://github.com/dotnet/corefx/blob/a1578465f68f6793e7498152dc2d0d2f4eed3a4e/src/System.Text.Json/ref/System.Text.Json.cs#L335

Similarly, the JsonDocumentOptions has the same setting.

That said, from the JsonDocument's perspective, which is currently encoding agnostic, it might be worth documenting clearly that the UTF-16 encoded text is currently not supported (with/without BOM), and maybe we can detect that up-front and throw a more meaningful exception.

cc @bartonjs - thoughts on making JsonDocument throw NSE for UTF-16 text?

am11 commented 5 years ago

MaxDepth

Thanks, I was looking at the wrong places with wrong keywords for this option. 😄 #4 is now fixed in branch.

bartonjs commented 5 years ago

thoughts on making JsonDocument throw NSE for UTF-16 text?

If that's the plan for the future we should change the name to stream (currently it's utf8Stream). And, really, once it's going to throw the exception it could just finish with using the correct encoding to transform to char, and call the char-based overload with the resulting string. (It's memory-aggressive, but might be better than an NSE?)

bartonjs commented 5 years ago

I think it's a reasonable change, though. If we make it like today.

am11 commented 5 years ago

And, really, once it's going to throw the exception it could just finish with using the correct encoding to transform to char, and call the char-based overload with the resulting string.

IMO, that would be a reasonable thing to do and will address https://github.com/dotnet/corefx/issues/39433.

(It's memory-aggressive, but might be better than an NSE?)

This could be an acceptable tradeoff for legacy interoperability, as the latest JSON format RFC 8259 mentions:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.

ahsonkhan commented 4 years ago

Given where we are at for 3.0, this doesn't meet the bar, particularly because it requires a breaking change to rename the parameter from utf8Json to stream.

JsonDocument and JsonSerializer accept streams that only contain UTF-8 encoded data (either if a UTF-8 BOM is present OR no BOM is present), and that's indicated by the parameter name.

We can re-consider making this change in the future and introduce logic to do BOM-detection for other encodings (UTF-16-LE/BE and UTF-32-LE/BE). Alternatively, we can consider adding the capability for the user to specify an encoding to fully support UTF-16 text (even if there is no UTF-16 BOM detected), similar to StreamReader (and to address https://github.com/dotnet/corefx/issues/39433).

Moving to 5.0.

ahsonkhan commented 4 years ago

Given we are planning to add a transcoding stream (https://github.com/dotnet/runtime/issues/30260), there is less need for supporting non-UTF-8 encoded streams in the serializer directly. The primary benefit of doing it natively would be performance. Moving to future.

Yomodo commented 3 years ago

Exactly the exception I get. Too bad.

var gitDesc = new GitVersionDescriptor {Version = "main"};

         foreach (var item in items.OrderBy(j => j.Path))
         {
            var fileName = Path.GetFileName(item.Path);

            using var stream = _gitClient.GetItemContentAsync(repo.Id, $"/{TemplatesRoot}/{fileName}", versionDescriptor: gitDesc);

            using var task5 = JsonDocument.ParseAsync(await stream, MyModule.JsonDocumentOptions);
            using var document = await task5;

            yield return document.RootElement.Clone();
         }

System.Text.Json.JsonReaderException: '0xFF' is an invalid start of a value. LineNumber: 0 | BytePositionInLine: 0.
   at System.Text.Json.ThrowHelper.ThrowJsonReaderException(Utf8JsonReader& json, ExceptionResource resource, Byte nextByte, ReadOnlySpan`1 bytes)
   at System.Text.Json.Utf8JsonReader.ConsumeValue(Byte marker)
   at System.Text.Json.Utf8JsonReader.ReadFirstToken(Byte first)
   at System.Text.Json.Utf8JsonReader.ReadSingleSegment()
   at System.Text.Json.Utf8JsonReader.Read()
   at System.Text.Json.JsonDocument.Parse(ReadOnlySpan`1 utf8JsonSpan, JsonReaderOptions readerOptions, MetadataDb& database, StackRowStack& stack)
   at System.Text.Json.JsonDocument.Parse(ReadOnlyMemory`1 utf8Json, JsonReaderOptions readerOptions, Byte[] extraRentedBytes)
   at System.Text.Json.JsonDocument.ParseAsyncCore(Stream utf8Json, JsonDocumentOptions options, CancellationToken cancellationToken)