dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

`JsonEncodingStreamWrapper` cannot handle UTF16 BOMs #80160

Open dougbu opened 1 year ago

dougbu commented 1 year ago

Description

If DataContractJsonSerializer is given a non-UTF8 Stream containing a byte order mark and not given a specific encoding, it will attempt auto-detection of the encoding. This eventually calls the code at https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.DataContractSerialization/src/System/Runtime/Serialization/Json/JsonEncodingStreamWrapper.cs#L474-L503 and incorrectly "detects" UTF8, causing later decoding issues.

Reproduction Steps

  1. dotnet new console --output DCSTest
  2. Add <PackageReference Include="System.Runtime.Serialization.Json" Version="4.3.0" /> to an item group in the project file
  3. Place the code below in Program.cs
  4. dotnet run
using System.Runtime.Serialization.Json;
using System.Text;

using var stream = new MemoryStream();
using var writer = new StreamWriter(stream, new UnicodeEncoding(bigEndian: false, byteOrderMark: true));
writer.WriteLine("{ \"AnInt\": 42 }");
writer.Flush();
stream.Position = 0;

var serializer = new DataContractJsonSerializer(typeof(Simple));
serializer.ReadObject(stream);

public class Simple
{
    int AnInt { get; set; }
}

Expected behavior

Successful execution of the project.

Actual behavior

Program throws:

Unhandled exception. System.Runtime.Serialization.SerializationException: There was an error deserializing the object of type Simple. Encountered unexpected character 'ÿ'.
 ---> System.Xml.XmlException: Encountered unexpected character 'ÿ'.
   at System.Xml.XmlExceptionHelper.ThrowXmlException(XmlDictionaryReader reader, XmlException exception)
   at System.Runtime.Serialization.Json.XmlJsonReader.ReadAttributes()
   at System.Runtime.Serialization.Json.XmlJsonReader.Read()
   at System.Xml.XmlBaseReader.IsStartElement()
   at System.Xml.XmlBaseReader.IsStartElement(XmlDictionaryString localName, XmlDictionaryString namespaceUri)
   at System.Runtime.Serialization.Json.DataContractJsonSerializer.InternalIsStartObject(XmlReaderDelegator reader)
   at System.Runtime.Serialization.Json.DataContractJsonSerializer.InternalReadObject(XmlReaderDelegator xmlReader, Boolean verifyObjectName)
   at System.Runtime.Serialization.XmlObjectSerializer.ReadObjectHandleExceptions(XmlReaderDelegator reader, Boolean verifyObjectName, DataContractResolver dataContractResolver)
   --- End of inner exception stack trace ---
   at System.Runtime.Serialization.XmlObjectSerializer.ReadObjectHandleExceptions(XmlReaderDelegator reader, Boolean verifyObjectName, DataContractResolver dataContractResolver)
   at Program.<Main>$(String[] args) in C:\dd\Projects\DCSTest\Program.cs:line 11

Regression?

Maybe. I haven't tested w/ older versions of the package. [Edit - @StephenMolloy] Looks like this has been an issue since at least 4.8, so not a regression.

Known Workarounds

  1. Remove the byte order mark before passing anything to the serializer
  2. Use Encoding.GetTranscodedStream if targeting a recent-enough TFM (5.0 or later) instead of relying on the serializer to auto-detect
  3. Use JsonReaderWriterFactory if targeting pretty much anything other than pre-netstandard2.0 TFMs and specify the encoding explicitly

My core recommendation here is actually to remove JsonEncodingStreamWrapper and the XML EncodingStreamWrapper. Use TranscodingStream under the covers. And detect the encoding (when necessary) some other way, perhaps using something bulletproof like DetectEncoding() in StreamReader.

That recommendation relates to my need to use DCS for both JSON and XML in netstandard1.3 projects. In addition, this would remove unnecessary encoding restrictions, support non-UTF8 XML deserialization without the (silly?) XML declaration requirement, and simplify your code.

Configuration

No response

Other information

No response

teo-tsirpanis commented 1 year ago

System.Runtime.Serialization.Json is a super old NuGet package from the .NET Core 1.x era and does not need to be installed in modern .NET applications. Not that it would fix the bug, the inbox assemblies would still be used because they have a higher version.

dougbu commented 1 year ago

System.Runtime.Serialization.Json is a super old NuGet package from the .NET Core 1.x era and does not need to be installed in modern .NET applications. Not that it would fix the bug, the inbox assemblies would still be used because they have a higher version.

  1. the package remains necessary when targeting netstandard1.3 even in very new projects
  2. the package remains necessary in legacy .NET Framework projects (those not using a .NET SDK)
  3. your point doesn't address the odd XML declaration requirement (even when a BOM exists in an XML stream) nor the limited encoding support (UTF8, UTF16BE, UTF16LE) support in both Serialization.Json or Serialization.Xml.
teo-tsirpanis commented 1 year ago
  1. True, but .NET Standard 1.x does not qualify as "modern". 😅
  2. Just tested it, the package is not necessary, a reference to the inbox System.Runtime.Serialization assembly suffices.
  3. Also true, I had acknowledged that it's unrelated to the bug you reported, it was just an observation about the package's use.
StephenMolloy commented 1 year ago

To help with triage, I quickly ran the repro on a few different runtimes... it seems this has been an issue going back to at least 4.8.