Closed MarkPflug closed 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-xml See info in area-owners.md if you want to be subscribed.
Author: | MarkPflug |
---|---|
Assignees: | - |
Labels: | `area-System.Xml`, `tenet-performance`, `untriaged` |
Milestone: | - |
This is very typical but the culprit is not the XML reader but the ZIP decompression. It doesn't have the same internal buffering that it had on .NET Framework. Try wrapping the sheetStream
in BufferedStream
to see how it behaves. We observed the same thing with code in ML.NET.
@filipnavara Good thought, but that doesn't seem to be related. I added two more benchmarks. The first was to explore your BufferedStream
suggestion. The second was to remove the compression from the equation and read directly from the sheet1.xml.
[Benchmark]
public void ZipXmlBuffered()
{
using var stream = File.OpenRead(file);
var zipArchive = new ZipArchive(stream);
var sheetEntry = zipArchive.GetEntry("xl/worksheets/sheet1.xml");
using var sheetStream = sheetEntry.Open();
using var bufferedStream = new BufferedStream(sheetStream, 0x10000);
var r = XmlReader.Create(bufferedStream);
while (r.Read()) ;
}
[Benchmark]
public void Xml()
{
var r = XmlReader.Create("Data/sheet1.xml");
while (r.Read()) ;
}
Running on .NET 6.0: | Method | Mean | Error | StdDev | Allocated |
---|---|---|---|---|---|
Xml | 523.35 ms | 9.990 ms | 1.546 ms | 141 KB | |
ZipXml | 568.54 ms | 17.753 ms | 6.331 ms | 236 KB | |
ZipXmlBuffered | 570.30 ms | 24.874 ms | 8.870 ms | 305 KB | |
Zip | 48.46 ms | 0.860 ms | 0.223 ms | 76 KB |
Same benchmark running against .NET 5.0:
Method | Mean | Error | StdDev | Allocated |
---|---|---|---|---|
Xml | 265.45 ms | 30.412 ms | 10.845 ms | 141 KB |
ZipXml | 300.84 ms | 7.078 ms | 1.838 ms | 240 KB |
ZipXmlBuffered | 297.89 ms | 15.315 ms | 5.462 ms | 304 KB |
Zip | 51.62 ms | 0.718 ms | 0.256 ms | 76 KB |
I tried other combinations of constructing the XmlReader
with a FileStream with buffersize = 1 (which as I understand, disables buffering) with a 64k buffer, with a StreamReader
with different buffer sizes. Nothing seems to make much of a difference. I'm happy to be wrong, but it really looks like the culprit remains in the XmlReader.Read
method.
Can you share the source code of the benchmark please? (Sorry, I accidentally scrolled over the part where you posted it.) For the record, this is the issue a member of our team filed for ML.NET. It is nearly identical scenario of XML within a ZIP file.
To clarify one thing: this issue exhibits between .NET 5.0 and .NET 6.0. My mention of .NET framework was only because that is what brought it to my attention.
XmlReader.Read
is the core XML API for .NET so I assume it should be considered performance-critical. This performance regression can also be seen in APIs such as XmlDocument.Load
for example, which for the same file goes from 1,649ms on .NET 5.0 to 1,893ms on .NET 6.0.
Looking at the change history on XmlTextReaderImpl, there appear to be two significant changes in the last year (roughly since 5.0). Both were to cleanup unsafe code.
The first one appears to remove unsafe code from the XmlCharType
, which is used to categorize characters for the parser. I could easily see this one causing a performance regression, as it seems to be a core part of the parser. However, it looks like great care was taken to preserve the performance of the original.
https://github.com/dotnet/runtime/commit/f1f7681464bc8e6662e15242d3babca75c914565
The "unsafe" code:
//internal byte* charProperties;
charProperties[str[i]]
Was replaced with "Unsafe" code:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetCharProperties(char ch) => Unsafe.Add(ref MemoryMarshal.GetReference(CharProperties), ch);
I think this "Unsafe" version is attempting to avoid the bounds check as that of the original pointer version.
This second change removes unsafe code from AdjustLineInfo
.
https://github.com/dotnet/runtime/commit/997174511ab98c4fa409912db17ce18e7116b126
It doesn't appear to make any great effort to avoid bounds checks, but I don't know how important that is, as I don't know how often it's called. If it's called frequently, perhaps its enough to show the performance degredation?
@MarkPflug can you try following:
new MemoryStream(File.ReadAllBytes(...))
)if that will still produce the same bad results I'll take a look and possibly revert that change. I can't promise it will make it to 6.0
@krwq I added two more benchmarks, one with MemoryStream as you requested, and one that uses a StringReader in an attempt to remove yet another layer of IO abstraction:
[Benchmark]
public void XmlMemoryStream()
{
// memStream contains the file bytes
memStream.Seek(0, SeekOrigin.Begin);
var r = XmlReader.Create(memStream);
while (r.Read()) ;
}
[Benchmark]
public void XmlStringReader()
{
// xml is the string from File.ReadAllText
var ms = new StringReader(xml);
var r = XmlReader.Create(ms);
while (r.Read()) ;
}
This doesn't change the performance significantly. Results:
Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
Xml | 511.39 ms | 8.539 ms | 1.321 ms | - | - | - | 348 KB |
XmlMemoryStream | 517.00 ms | 22.310 ms | 7.956 ms | - | - | - | 140 KB |
XmlStringReader | 505.42 ms | 5.684 ms | 1.476 ms | - | - | - | 155 KB |
XmlLoad | 1,797.19 ms | 63.593 ms | 16.515 ms | 73000.0000 | 27000.0000 | 4000.0000 | 422,857 KB |
ZipXml | 559.06 ms | 6.883 ms | 0.377 ms | - | - | - | 236 KB |
ZipXmlBuffered | 565.74 ms | 37.710 ms | 13.448 ms | - | - | - | 305 KB |
Zip | 48.02 ms | 0.539 ms | 0.083 ms | - | - | - | 76 KB |
@MarkPflug I'm seeing that when they did https://github.com/dotnet/runtime/pull/41756 they've run some benchmarks and there wasn't any significant difference. I'm curious, could you try going back with git into state where this PR was merged and run the benchmark against your XML file? We should figure out if there was some other change which could have regressed this or if there is something specific in your XML which they've missed out when testing. Either way we should permanently add this benchmark and improve this back
Uh, I've never built the runtime before, so I don't really know what I need to do. I pulled the repo and ran build.cmd. This took quite a while, but seemed to finish successfully. I then checked out a commit prior to the the #41756 version and ran build.cmd again. This time it failed with:
EXEC : error : DIA SDK is missing at "C:\Program Files\Microsoft Visual Studio\2022\Preview\DIA SDK". Did you install all the requirements for building on Windows, including the "Desktop Development with C++" workload?
I don't really think I'm going to be able to diagnose this.
@MarkPflug please see https://github.com/dotnet/runtime/blob/main/docs/workflow/requirements/windows-requirements.md for windows build prerequisites and instructions to build.
I'm not able to reproduce this regression on my machine; in fact, .NET 6.0 is fastest.
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1348 (21H2)
Intel Core i7-10875H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.100
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Job-VFTWLF : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
Job-EVXDCZ : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Job-DHEGHK : .NET Framework 4.8 (4.8.4420.0), X64 RyuJIT
Method | Job | Runtime | Toolchain | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|---|---|
ZipXml | Job-VFTWLF | .NET 5.0 | net5.0 | 316.1 ms | 4.77 ms | 8.61 ms | 1.00 | 0.00 |
ZipXml | Job-EVXDCZ | .NET 6.0 | net6.0 | 282.0 ms | 5.63 ms | 10.15 ms | 0.89 | 0.03 |
ZipXml | Job-DHEGHK | .NET Framework 4.8 | net48 | 332.6 ms | 2.78 ms | 2.32 ms | 1.03 | 0.04 |
XmlStringReader | Job-VFTWLF | .NET 5.0 | net5.0 | 241.8 ms | 1.95 ms | 1.63 ms | 1.00 | 0.00 |
XmlStringReader | Job-EVXDCZ | .NET 6.0 | net6.0 | 205.5 ms | 3.91 ms | 4.95 ms | 0.86 | 0.02 |
XmlStringReader | Job-DHEGHK | .NET Framework 4.8 | net48 | 235.8 ms | 3.35 ms | 2.97 ms | 0.98 | 0.01 |
@bgrainger is correct, the issue doesn't exist in 6.0. I figured out why my measurements were wrong: "COMPlus_TieredCompilation" was set to "0" on my machine. I don't remember ever setting that, but I guess I must have. It seems that this particular xml code path is greatly affected by that setting in 6.0, whereas it wasn't in previous versions. Anyway, deleting that env var fixes this "regression". Whoops, sorry for wasting everyone's time.
Description
In adding support for .NET framework 4.8 to a library I maintain I was surprised to find that it was benchmarking significantly faster than the same code running on .NET 6.0. I narrowed the issue down to what appears to be a performance regression in
XmlTextReaderImpl.Read()
in .NET 6.0 specifically. The library implements a data reader for .xlsx files, and so requires unzipping and reading xml. I'd created baseline benchmarks that decompress and read the xml nodes to understand where my performance floor was:The exact file I use is the .xlsx file here. The benchmark results of running these on my machine targeting various framework versions:
The ZipXml for net6.0 is an obvious outlier, almost twice as slow as the other framework versions. All of these are running x64, benchmark.net reporting
X64 RyuJIT
. I specifically call outXmlTextReaderImpl
because that is the concrete implementation ofRead
that is being called on net6.0. No idea if that's true for other frameworks or not.I have .NET 6.0 version
6.0.100-rc.1.21458.32
currently installed.Hats off to net4.8, way to go champ.
area-System.Xml