Startup cost of XamlServices parsing is 27% slower on .NETCore than on .NET Framework

dotMorten commented 5 years ago

.NET Core Version: 3.0 Preview1
Windows version: 1803

Does the bug reproduce also in WPF for .NET Framework 4.8?: No

Problem description: I expected/hoped that with the XmlReader hopefully being Span<T> based. the XAML parser would be faster in .NET Core than the .NET Framework equivalent. However that's not what I'm seeing with this benchmark:

public class BenchmarkTests
{
    static readonly string xamlString = @"<MyObject xmlns=""clr-namespace:XamlBenchmark;assembly=XamlBenchmark"" StringProperty=""Hello World"" Int32Property=""1234"" DoubleProperty=""123.4567890"" FloatProperty=""-0.9876"" />";

    [Benchmark]
    public object Test1()
    {
        object instance = System.Xaml.XamlServices.Parse(xamlString);
        return instance;
    }
}
public class MyObject
{
    public string StringProperty { get; set; }
    public int Int32Property { get; set; }
    public double DoubleProperty { get; set; }
    public float FloatProperty { get; set; }
}

Here are the results:


BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.1 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1620 v3 3.50GHz, 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3190.0
  Job-FQKGZY : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3190.0
  Job-YTBJBT : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT

Method	Runtime	Toolchain	Mean	Error	StdDev	Ratio	RatioSD
Test1	Clr	net472	363.3 us	7.139 us	7.639 us	1.00	0.00
Test1	Core	netcoreapp3.0	461.6 us	9.078 us	8.048 us	1.27	0.03

fabiant3-zz commented 5 years ago

Thanks @dotMorten. This is likely to be expected, as Core is not currently ngened. We'll address before Core3 ships.

karelz commented 5 years ago

cc @brianrob @billwert

karelz commented 5 years ago

Baiting @benaadams @AndreyAkinshin if they have interest to look for low-hanging fruit here 😉

dotMorten commented 5 years ago

Also looking at allocations by running this parse method many many times, I see this hit over 40 times per parsing:

And here's some of the CPU analysis:

0xd4d commented 5 years ago

Loading the assemblies' custom attributes takes a lot of time. The more assemblies that are loaded, the slower it will take. If you create a simple hello world app that tests this, .NET Framework loads 5 assemblies but .NET Core 3.0 preview loads 23 assemblies.

There's a XamlSchemaContext that could perhaps be reused since it's the class that loads all assemblies.

EDIT: If a XamlSchemaContext is used, .NET Core 3.0 preview is faster than .NET Framework 4.7.2 (tested 64-bit)

richlander commented 5 years ago

I love seeing this kind of issue. Great to see the focus on fundamentals.

danmoseley commented 5 years ago

Re GetCustomAttributes, it was improved by @nickcraver in https://github.com/dotnet/coreclr/pull/20779 but that change would already be in preview 1.

0xd4d commented 5 years ago

@danmosemsft It's also 24 loaded assemblies (.NET Core 3.0 latest daily build) vs 5 loaded assemblies (.NET Framework 4.7.2).

If you patch System.Xaml to only check the first 5 returned assemblies (.NET Framework only loads 5 assemblies), then .NET Core 3.0 code executes faster.

So it seems the 5x number of assemblies is causing the slowdown. It has to execute more code.

Original .NET Core 3.0 daily build:

Name	Exc %	Exc	Inc %	Inc
module system.xaml <<system.xaml!XamlServices.Parse>>	24.0	1,073	91.6	4,090
module system.private.corelib <<system.private.corelib!Attribute.GetCustomAttributes>>	14.8	662	36.3	1,619
module coreclr <<coreclr!JIT_New>>	10.3	462	10.6	472
module clrjit <<clrjit!CILJit::compileMethod>>	5.4	239	5.4	243
module coreclr <<coreclr!JIT_NewArr1>>	5.4	239	5.7	253
module coreclr <<coreclr!ModuleHandle::ResolveType>>	5.2	234	5.2	234
module coreclr <<coreclr!MetaDataImport::GetCustomAttributeProps>>	5.2	231	5.6	248
module system.private.corelib <<system.private.corelib!CustomAttributeData.GetCustomAttributesInternal>>	4.9	219	22.9	1,023
module coreclr <<coreclr!Attribute::ParseAttributeArguments>>	4.1	183	5.4	241
module coreclr <<coreclr!MetaDataImport::Enum>>	3.4	152	4.0	178
module ntdll <<ntdll!LdrpDispatchUserCallTarget>>	2.6	116	2.6	116
module coreclr <<coreclr!MetaDataImport::GetParentToken>>	2.6	116	2.8	127

brianrob commented 5 years ago

Finally had a chance to look into this. Really great to see folks engaged here.

I was able to reproduce the difference and captured traces of 10K iterations of the parse operation to compare.

As was mentioned above, the biggest issue that I see is significantly higher cost in and caused by Attribute.GetCustomAttributes. The higher cost shows up in both CPU usage in Attribute.GetCustomAttributes and in allocation helpers that are zeroing memory to support increased allocations in this path.

The biggest CPU usage differences are:

Attribute.GetCustomAttributes: 560ms on Core vs. 265ms on Desktop = +111%
Allocation helpers (JIT_New and JIT_NewArr1): 907ms on Core vs. 597ms on Desktop = +52%

Breaking down the allocations that contribute to the additional allocation helper CPU (as well as memory cost), these are the top 3 allocators:

System.Reflection.CustomAttributeRecord[]
System.Xaml.MS.Impl.XmlNsInfo
Node[System.Reflection.Assembly.System.Xaml.MS.Impl.XmlNsInfo]

The first step here should be for this to be investigated from the reflection angle. @joshfree, can you get eyes on this from a reflection expert? I can share traces offline and help ramp someone up on the data.

NickCraver commented 5 years ago

Having taken some steps here already, it definitely seems like we can eliminate CustomAttributeRecord[] in some way. These are allocated only to be iterated in the caller in all cases, but the way they're populated is non-trivial to change. I have a branch locally that tried to eliminate these but it creates some very odd null refs in ways I didn't readily understand (but someone else may quickly see why I'm an idiot there).

If you're curious, this is where they're generated for a great many attribute code paths.

0xd4d commented 5 years ago

@brianrob

Attribute.GetCustomAttributes: 560ms on Core vs. 265ms on Desktop = +111%

This is probably because .NET Core loads 24 assemblies but .NET Framework only loads 5 assemblies. The code gets custom attributes of all loaded assemblies.

brianrob commented 5 years ago

@0xd4d, you're right - the iteration over all assemblies happens here: https://github.com/dotnet/wpf/blob/08d1133388dac9f033e16bb7309d0d2d409845a2/src/Microsoft.DotNet.Wpf/src/System.Xaml/System/Xaml/XamlSchemaContext.cs#L1125

As @NickCraver points out, there is possibly some optimization that can happen in the reflection path, but it is also true that there may need to be some work done in Xaml parsing to either limit the number of assemblies that get processed, or cache the results. Without a clear understanding of the WPF technology, it's not clear to me what should be done here. @fabiant3, are you the right person to look at this?

karelz commented 5 years ago

cc @steveharter for Reflection

benaadams commented 5 years ago

18k calls from Parse -> XamlXmlReader.Initalize -> NodeStreamSorter.ctor -> StartNewNodeStreamWithSettingsPreamble does become 14M calls to FilterCustomAttributeRecord and that's 44% of the time for Parsing.

benaadams commented 5 years ago

or cache the results.

XamlSchemaContext has 10 ConcurrentDictionarys (instance rather than static) so I wonder if it had that in mind?

The main cost for Parsing (64%) happens under System.Xaml.XamlSchemaContext.TryGetCompatibleXamlNamespace(String, out String) where its still the same number of calls as Parse (18k in this case) and then begins to fan out.

benaadams commented 5 years ago

Would it be valid to make those ConcurrentDictionaries static?

Sounds like the context is quite flexible on usage:

/// SchemaContexts can potentially be shared between multiple callers in an AppDomain, including
/// both full and partial trust callers. To be safe for sharing, the default implementation should
/// be idempotent and order-independent--i.e. functionally immutable.

Though I don't know if lookup should be unique in some way (e.g. different Xaml files having explicitly different contexts)

NickCraver commented 5 years ago

There are multiple things we can yet do in reflection and I'd like to get to them, but I second @benaadams's approach here on an actual solution that would make the differences really desired here. The problem with reflection performance optimizations is they hit a hard wall with contracts: they must return a new attribute set on each call - we cannot return the same instances (this would break all sorts of existing code).

The only way we can cache instances (after deciding it's valid to do so) is well above the layer those allocations happen at, e.g. in a lookup dictionary such as those in play here.

NickCraver commented 5 years ago

I just wanted to add here: in doing some benchmark down we discovered that while my optimizations went into 3.0 Preview 1, some steps backwards also landed in there. Details are in https://github.com/dotnet/coreclr/issues/21456 and a PR to fix that regression is in https://github.com/dotnet/coreclr/pull/21462. That should help out here, but only marginally compared to not actually calling this path more than once as proposed above.

joshfree commented 5 years ago

@steveharter can you assist in the investigation from the coreclr/Reflection side of the house?

steveharter commented 5 years ago

Sure I'll take a look asap at the reflection side

SamBent commented 5 years ago

Moving this to 3.1. The discussion has targeted the initialization of XamlSchemaContext within XamlServices.Parse (et al.), which examines more assemblies than it did in NETFx. I don't think it's critical to fix this for 3.0, because

it only affects parse/load of loose XAML
it's an initialization expense, whose relative effect diminishes as the string being parsed grows longer or more complex
it's already possible for an app to mitigate this

In short, the effect will be felt in practice only by apps that load lots of small XAML snippets. This isn't the mainstream case, and such apps have a workaround.

Details supporting these claims.

All apps load BAML, which uses a built-in SchemaContext. But relatively few apps load loose XAML via explicit calls to XamlServices.Parse et al. At least that's my impression from looking at apps that have come my way (largely through bug reports) - we don't have telemetry for this.

The 27% figure comes from parsing a very simple string (one XAML tag); a real app typically parses longer strings, and will see less of a difference. I tried strings with a few hundred tags, and the difference was not worth noting.

Most of the XamlServices methods call new XamlSchemaContext(), but an app can avoid the initialization expense by using the one method that doesn't. Replace XamlServices.Parse(mystring) with

            StringReader stringReader = new StringReader(mystring);
            using (XmlReader xmlReader = XmlReader.Create(stringReader))
            {
                XamlXmlReader xamlReader = new XamlXmlReader(xmlReader, CachedSchemaContext);
                return XamlServices.Load(xamlReader);
            }

where CachedSchemaContext is a XamlSchemaContext that the app creates once and re-uses every time it calls Parse. (Similarly for XamlServices.Load(stream) et al.)

An app that does this is responsible for releasing the cached context when it's no longer needed, if that's relevant to the overall behavior. This is probably not an issue, except perhaps for apps with complex interaction with AppDomain, or similar advanced architecture.

I've tried building this idea into XamlServices itself. The early results are good, but it needs more investigation. For example, XamlServices can't know when to release the cached context, and I don't fully understand the consequences of keeping it alive "forever".

weltkante commented 5 years ago

and I don't fully understand the consequences of keeping it alive "forever"

If thats the chosen path the cache might need weak references (making it even more expensive) to not block unloadable assemblies "forever", otherwise it would be a memory leak with apps having reloadable plugin architectures or are emitting generated code in collectible assemblies at runtime. We are doing the latter on desktop, I haven't used AssemblyLoadContext in Core yet so I don't know entirely the implications, but its something to keep in mind.

SamBent commented 4 years ago

Changed the title to reflect the real issue.

bugproof commented 4 years ago

executing HTTP requests in WPF command is much slower on .NET Core idk what causes this but it works fine in .NET Framework. (I'm using refit btw)

not sure if it's WPF problem or .NET Core problem or refit problem... https://github.com/dotnet/corefx/issues/23401

EDIT:

SOLVED. See this comment https://github.com/dotnet/corefx/issues/23401#issuecomment-561671803

SamBent commented 4 years ago

For reference: @bugproof comment is unrelated to this issue or to WPF. (.NET Core implements HTTPRequest differently from .NET Framework.)

rollsch commented 4 years ago

Just FYI it is issues like this that have caused us to stop our port to netcore. I had hoped performance would be better. I hope these issues continue to see activity and improvement.

SamBent commented 4 years ago

@rolandh - We determined that this issue only affects apps that load lots of small XAML snippets using XamlServices.Parse. Is that your case? (We believe it's not all that common, but we don't have any hard evidence to either support or refute that belief.)

Please continue to open issues, or upvote existing issues, about specific perf concerns. Comments about "issues like this" are not really actionable, as no one knows what you're referring to specifically.

rollsch commented 4 years ago

I haven't experienced any issues as we haven't ported to net core yet. I was just saying that seeing multiple issues like this has meant that we won't be considering porting ove from net 4.7 in the short term. I've seen quite a few benchmarks showing slow downs, I haven't seen any showing performance increases in WPF yet, when I do I'll look at porting. I thought that might be good information for MS on how to prioritise issues and development.

I know its not an actionable comment for you, but it might be for someone higher up who wants to see a higher percentage off people port their applications.

Lxiamail commented 4 years ago

I was just saying that seeing multiple issues like this has meant that we won't be considering porting ove from net 4.7 in the short term. @rolandh can you articulate the multiple issues that you have been and concerned you moving to .NET Core besides this one? That will help us to priorities perf investigation.

rollsch commented 4 years ago

@Lxiamail

Pretty much every major library's first iteration to netcore has had slowdowns. Sometimes dramatic, eg 10x slower with entity frameworks 2.1 release.

I see almost nothing showing performance benefits, only contrived small applications seem to benefit. Everything that is a large in production library seems to go backwards and introduce bugs.

There is no incentive for us to upgrade after seeing issues like this repeatedly occur after porting to netcore.

Here is a list of relevant cases I've seen when researching whether netcore is stable enough to upgrade yet.

https://github.com/dotnet/wpf/issues/1549 https://github.com/dotnet/wpf/issues/1748

https://supportcenter.devexpress.com/ticket/details/T824984/wpf-core-3-0-performance

This was nopcommerce after upgrading to netcore https://www.nopcommerce.com/en/boards/topic/54986/nopcommerce-410-high-memory-usage-and-how-weve-handled-it

EF Core large slow downs vs the older entity framework https://github.com/nopSolutions/nopCommerce/issues/3419

philipag commented 4 years ago

Just my 2 cents: I also ported a large WPF+EntityFramework app to core 3.0 and found that load times increased from <4 seconds for .Net Framework to 5+ seconds with .Net Core 3.0. This is with creating native images for both in release and debug builds. Runtime performance seemed largely unchanged so I abandoned this porting effort and will try this again with .Net 5 when it is released. There are numerous reports of slow-downs like in the post above and also this one: https://github.com/dotnet/wpf/issues/94

My impression is that some work is required to address all these already reported WPF + .Net Core performance issues in order to make it worthwhile to perform this migration for more complex applications with lots of assemblies and XAML resources.

Cheers Philip

danmoseley commented 4 years ago

@ajcvickers re EF feedback above.

laggage commented 4 years ago

I seemed to encounter the performance problem too; The same wpf code target .net core 3.1 and another .net 4.8, the latter memory usage is about 20MB while the former is 32MB+ almost 60MB at the startup of program; Read a same file as hex and show on a TextBox, the memory usage is above, still, net core cost more memory than net48 ; I felt confused

benaadams commented 4 years ago

@laggage your .NET 48 is running as 32-bit and .NET Core is running as 64-bit

Does adding <RuntimeIdentifier>win-x86</RuntimeIdentifier> to your .csproj help? (moving it to 32-bit)

laggage commented 4 years ago

@benaadams I tried your suggestion, but still, core wpf just cost more memory than net4.x wpf;

philipag commented 4 years ago

@SamBent This issue was milestoned to core 5.0 but I don't see any updates even though 5.0 is being released. Is there reason to believe these performance regressions were fixed and the people on this forum should evaluate this again in their applications.

SamBent commented 4 years ago

The original issue - startup cost of XamlServices.Parse - is not yet addressed in 5.0. It only comes up if the app explicitly asks for it, so it's not a factor in load times of most apps.

A number of other issues have been mentioned here, most of them outside WPF's scope. You can follow the links to check on the status with the appropriate owners.

philipag commented 4 years ago

@SamBent In my case the problems may be DataTemplate instantiation via XamlReader.Parse which I assume internally calls XamlServices.Parse.

Will this issue be assigned to a new milestone or will WPF apps which make using of XAML parsing see permanent load time degradation if ported to .Net Core?

ekalchev commented 2 years ago

This was milestoned to 7.0 but the startup time of WPF apps running .net7.0 are still very slow compared to .net 4.8

TimexPeachtree commented 9 months ago

Pls is there any updates on this from .net 8

rollsch commented 2 months ago

This is a very old thread but can someone give any benchmarks for net 8 vs net 4.7/4.8 ?

ekalchev commented 2 months ago

This is a very old thread but can someone give any benchmarks for net 8 vs net 4.7/4.8 ?

The issue is still present. We recently migrated our net48 product to .net8. We have automated benchmarking after each build. We can clearly see how almost everything at startup is performing better except the xaml parsing. This negates all performance gains of the other parts of the code where thing are performing better. We end up with around 10% slower startup times, because wpf performance issues. Everyone is very disappointed because .net 8 promised performance but I hasn't been delivered in wpf apps.

rollsch commented 2 months ago

Thank you for saving us an immense amount of effort for little gain.

ThomasGoulet73 commented 2 months ago

I ran the benchmarks in the original post locally but I now compare .Net Framework 4.8 vs .Net 8.0 and here are my results:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
AMD Ryzen 7 7745HX with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.7.24407.12
  [Host]     : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XTDWJG : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-ELJHHU : .NET Framework 4.8.1 (4.8.9261.0), X64 RyuJIT VectorSize=256

Method	Runtime	Mean	Error	StdDev
Test1	.NET 8.0	150.6 us	3.01 us	6.08 us
Test1	.NET Framework 4.8	183.2 us	3.31 us	4.54 us

It looks like .Net 8.0 is faster than .Net Framework 4.8 for this specific benchmark on my machine. If others encounter any performance regression from .Net Framework, it would be great if they could share them and ideally include a repro.

dotnet / wpf

Startup cost of XamlServices parsing is 27% slower on .NETCore than on .NET Framework #94