Proposal: SyntaxNode.ToFullStringWithNormalizedWhitespaces API

Sergio0694 commented 3 years ago

Overview

A common pattern in source generators (from what I can tell) for people using Roslyn APIs to generate trees is as follows:

Build a compilation unit with some members (or, some syntax tree in general)
Call NormalizeWhitespace()
Call ToFullString()
Call context.AddSource("Foo.cs", SourceText.From(source, Encoding.UTF8)); with the resulting string

This is extremely inefficient, specifically at points (2) and (3). The issue is that NormalizeWhitespace is quite expensive, and in this case it's also completely unnecessary: we don't want a new tree per se, we just need the final string to have the code from the initial tree with normalized whitespaces. The problem is that that NormalizeWhitespace ends up taking the majority of the CPU time in this whole workflow, and also allocating unnecessary memory. The allocations aren't necessarily a problem, but the increased CPU time very much is, considering the performance critical nature of source generators to keep a smooth IDE experience.

I've put together a benchmark to show how much NormalizeWhitespace can impact performance:

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
CreateNormalizeToString	916.2 us	9.12 us	8.53 us	1.00	32.2266	8.7891	1.9531	184.57 KB
CreateToString	178.4 us	0.41 us	0.34 us	0.19	19.2871	4.6387	-	79.21 KB

Full benchmark source code (click to expand):

```csharp using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Running; using Microsoft.CodeAnalysis; using Microsoft.CodeAnalysis.CSharp; using Microsoft.CodeAnalysis.CSharp.Syntax; using static Microsoft.CodeAnalysis.CSharp.SyntaxFactory; BenchmarkRunner.Run(); [MemoryDiagnoser] public class NormalizeWhitespaceBenchmark { [Benchmark(Baseline = true)] public string CreateNormalizeToString() { return GetDummyTree().NormalizeWhitespace().ToFullString(); } [Benchmark] public string CreateToString() { return GetDummyTree().ToFullString(); } private static CompilationUnitSyntax GetDummyTree() { // // Licensed to the .NET Foundation under one or more agreements. // // The .NET Foundation licenses this file to you under the MIT license. // // See the LICENSE file in the project root for more information. // // #pragma warning disable // // namespace Microsoft.Toolkit.Mvvm.Messaging.__Internals // { // [global::System.CodeDom.Compiler.GeneratedCode("...", "...")] // [global::System.Diagnostics.DebuggerNonUserCode] // [global::System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage] // [global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)] // [global::System.Obsolete("This type is not intended to be used directly by user code")] // internal static partial class __IMessengerExtensions // { // [global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)] // [global::System.Obsolete("This method is not intended to be called directly by user code")] // public static global::System.Action CreateAllMessagesRegistrator( _) // { // static void RegisterAll(IMessenger messenger, object obj) // { // var recipient = ()obj; // messenger.Register<>(recipient); // messenger.Register<>(recipient); // messenger.Register<>(recipient); // } // // return RegisterAll; // } // // [global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)] // [global::System.Obsolete("This method is not intended to be called directly by user code")] // public static global::System.Action CreateAllMessagesRegistratorWithToken( _) // where TToken : global::System.IEquatable // { // static void RegisterAll(IMessenger messenger, object obj, TToken token) // { // var recipient = ()obj; // messenger.Register<, TToken>(recipient, token); // messenger.Register<, TToken>(recipient, token); // messenger.Register<, TToken>(recipient, token); // } // // return RegisterAll; // } // } // } return CompilationUnit().AddMembers( NamespaceDeclaration(IdentifierName("Microsoft.Toolkit.Mvvm.Messaging.__Internals")).WithLeadingTrivia(TriviaList( Comment("// Licensed to the .NET Foundation under one or more agreements."), Comment("// The .NET Foundation licenses this file to you under the MIT license."), Comment("// See the LICENSE file in the project root for more information."), Trivia(PragmaWarningDirectiveTrivia(Token(SyntaxKind.DisableKeyword), true)))).AddMembers( ClassDeclaration("__IMessengerExtensions").AddModifiers( Token(SyntaxKind.InternalKeyword), Token(SyntaxKind.StaticKeyword), Token(SyntaxKind.PartialKeyword)).AddAttributeLists( AttributeList(SingletonSeparatedList( Attribute(IdentifierName($"global::System.CodeDom.Compiler.GeneratedCode")) .AddArgumentListArguments( AttributeArgument(LiteralExpression(SyntaxKind.StringLiteralExpression, Literal("Microsoft.Toolkit.Mvvm"))), AttributeArgument(LiteralExpression(SyntaxKind.StringLiteralExpression, Literal("7.0.0.0")))))), AttributeList(SingletonSeparatedList(Attribute(IdentifierName("global::System.Diagnostics.DebuggerNonUserCode")))), AttributeList(SingletonSeparatedList(Attribute(IdentifierName("global::System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage")))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.ComponentModel.EditorBrowsable")).AddArgumentListArguments( AttributeArgument(ParseExpression("global::System.ComponentModel.EditorBrowsableState.Never"))))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.Obsolete")).AddArgumentListArguments( AttributeArgument(LiteralExpression( SyntaxKind.StringLiteralExpression, Literal("This type is not intended to be used directly by user code"))))))).AddMembers( MethodDeclaration( GenericName("global::System.Action").AddTypeArgumentListArguments( IdentifierName("IMessenger"), PredefinedType(Token(SyntaxKind.ObjectKeyword))), Identifier("CreateAllMessagesRegistrator")).AddAttributeLists( AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.ComponentModel.EditorBrowsable")).AddArgumentListArguments( AttributeArgument(ParseExpression("global::System.ComponentModel.EditorBrowsableState.Never"))))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.Obsolete")).AddArgumentListArguments( AttributeArgument(LiteralExpression( SyntaxKind.StringLiteralExpression, Literal("This method is not intended to be called directly by user code"))))))).AddModifiers( Token(SyntaxKind.PublicKeyword), Token(SyntaxKind.StaticKeyword)).AddParameterListParameters( Parameter(Identifier("_")).WithType(IdentifierName("global::MyApp.MyViewModel"))) .WithBody(Block( LocalFunctionStatement( PredefinedType(Token(SyntaxKind.VoidKeyword)), Identifier("RegisterAll")) .AddModifiers(Token(SyntaxKind.StaticKeyword)) .AddParameterListParameters( Parameter(Identifier("messenger")).WithType(IdentifierName("IMessenger")), Parameter(Identifier("obj")).WithType(PredefinedType(Token(SyntaxKind.ObjectKeyword)))) .WithBody(Block( LocalDeclarationStatement( VariableDeclaration(IdentifierName("var")) .AddVariables( VariableDeclarator(Identifier("recipient")) .WithInitializer(EqualsValueClause( CastExpression( IdentifierName("global::MyApp.MyViewModel"), IdentifierName("obj"))))))) .AddStatements( ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName("global::MyApp.MessageA")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName("global::MyApp.MessageA")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName("global::MyApp.MessageA")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")))))), ReturnStatement(IdentifierName("RegisterAll")))), MethodDeclaration( GenericName("global::System.Action").AddTypeArgumentListArguments( IdentifierName("IMessenger"), PredefinedType(Token(SyntaxKind.ObjectKeyword)), IdentifierName("TToken")), Identifier("CreateAllMessagesRegistratorWithToken")).AddAttributeLists( AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.ComponentModel.EditorBrowsable")).AddArgumentListArguments( AttributeArgument(ParseExpression("global::System.ComponentModel.EditorBrowsableState.Never"))))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.Obsolete")).AddArgumentListArguments( AttributeArgument(LiteralExpression( SyntaxKind.StringLiteralExpression, Literal("This method is not intended to be called directly by user code"))))))).AddModifiers( Token(SyntaxKind.PublicKeyword), Token(SyntaxKind.StaticKeyword)).AddParameterListParameters( Parameter(Identifier("_")).WithType(IdentifierName("global::MyApp.MyViewModel"))) .AddTypeParameterListParameters(TypeParameter("TToken")) .AddConstraintClauses( TypeParameterConstraintClause("TToken") .AddConstraints(TypeConstraint(GenericName("global::System.IEquatable").AddTypeArgumentListArguments(IdentifierName("TToken"))))) .WithBody(Block( LocalFunctionStatement( PredefinedType(Token(SyntaxKind.VoidKeyword)), Identifier("RegisterAll")) .AddModifiers(Token(SyntaxKind.StaticKeyword)) .AddParameterListParameters( Parameter(Identifier("messenger")).WithType(IdentifierName("IMessenger")), Parameter(Identifier("obj")).WithType(PredefinedType(Token(SyntaxKind.ObjectKeyword))), Parameter(Identifier("token")).WithType(IdentifierName("TToken"))) .WithBody(Block( LocalDeclarationStatement( VariableDeclaration(IdentifierName("var")) .AddVariables( VariableDeclarator(Identifier("recipient")) .WithInitializer(EqualsValueClause( CastExpression( IdentifierName("global::MyApp.MyViewModel"), IdentifierName("obj"))))))) .AddStatements( ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName("global::MyApp.MessageA"), IdentifierName("TToken")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")), Argument(IdentifierName("token")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName("global::MyApp.MessageA"), IdentifierName("TToken")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")), Argument(IdentifierName("token")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName("global::MyApp.MessageA"), IdentifierName("TToken")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")), Argument(IdentifierName("token")))))), ReturnStatement(IdentifierName("RegisterAll"))))))); } } ```

You can see how calling NormalizeWhitespace alone increases the total CPU time by 5x in this sample benchmark.

Proposal

The simplest API proposal to address this issue would be to add a ToFullStringWithNormalizedWhitespaces API:

 namespace Microsoft.CodeAnalysis
 {
     public abstract class SyntaxNode
     {
+        public virtual string ToFullStringWithNormalizedWhitespaces();
     }
 }

This would produce a string equivalent to calling NormalizeWhitespaces().ToFullString(), but in an optimized manner, ie. without actually duplicating the whole tree and creating new nodes, but just adding the right whitespaces to the output as needed while traversing the tree. This would require very little code changes by consumers currently doing NormalizeWhitespaces().ToFullString(), but would greatly improve the performance of their source generators. I could also see some specific APIs to directly create a SourceText instance from a tree, with normalized whitespaces, which could then also skip creating the string entirely, but considering that CPU time is the biggest problem here and not the actual memory allocations, that seems to be less of a priority from what I can see. Ideally, this new ToFullStringWithNormalizedWhitespaces() API would have a performance as close to just ToFullString() as possible.

CyrusNajmabadi commented 3 years ago

The issue is that NormalizeWhitespace is quite expensive

I'd be interested to see if we could just fix that. Then everyone benefits. If not, then maybe specialized apis are needed.

Sergio0694 commented 2 years ago

I see this has been added to the 17.1 milestone but the API still needs some work. If Roslyn proposals work the same as runtime ones I assume that means we should just discuss that here and settle on an API shape that can then go to LDM review?

@CyrusNajmabadi is the blocker for you just the fact that you'd like to first investigate whether NormalizeWhitespace can be sped up enough yet to avoid the need for this API? If so, have you had a chance to look into this so far? Additionally, even if that was possible (certainly making the syntax tree traaversal faster wouldn't hurt in general), do you think that this specific API wouldn't provide enough value on its own regardless, given that it would likely still be faster and more memory efficient due to not having to rebuild and allocate a whole deep copy of the syntax tree being rendered? 🤔

Just trying to get the conversation going here as I would love for this to make it into 17.1, is all 😄

CyrusNajmabadi commented 2 years ago

is the blocker for you just the fact that you'd like to first investigate whether NormalizeWhitespace can be sped up enough yet to avoid the need for this API

Yes. Though I'm not volunteering for this :-)

sharwell commented 2 years ago

... but would greatly improve the performance of their source generators ...

I haven't seen much evidence that this is true once you combine it with:

Proper incrementality
Evaluation relative to other sources of overhead

In other words, while NormalizeWhitespace has been measurably significant at times, I've found that a source generator deemed observably slow by users achieved that characteristic through other means. Even in the case of CsWin32 (the source generator with the most pronounced impact from NormalizeWhitespace), it would not have been fixed by this proposal and it has already mitigated the problem through other means.

ghost commented 2 years ago

Closing this issue as we've seen no reply to the request for more information. If you are able to get the requested information, please add it to the issue and we will retriage it.

Sergio0694 commented 2 years ago

Uh... This feels like it wasn't resolved yet, should we reopen it? 😅

333fred commented 2 years ago

@CyrusNajmabadi can you clarify what more info you were asking for?

CyrusNajmabadi commented 2 years ago

Basically sam's point from above. Namely that this is actually an issue in practice when IGs are used properly. I feel like we need more information to indicate why this is the right place to be focusing API attention.

Eli-Black-Work commented 2 years ago

I remember another issue where it was discussed that ideally, Roslyn/VS would automatically call NormalizeWhitespace().ToFullString() on all generated code, so that exceptions thrown in genertated code would have stack traces with reasonable and consistent line numbers.

Is that still on the table? If so, maybe it would make sense to implement that and then get telemetry from VS to see if this current proposal makes sense 🙂

CyrusNajmabadi commented 2 years ago

would automatically call

It is absolutely not in the table. C# is definitely whitespace sensitive, and doing this could def break code. We should not automatically do anything. Of you want normalized code, and that's ok for your domain, then just normalize it first.

Eli-Black-Work commented 2 years ago

I'm referencing https://github.com/dotnet/roslyn/issues/49685#issuecomment-738078844

@AArnott Would you mind weighing in on this? 🙂

CyrusNajmabadi commented 1 year ago

I have no problem with people making normalize whitespace fast. Seems like it people need better performance there then we should just get some prs out to improve it, versus creating whole new apis.

As I've mentioned in the past, normalize was never something anyone spent time optimizing. So it's likely it can just be sped up if someone was interested in looking into that.

AArnott commented 1 year ago

I disagree with @sharwell that CsWin32's whitespace problem or perf problems are solved. We avoid NormalizeWhitespace completely now because it is unacceptably slow. Instead, we have our own FastSyntaxFactory class and use the Banned API analyzer to forbid use of SyntaxFactor and a bunch of overloads on various syntax classes because they use elastic trivia which kicks in some automatic normalizing (I guess... my memory is fuzzy on this as we did it long ago). In addition to trying to get some reasonable whitespace (e.g. a space after a type name so it doesn't 'collide' with the next node or token) during generation itself, we have our custom whitespace 'visitor' class that fixes indentation and some other whitespace concerns. The result is ugly, but readable. It also isn't fast. It's much faster than NormalizeWhitespace, but I just fixed a perf bug this morning where we spent 20 seconds adding space between members of a type: we had a loop where we changed a SyntaxList within the loop instead of converting it to a mutable List first and then recreating the SyntaxList at the end.

The lengths we go to in order to have acceptable perf is far in excess of what I would consider reasonable, and the resulting generated text is sub-par at best.

I don't care if it's a new API or fixing an old one, but we should not have to go to lengths to get good perf.

To Sam's comment about incremental SGs, csWin32 doesn't yet use any incremental APIs. CsWin32 is fundamentally a non-incremental thing for the most part because one character typed in any code file could potentially change generated code. I expect we can find specific syntax sub-trees that can be cached and reused, but it will be very expensive and we haven't prioritized that yet, since our UX is ok for now. My primary concern about perf right now is for our automated testing (which resembles real build time for customers), where just one run of our 'full generation' test is now taking 12 minutes on a really fast machine. We have to run it under 12 different configurations. So needless to say, I'm very interested in improving perf in the non-incremental case.

CyrusNajmabadi commented 1 year ago

@AArnott i agree with thsi:

The lengths we go to in order to have acceptable perf is far in excess of what I would consider reasonable

Currently, teh generation part of SG has zero funding. I've proposed options in the past about providing improvements and APIs here, but we haven't been able to fit them into the schedule at all because of all the other pressing matters. Generally speaking, i think this is because SGs are considered a highly advanced scenario, for niche areas. So the feeling is that the people using it are advanced enough to roll solutions, without us having to provide that ourselves. I'll push again for us to get some budget to improve things here. But i'm not certain anything will have changed in this area.

jeff-simeon commented 1 year ago

Would love to see a solution to this. 40% of our build time is spent on calls to NormalizeWhitespace (full disclosure - we are generating thousands of classes at build time, using incremental source generators).

Briefly playing with removing calls to NormalizeWhitespace and using trivia manually, then extrapolating the gains, we could see a reduction in build time from 3 minutes (on an MacBook M1 Max) to just over 1.5 minutes. Reduced memory usage was also observed (1GB vs 2GB). Unfortunately, manually generating trivia in EVERY syntax tree would be a huge pain and I'm not prepared to introduce all of that extra code to the generation process.

CyrusNajmabadi commented 1 year ago

@jeff-simeon can you jsut emit the strings directly? no point in SGs in using nodes, as we're going to have to just parse things anyways.

jeff-simeon commented 1 year ago

I suppose we could - though it would make the code a lot harder to read, and it's relatively complex already.

Sergio0694 commented 1 year ago

This was also briefly mentioned on Discord as well — in many scenarios, using syntax nodes makes the code for generators much easier to understand, follow and maintain, especially when the generators get more complicated. This applies to several generators in the .NET SDK as well (cc. @jkoritzinsky).

Imho it'd be nice to have a more optimized version of this, on second thought a WriteToWithNormalizedWhitespace would be even better, actually (or maybe both, since once uses the other). The entire tree duplication that NormalizeWhitespace does is just completely unnecessary work.

jeff-simeon commented 1 year ago

@Sergio0694 - I agree and this is exactly the scenario we have - which is probably why @AArnott wrote a FastSyntaxFactory which wraps a StringBuilder and avoids the nodes.

CyrusNajmabadi commented 1 year ago

That seems reasonable. Can you just use that factory instead?

jeff-simeon commented 1 year ago

@CyrusNajmabadi - I'm not sure what you're talking about. I'm referring to the previous post by @AArnott, where he mentions something that his team built. I do not have access to this and searching for it on the internet yields no results. I'm merely speculating as to what he did to work around the problem.

CyrusNajmabadi commented 1 year ago

though it would make the code a lot harder to read, and it's relatively complex already.

I don't think ease of readability is a core goal of SGs. Basically, every time this has come up on the team (and i've brought it up a few times), the thinking is that this is all for computers to do, and writing fast string-writing code is acceptable.

If someone wants to make their own api to help here, they can, but it's not an investment area. I proposed options in the past to make things better, but it's all be rejected as there is no capacity on the team to take on new APIs to help out effectively expert scenarios.

CyrusNajmabadi commented 1 year ago

I do not have access to this and searching for it on the internet yields no results. I'm merely speculating as to what he did to work around the problem.

It's on github:

https://github.com/microsoft/CsWin32/blob/dd815b2b9b4a16f9411baa1f2fa81863c0725fca/src/Microsoft.Windows.CsWin32/FastSyntaxFactory.cs#L13

You can fork/copy this. Note: this doesn't write to strings, it just avoids elastic trivia, which can be expensive if not needed.

CyrusNajmabadi commented 1 year ago

Imho it'd be nice to have a more optimized version of this,

@Sergio0694 Nothing is stopping teams (like .net) that really want this from writing such a thing. Trees are just a simple data structure. And if it's hyper critical that trees be the form you use and you don't want to use trivia and you want to write it out to a string quickly, then writing your own abstraction to do that seems totally viable.

The cost assessment in the past was that this wasn't an area we were willing to invest in. It's hyperspecialized just to this narrow domain, and writing out strings is just something individual libraries can do themselves, or utilize a library that someone wants to invest in.

AArnott commented 1 year ago

@Sergio0694 - I agree and this is exactly the scenario we have - which is probably why @AArnott wrote a FastSyntaxFactory which wraps a StringBuilder and avoids the nodes.

My FastSyntaxFactory doesn't do that at all. It merely avoids elastic trivia (IIRC) by explicitly specifying all tokens when forwarding all calls to the regular SyntaxFactory.

jeff-simeon commented 1 year ago

though it would make the code a lot harder to read, and it's relatively complex already.

I don't think ease of readability is a core goal of SGs. Basically, every time this has come up on the team (and i've brought it up a few times), the thinking is that this is all for computers to do, and writing fast string-writing code is acceptable.

If someone wants to make their own api to help here, they can, but it's not an investment area. I proposed options in the past to make things better, but it's all be rejected as there is no capacity on the team to take on new APIs to help out effectively expert scenarios.

Ok, thanks for the reply, but it's a bummer :(

We will have to build our own syntax node model and wrap a StringBuilder at some point.

CyrusNajmabadi commented 1 year ago

I would avoid a syntax node model altogether. It's an unnecessary intermediary abstraction (and often expensive in memory to boot).

jeff-simeon commented 1 year ago

Using a StringBuilder directly would be a maintainability problem. We'll have to build our own model that wraps a StringBuilder....I suppose it will be just like a SyntaxNode model.......but have a fast ToString method on it.

At the end of the day, all abstractions are unnecessary and come at some performance cost. The complexity of what we're doing requires some abstraction from direct string manipulation. That will come at some performance cost. But preferably, not a cost that occupies 40% of our build time.

CyrusNajmabadi commented 1 year ago

Using a StringBuilder directly would be a maintainability problem.

I would be very surprised by that. Writing out code is usually very simple. Here's an example of a helper type i whipped up (similar to lots of stuff we've done before) to make it super easy: https://gist.github.com/CyrusNajmabadi/edf9f6059d8fd2ced7a2086a8835be1d

Just write to this, and tell it when you're increasing or decreasing indentation (and def write helpers/extensions to do it for common cases, like blocks). And it takes care of the rest.

We'll have to build our own model that wraps a StringBuilder

Sure. But it basically does nothing but pass through the data, and just handle indentation so that everything is not left-flushed. Almost no work/overhead necessary at runtime, and very simple to use on the consumer side.

At the end of the day, all abstractions are unnecessary and come at some performance cost.

Sure. But syntax-nodes are a huge abstraction cost. Like huge. IIRC, they're around 5 times the size of the strings they represent. And they require substantial manipulation to get them formatted.

They're really good for dealing with user code that needs to be understood/manipulated. They're really not necessary for a space where the end goal is generating code.

Sergio0694 commented 1 year ago

Changed the benchmark in this issue to use a custom indented string writer, here's the benchmark:

Method	Mean	Error	StdDev	Ratio	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
SyntaxTrees	586.810 us	10.3462 us	9.6778 us	1.000	30.2734	7.8125	0.9766	183.17 KB	1.00
IndentedWriter	1.268 us	0.0151 us	0.0134 us	0.002	1.1959	-	-	5.05 KB	0.03

I think @CyrusNajmabadi has successfully converted me over to the "just write to an indented writer" gang 😄

Benchmark code (click to expand):

```csharp using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Running; using ComputeSharp.SourceGeneration.Helpers; using Microsoft.CodeAnalysis; using Microsoft.CodeAnalysis.CSharp; using Microsoft.CodeAnalysis.CSharp.Syntax; using static Microsoft.CodeAnalysis.CSharp.SyntaxFactory; BenchmarkRunner.Run(); [MemoryDiagnoser] public class NormalizeWhitespaceBenchmark { private readonly string typeName = typeof(NormalizeWhitespaceBenchmark).FullName!; private readonly string typeVersion = typeof(NormalizeWhitespaceBenchmark).Assembly.GetName().Version!.ToString(); private readonly string recipientType = "global::MyViewModel"; private readonly string messageTypeA = "global::MessageA"; private readonly string messageTypeB = "global::MessageB"; private readonly string messageTypeC = "global::MessageC"; [Benchmark] public string IndentedWriter() { using IndentedTextWriter writer = IndentedTextWriter.Rent(); writer.WriteLine("// Licensed to the .NET Foundation under one or more agreements."); writer.WriteLine("// The .NET Foundation licenses this file to you under the MIT license."); writer.WriteLine("// See the LICENSE file in the project root for more information."); writer.WriteLine("#pragma warning disable"); writer.WriteLine("namespace Microsoft.Toolkit.Mvvm.Messaging.__Internals"); using (writer.WriteBlock()) { writer.WriteLine($"""[global::System.CodeDom.Compiler.GeneratedCode("{this.typeName}", "{this.typeVersion}")]"""); writer.WriteLine($"""[global::System.Diagnostics.DebuggerNonUserCode]"""); writer.WriteLine($"""[global::System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage]"""); writer.WriteLine($"""[global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)]"""); writer.WriteLine($"""[global::System.Obsolete("This type is not intended to be used directly by user code")]"""); writer.WriteLine($"""internal static partial class __IMessengerExtensions"""); using (writer.WriteBlock()) { writer.WriteLine($"""[global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)]"""); writer.WriteLine($"""[global::System.Obsolete("This type is not intended to be used directly by user code")]"""); writer.WriteLine($"""public static global::System.Action CreateAllMessagesRegistrator({this.recipientType} _)"""); using (writer.WriteBlock()) { writer.WriteLine($"""static void RegisterAll(IMessenger messenger, object obj)"""); using (writer.WriteBlock()) { writer.WriteLine($"""var recipient = ({this.recipientType})obj;"""); writer.WriteLine($"""messenger.Register<{this.messageTypeA}>(recipient);"""); writer.WriteLine($"""messenger.Register<{this.messageTypeB}>(recipient);"""); writer.WriteLine($"""messenger.Register<{this.messageTypeC}>(recipient);"""); } writer.WriteLine(); writer.WriteLine("return RegisterAll;"); } writer.WriteLine(); writer.WriteLine($"""[global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)]"""); writer.WriteLine($"""[global::System.Obsolete("This type is not intended to be used directly by user code")]"""); writer.WriteLine($"""public static global::System.Action CreateAllMessagesRegistratorWithToken({this.recipientType} _)"""); writer.WriteLine($""" where TToken : global::System.IEquatable"""); using (writer.WriteBlock()) { writer.WriteLine("static void RegisterAll(IMessenger messenger, object obj, TToken token)"); using (writer.WriteBlock()) { writer.WriteLine($"""var recipient = ({this.recipientType})obj;"""); writer.WriteLine($"""messenger.Register<{this.messageTypeA}, TToken>(recipient, token);"""); writer.WriteLine($"""messenger.Register<{this.messageTypeB}, TToken>(recipient, token);"""); writer.WriteLine($"""messenger.Register<{this.messageTypeC}, TToken>(recipient, token);"""); } writer.WriteLine(); writer.WriteLine("return RegisterAll;"); } } } return writer.ToString(); } [Benchmark(Baseline = true)] public string SyntaxTrees() { return GetDummyTree().NormalizeWhitespace(eol: "\n").ToFullString(); } private CompilationUnitSyntax GetDummyTree() { // // Licensed to the .NET Foundation under one or more agreements. // // The .NET Foundation licenses this file to you under the MIT license. // // See the LICENSE file in the project root for more information. // // #pragma warning disable // // namespace Microsoft.Toolkit.Mvvm.Messaging.__Internals // { // [global::System.CodeDom.Compiler.GeneratedCode("...", "...")] // [global::System.Diagnostics.DebuggerNonUserCode] // [global::System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage] // [global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)] // [global::System.Obsolete("This type is not intended to be used directly by user code")] // internal static partial class __IMessengerExtensions // { // [global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)] // [global::System.Obsolete("This method is not intended to be called directly by user code")] // public static global::System.Action CreateAllMessagesRegistrator( _) // { // static void RegisterAll(IMessenger messenger, object obj) // { // var recipient = ()obj; // messenger.Register<>(recipient); // messenger.Register<>(recipient); // messenger.Register<>(recipient); // } // // return RegisterAll; // } // // [global::System.ComponentModel.EditorBrowsable(global::System.ComponentModel.EditorBrowsableState.Never)] // [global::System.Obsolete("This method is not intended to be called directly by user code")] // public static global::System.Action CreateAllMessagesRegistratorWithToken( _) // where TToken : global::System.IEquatable // { // static void RegisterAll(IMessenger messenger, object obj, TToken token) // { // var recipient = ()obj; // messenger.Register<, TToken>(recipient, token); // messenger.Register<, TToken>(recipient, token); // messenger.Register<, TToken>(recipient, token); // } // // return RegisterAll; // } // } // } return CompilationUnit().AddMembers( NamespaceDeclaration(IdentifierName("Microsoft.Toolkit.Mvvm.Messaging.__Internals")).WithLeadingTrivia(TriviaList( Comment("// Licensed to the .NET Foundation under one or more agreements."), Comment("// The .NET Foundation licenses this file to you under the MIT license."), Comment("// See the LICENSE file in the project root for more information."), Trivia(PragmaWarningDirectiveTrivia(Token(SyntaxKind.DisableKeyword), true)))).AddMembers( ClassDeclaration("__IMessengerExtensions").AddModifiers( Token(SyntaxKind.InternalKeyword), Token(SyntaxKind.StaticKeyword), Token(SyntaxKind.PartialKeyword)).AddAttributeLists( AttributeList(SingletonSeparatedList( Attribute(IdentifierName($"global::System.CodeDom.Compiler.GeneratedCode")) .AddArgumentListArguments( AttributeArgument(LiteralExpression(SyntaxKind.StringLiteralExpression, Literal(this.typeName))), AttributeArgument(LiteralExpression(SyntaxKind.StringLiteralExpression, Literal(this.typeVersion)))))), AttributeList(SingletonSeparatedList(Attribute(IdentifierName("global::System.Diagnostics.DebuggerNonUserCode")))), AttributeList(SingletonSeparatedList(Attribute(IdentifierName("global::System.Diagnostics.CodeAnalysis.ExcludeFromCodeCoverage")))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.ComponentModel.EditorBrowsable")).AddArgumentListArguments( AttributeArgument(ParseExpression("global::System.ComponentModel.EditorBrowsableState.Never"))))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.Obsolete")).AddArgumentListArguments( AttributeArgument(LiteralExpression( SyntaxKind.StringLiteralExpression, Literal("This type is not intended to be used directly by user code"))))))).AddMembers( MethodDeclaration( GenericName("global::System.Action").AddTypeArgumentListArguments( IdentifierName("IMessenger"), PredefinedType(Token(SyntaxKind.ObjectKeyword))), Identifier("CreateAllMessagesRegistrator")).AddAttributeLists( AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.ComponentModel.EditorBrowsable")).AddArgumentListArguments( AttributeArgument(ParseExpression("global::System.ComponentModel.EditorBrowsableState.Never"))))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.Obsolete")).AddArgumentListArguments( AttributeArgument(LiteralExpression( SyntaxKind.StringLiteralExpression, Literal("This method is not intended to be called directly by user code"))))))).AddModifiers( Token(SyntaxKind.PublicKeyword), Token(SyntaxKind.StaticKeyword)).AddParameterListParameters( Parameter(Identifier("_")).WithType(IdentifierName(this.recipientType))) .WithBody(Block( LocalFunctionStatement( PredefinedType(Token(SyntaxKind.VoidKeyword)), Identifier("RegisterAll")) .AddModifiers(Token(SyntaxKind.StaticKeyword)) .AddParameterListParameters( Parameter(Identifier("messenger")).WithType(IdentifierName("IMessenger")), Parameter(Identifier("obj")).WithType(PredefinedType(Token(SyntaxKind.ObjectKeyword)))) .WithBody(Block( LocalDeclarationStatement( VariableDeclaration(IdentifierName("var")) .AddVariables( VariableDeclarator(Identifier("recipient")) .WithInitializer(EqualsValueClause( CastExpression( IdentifierName(this.recipientType), IdentifierName("obj"))))))) .AddStatements( ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName(this.messageTypeA)))) .AddArgumentListArguments(Argument(IdentifierName("recipient")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName(this.messageTypeB)))) .AddArgumentListArguments(Argument(IdentifierName("recipient")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName(this.messageTypeC)))) .AddArgumentListArguments(Argument(IdentifierName("recipient")))))), ReturnStatement(IdentifierName("RegisterAll")))), MethodDeclaration( GenericName("global::System.Action").AddTypeArgumentListArguments( IdentifierName("IMessenger"), PredefinedType(Token(SyntaxKind.ObjectKeyword)), IdentifierName("TToken")), Identifier("CreateAllMessagesRegistratorWithToken")).AddAttributeLists( AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.ComponentModel.EditorBrowsable")).AddArgumentListArguments( AttributeArgument(ParseExpression("global::System.ComponentModel.EditorBrowsableState.Never"))))), AttributeList(SingletonSeparatedList( Attribute(IdentifierName("global::System.Obsolete")).AddArgumentListArguments( AttributeArgument(LiteralExpression( SyntaxKind.StringLiteralExpression, Literal("This method is not intended to be called directly by user code"))))))).AddModifiers( Token(SyntaxKind.PublicKeyword), Token(SyntaxKind.StaticKeyword)).AddParameterListParameters( Parameter(Identifier("_")).WithType(IdentifierName(this.recipientType))) .AddTypeParameterListParameters(TypeParameter("TToken")) .AddConstraintClauses( TypeParameterConstraintClause("TToken") .AddConstraints(TypeConstraint(GenericName("global::System.IEquatable").AddTypeArgumentListArguments(IdentifierName("TToken"))))) .WithBody(Block( LocalFunctionStatement( PredefinedType(Token(SyntaxKind.VoidKeyword)), Identifier("RegisterAll")) .AddModifiers(Token(SyntaxKind.StaticKeyword)) .AddParameterListParameters( Parameter(Identifier("messenger")).WithType(IdentifierName("IMessenger")), Parameter(Identifier("obj")).WithType(PredefinedType(Token(SyntaxKind.ObjectKeyword))), Parameter(Identifier("token")).WithType(IdentifierName("TToken"))) .WithBody(Block( LocalDeclarationStatement( VariableDeclaration(IdentifierName("var")) .AddVariables( VariableDeclarator(Identifier("recipient")) .WithInitializer(EqualsValueClause( CastExpression( IdentifierName(this.recipientType), IdentifierName("obj"))))))) .AddStatements( ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName(this.messageTypeA), IdentifierName("TToken")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")), Argument(IdentifierName("token")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName(this.messageTypeB), IdentifierName("TToken")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")), Argument(IdentifierName("token")))), ExpressionStatement( InvocationExpression( MemberAccessExpression( SyntaxKind.SimpleMemberAccessExpression, IdentifierName("messenger"), GenericName(Identifier("Register")).AddTypeArgumentListArguments( IdentifierName(this.messageTypeC), IdentifierName("TToken")))) .AddArgumentListArguments(Argument(IdentifierName("recipient")), Argument(IdentifierName("token")))))), ReturnStatement(IdentifierName("RegisterAll"))))))); } } ```

CyrusNajmabadi commented 1 year ago

@Sergio0694 that matches my expectations. Using syntax trees at all is just hugely unnecessary overhead. They're gargantuan in memory, and involve an enormous amount of pointer chasing.

They're also just not a great intermediary construct for the domain of just wanting to write out a linearized string.

They are great if you want to actually operate on them in a tree-structured fashion. But I don't think I've ever seen a generator that actually does that. Rather, all generators just write the content start to finish. Or they make subsections, but then concatenate then linearly. All these cases are much better served with just writing a string.

Personally, I think writing the string is much easier and clearer too, as long as you have the right writer-abstraction (like the one I linked to).

AArnott commented 1 year ago

That is indeed compelling data.

FWIW, CsWin32 does emit node trees, and we modify them along the way. It would be a total rewrite (of a huge codebase) to switch to StringBuilder both because all our SyntaxFactory callers would have to change and because we'd essentially have to decide up front everything we need to do before starting to emit code because we wouldn't be able to emit nodes and modify them later as we do now. It might make a reasonable project to tackle by combining it with supporting incremental source generation, since doing that properly also needs a fairly substantial refactoring so that we can truly track all the inputs into the source generator.

CyrusNajmabadi commented 1 year ago

ecause we wouldn't be able to emit nodes and modify them later as we do now.

You can def do this.

Note: as i've mentioned in discord to several people, a very reasonable thing to do is create the indented string chucks for those other nodes up front (and then cache them). Later, when building the full text, it is trivial to add an existing indented node to the final indented writer you are using.

--

I cannot emphasize enough how much of a better approach to SG writing this is. Syntax and trees are a bad abstraction here. They are enormously expensive, unweildy, and designed to be good for very different scenarios.

Personally, i've never once found an SG case that benefits from them. Contract that with string writing. It's blazing fast. Extremely simple. Very easy to compose. And trivial to get the formatting you want within a line, and the indentation we expect for normal C# or VB.

As we can see from the numbers, you should expect roughly two orders of magnitude improvement for CPU and memory, and all using an absolutely trivial type that is super easy to use.

SGs should do this.

dotnet / roslyn

Proposal: SyntaxNode.ToFullStringWithNormalizedWhitespaces API #52914

Overview

Proposal