dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
18.93k stars 4.02k forks source link

Create an 'Indenting String Builder' type to allow users to easily generate C# code that looks good, with almost no overhead. #71162

Open CyrusNajmabadi opened 9 months ago

CyrusNajmabadi commented 9 months ago

Background and Motivation

Today, many SG authors attempt to create code using the SyntaxModel. This is both arduous and extraordinarily slow. Code has to figure out how to properly create C# syntax constructs (non trivial in many cases like indentation, doc comments, and the like), or avoid trivia entirely, and then use NormalizeWhitespace to normalize it. This is extremely slow in practice, and consumes lots of memory with all the intermediary constructs.

Emperical testing with the real world ComputeSharp and MVVM toolkits show a real gain of nearly 100x in terms of CPU usage and 100x in terms of memory usage by having the generators write directly to a string builder.

However, directly using a stringbuilder is slightly unpleasant, as a user has to keep track of the most common construct that makes code legible: indentation.

Proposed API

The sketch of what i'm proposing is a dedicated IndentingStringBuilder that one could use to generate indented code, designed to be good at that job, and otherwise as bare metal as possible.

The major design points are:

  1. Newlines and indentation are automatically handled by the user. As long as the buffer is at the start of the line, an indentation will be handled for the user.
  2. Fast by default. Most calls do almost nothing but some simple checks and passing data through to the buffer.
  3. Easy to add multiple lines in a single call (potentially an opt-in more-costly fashion).
  4. Easy to open blocks { ... } and other indented regions of code.
  5. Low overhead interpolation support.
  6. Common proven patterns for constructs are easy (or provided by extensions). Especially writing out lists of things.
  7. Lowest overhead ability to get a SourceText directly (since we can internally just pass the data around, and avoid intermediary reallocs).

Note: We should move our own Syntax-generator to this, and work with community to try out our prototypes of this to help refine the final shape. As such, we should likely ship this as experimental so we have some time to revise.

Sketch:

namespace Microsoft.CodeAnalysis.CSharp
{
+public struct IndentingStringBuilder : IDisposable
+{
+    public IndentingStringBuilder(string indentationString, string endOfLine);
+    public static IndentingStringBuilder Create(string indentation = DefaultIndentation, string endOfLine = DefaultEndOfLine)
+
     // Releases resources for efficiency
+    public void Dispose();

+    /// <summary>
+    /// Increases the current indentation level, increasing the amount of indentation written at the start of a
+    /// new line when content is written to this.
+    /// </summary>
+    public void IncreaseIndent();

+    /// <summary>
+    /// Decreases the current indentation level, decreasing the amount of indentation written at the start of a
+    /// new line when content is written to it.
+    /// </summary>
+    public void DecreaseIndent();

+    /// Returns a new SourceText with the contents of this writer, and resets this writer to its initial state.
+    public SourceText GetSourceTextAndClear();

+    /// <summary>
+    /// Writes content to the underlying buffer.  If the buffer is at the start of a line, then indentation will be
+    /// appended first before the content.  By default, for performance reasons, the content is assumed to contain no
+    /// end of line characters in it.  If the content may contain end of line characters, then <see langword="true"/>
+    /// should be passed in for <paramref name="splitContent"/>.  This will cause the provided content to be split into
+    /// constituent lines, with each line being appended one at a time.
+    /// </summary>
+    public IndentingStringBuilder Write(string content, bool splitContent = false);

+    /// <inheritdoc cref="Write(string, bool)"/>
+    public IndentingStringBuilder Write(ReadOnlySpan<char> content, bool splitContent = false);

+    /// <summary>
+    /// Equivalent to <see cref="Write(string, bool)"/> except that a final end of line sequence will be written after
+    /// the content is written.
+    /// </summary>
+    public IndentingStringBuilder WriteLine(string content = "", bool splitContent = false);

+    /// <inheritdoc cref="WriteLine(string, bool)"/>
+    public IndentingStringBuilder WriteLine(ReadOnlySpan<char> content, bool splitContent = false);

+    /// <summary>
+    /// Ensures that the current buffer has at least one blank line between the last written content and the content
+    /// that would be written.  Note: a line containing only whitespace/indentation is not considered an empty line.
+    /// Only a line with no content on it counts.
+    /// </summary>
+    /// <returns></returns>
+    public readonly IndentingStringBuilder EnsureEmptyLine();

+    /// <summary>
+    /// Opens a block of code to write new content into, using <c>{</c> and <c>}</c> as the block delimeters.  Can be used like so:
+    /// <code>
+    /// using (writer.StartBlock())
+    /// {
+    ///     write.WriteLine("...");
+    ///     write.WriteLine("...");
+    /// }
+    /// </code>
+    /// </summary>
+    public Region EnterBlock();

+    /// <summary>
+    /// Opens a block of code to write new content into without delimeters.  Can be used like so:
+    /// <code>
+    /// using (writer.StartBlock())
+    /// {
+    ///     write.WriteLine("...");
+    ///     write.WriteLine("...");
+    /// }
+    /// </code>
+    /// </summary>
+    public Region EnterIndentedRegion();

+    public readonly struct Region : IDisposable;

+    public readonly IndentingStringBuilder Write(bool splitContent, [InterpolatedStringHandlerArgument("", nameof(splitContent))] WriteInterpolatedStringHandler handler);
+    public readonly IndentingStringBuilder Write([InterpolatedStringHandlerArgument("")] WriteInterpolatedStringHandler handler)
+    public readonly IndentingStringBuilder WriteLine(bool splitContent, [InterpolatedStringHandlerArgument("", nameof(splitContent))] WriteInterpolatedStringHandler handler);
+    public readonly IndentingStringBuilder WriteLine([InterpolatedStringHandlerArgument("")] WriteInterpolatedStringHandler handler);

    /// <summary>
    /// Provides a handler used by the language compiler to append interpolated strings into <see cref="IndentedTextWriter"/> instances.
    /// </summary>
    [EditorBrowsable(EditorBrowsableState.Never)]
    [InterpolatedStringHandler]
    public readonly ref struct WriteInterpolatedStringHandler
    {
        public WriteInterpolatedStringHandler(int literalLength, int formattedCount, IndentingStringBuilder builder, bool splitContent = false);

        public void AppendLiteral(string literal);
        public void AppendFormatted<T>(T value);
        public void AppendFormatted<T>(T value, string format) where T : IFormattable;
    }
}

Usage Examples

using var builder = IndentedStringBuilder.Create();
builder.AppendLine("using System;");
builder.EnsureBlankLine();
builder.AppendLine("public class C");
using (builder.EnterBlock())
{
    foreach (var member in members)
    {
        builder.AppendLine($"public {member.ReturnType} {member.Name}()");
        using (builder.EnterBlock())
        {
        }
    }
}

API draft here: https://github.com/dotnet/roslyn/pull/71163

You can also imagine helpers/extensions (or instance members) for common patterns. For example:

    public static void WriteBlankLineSeparatedContent<T, TArg>(
        this IndentedTextWriter writer,
        ReadOnlySpan<T> items,
        Action<IndentedStringBuilder, T, TArg> callback);

    public static void WriteSeparatedContent<T, TArg>(
        this IndentedTextWriter writer,
        ReadOnlySpan<T> items,
        Action<IndentedStringBuilder, T, TArg> callback,
        string separator = ", ");

Which could then be used like so:

using var builder = IndentedStringBuilder.Create();
builder.AppendLine("using System;");
builder.EnsureBlankLine();
builder.AppendLine("public class C");
using (builder.EnterBlock())
{
    builder.WriteBlankLineSeparatedContent(
        members,
        static (builder, member, arg) => 
    {
        builder.Append($"public {member.ReturnType} {member.Name}(");
        builder.WriteSeparatedContent(
            member.Parameters,
            static (builder, parameter, arg) => ...);
        builder.AppendLine(")");

        using (builder.EnterBlock())
        {
        }
    }
}

This would help for the common cases of wanting blank lines between things, but not at the start or end. And so on.

Other helpers the community has found useful are:

+    public IndentingStringBuilder WriteIf(bool condition, string content, bool splitContent = false);
+    public IndentingStringBuilder WriteLineIf(bool condition, string? content = "", bool splitContent = false);

Which allows for common patterns of only writing out data if a condition holds, helping to avoid lots of if (...) tests in code before writing things out. It's likely we want this, especially as it woudl allow us to do the interpolation-handler form ourselves, without forcing users to have to provide that.

CyrusNajmabadi commented 9 months ago

Generally speaking, i want us to start with something reasonable. And then drive additional capabilities as additional API requests here.

grant-d commented 9 months ago

Love it. I have needed & have built such a utility several times. Having it in the platform would be great.

Some (small) design feedback. This pattern of signatures would appear to be biased towards C-based languages:

public Region EnterBlock()
        => EnterIndentedRegion("{", "}");

public Region EnterIndentedRegion()
        => EnterIndentedRegion("", "");

private Region EnterIndentedRegion(string open, string close)
...

I believe the final (private) signature is probably more useful as a public member. Even if we limit ourselves to C# we would likely be restricted without it. For example, I may be declaring a complex (multi-line) array, so need to use [ ], or a complex parameter set ( ). In other language such as SQL or VB I may need BEGIN END, or creating a SQL table or procedure (directly or within a C# string) may need ( ). And even a large comment block may be served by using /* */, and a verbatim string by """, """. And so on.

Perhaps:

  // Same, curly is default
  public Region EnterIndentedRegion()
  // Note public
  public Region EnterIndentedRegion(string open, string close)
  // Nice but not necessary
  public Region EnterIndentedRegion(IndentKind kind) // Enum with common delimiters, eg Curly, Round, Square
sharwell commented 9 months ago

This seems like something that would be better suited to individual source generators, and not a public API for Roslyn.

CyrusNajmabadi commented 9 months ago

Some (small) design feedback. This pattern of signatures would appear to be biased towards C-based languages:

Correct. This is intended entirely for use by Roslyn source generators.

CyrusNajmabadi commented 9 months ago

This seems like something that would be better suited to individual source generators, and not a public API for Roslyn.

We have found nearly all our partners needing this. As such, we should provide out so that we don't have N teams having to roll their own version.

Our existing strategy has simply not worked out, with teams seeing massive perf costs trying to do this themselves. We have to lead by example, with best in class libraries and examples they can leverage.

Effectively, we have to start digging ourselves out of the major hole we're in. Hoping for our partners and this parties to pick this up has proven ineffective.

CyrusNajmabadi commented 9 months ago

I believe the final (private) signature is probably more useful as a public member.

Sure. This seems reasonable to me.

jkoritzinsky commented 9 months ago

There's been some push from Roslyn that not all code-generation tools need to be Roslyn source generators. Would there be any appetite for shipping a high-performance intented text writer type outside of Roslyn (possibly in the BCL in an out-of-band assembly) and providing the integration with Roslyn types (in particular the public SourceText GetSourceTextAndClear() method) as an extension method in Roslyn on top of the type?

We're looking at moving to an text writer model for the interop source generators and we'd love to use this type; however, many interop code-gen tools don't need to be source generators (and as per the Roslyn team's advice, will not be source generators). We'd like to share the code between the two scenarios without pulling in references to Roslyn for the non-SG scenarios.

CyrusNajmabadi commented 9 months ago

As an extension, we likely could not get access to the internals needed to make GetSourceTextAndClear fast, which would be an issue.

Would there be any appetite for shipping a high-performance intented text writer type outside of Roslyn

I'd be ok shipping this as a source-package, with the roslyn bits coming in as an additional partial type.

Would there be any appetite for shipping a high-performance intented text writer type outside of Roslyn

Note: outside of roslyn, i don't think you need a high-performance writer. Roslyn needs it because we have to source-gen to get compilations, and any edit may end up rerunning generators. For normal apps that just write out code, any normal path is fine. Heck, use the roslyn syntax-model, with all the allocs/CPU that entails.

In other words, this is to get a generator down from hundreds of MS, to a handful of them, so that those costs don't add up as we run them continuously. For something that runs only on build, saving a few hundred ms is just not that big a deal :)

CyrusNajmabadi commented 9 months ago

We'd like to share the code between the two scenarios without pulling in references to Roslyn for the non-SG scenarios.

Fair. Making this easily source-consumable would be nice.

jkoritzinsky commented 8 months ago

I spoke offline with Cyrus and we came up with a possible alternative solution for the "define outside of Roslyn, consume within Roslyn efficiently" proposal of mine:

Instead of storing a SourceText in the IndentingStringBuilder, the builder would store/produce a ReadOnlySequence<char>. We would have a ReadOnlySequence<char> GetTextAndClear(); method that would transfer out the existing data storage and reset it. Roslyn would provide an extension method SourceText GetSourceTextAndClear(this IndentingStringBuilder builder); which would make a SourceText around the ReadOnlySequence<char> without copying it.

We though ReadOnlySequence<T> would be a good fit as it allows us to represent the generated text like how SourceText abstracts; it allows us to split a buffer to keep it under the LOH limit but still present a good API surface for reading like how StringText vs LongText does in Roslyn. By using a "Get and Clear" method like above, the extension method could assume that there will not be any more writers and as such could avoid an expensive copy.

sharwell commented 8 months ago

There's been some push from Roslyn that not all code-generation tools need to be Roslyn source generators.

Having been the author/maintainer of a non-Roslyn source generator, I would not recommend this path even to someone I disliked. It is unpleasant on the best of days. The integration points are poorly defined, changed numerous times over the years, and are impossible to test to any level of confidence since there is no record of when/how things changed.