dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.16k stars 4.72k forks source link

API Proposal: Add a ValueStringBuilder #25587

Open JeremyKuhne opened 6 years ago

JeremyKuhne commented 6 years ago

We should consider making a value based StringBuilder to allow low allocation building of strings. While Span allows you to provide a writable buffer, in many scenarios we have a need to get or build strings and we don't know precisely how much space will be needed ahead of time. Having an abstraction that can grow beyond a given initial buffer is particularly useful as it doesn't require looping with Try* APIs- which can be both complicated and have negative performance implications.

We currently use ValueStringBuilder for this purpose internally. It starts with an optional initial buffer (which we often stackalloc) and will grow using ArrayPool if needed.

Design Goals

  1. Allow safe usage of stack memory
  2. Use pooled memory when needed to reduce GC pressure
  3. Allow dynamic and explicit capacity growth
  4. Facilitate interop scenarios (i.e. passing as char* szValue)
  5. Follow API semantics of StringBuilder & string where possible
  6. Be stack allocated

API

Here is the proposed API:

namespace System.Text
{
    public ref struct ValueStringBuilder
    {
        public ValueStringBuilder(Span<char> initialBuffer);

        // The logical length of the builder (end of the "string")
        public int Length { get; set; }

        // Available space in chars
        public int Capacity { get; }

        // Ensure there is at least this amount of space
        public void EnsureCapacity(int capacity);

        // Get a pinnable reference to the builder. "terminate" ensures the builder has a null char after Length in the buffer.
        public ref char GetPinnableReference(bool terminate = false);

        // Indexer, allows setting/getting individual chars
        public ref char this[int index] { get; }

        // Returns a string based off of the current position
        public override string ToString();

        // Returns a span around the contents of the builder. "terminate" ensures the builder has a null char after Length in the buffer.
        public ReadOnlySpan<char> AsSpan(bool terminate);

        // To ensure inlining perf, we have a separate overload for terminate
        public ReadOnlySpan<char> AsSpan();

        public bool TryCopyTo(Span<char> destination, out int charsWritten);

        public void Insert(int index, char value, int count = 1);
        public void Insert(int index, ReadOnlySpan<char> value, int count = 1);

        public void Append(char c, int count = 1);
        public void Append(ReadOnlySpan<char> value);

        // This gives you an appended span that you can write to
        public Span<char> AppendSpan(int length);

        // Returns any ArrayPool buffer that may have been rented
        public void Dispose()
    }
}

This is the current shape of our internal ValueStringBuilder:

namespace System.Text
{
    internal ref struct ValueStringBuilder
    {
        public ValueStringBuilder(Span<char> initialBuffer);
        public int Length { get; set; }
        public int Capacity { get; }
        public void EnsureCapacity(int capacity);

        /// <summary>
        /// Get a pinnable reference to the builder.
        /// </summary>
        /// <param name="terminate">Ensures that the builder has a null char after <see cref="Length"/></param>
        public ref char GetPinnableReference(bool terminate = false);
        public ref char this[int index] { get; }

        // Returns a string based off of the current position
        public override string ToString();

        /// <summary>
        /// Returns a span around the contents of the builder.
        /// </summary>
        /// <param name="terminate">Ensures that the builder has a null char after <see cref="Length"/></param>
        public ReadOnlySpan<char> AsSpan(bool terminate);

        // To ensure inlining perf, we have a separate overload for terminate
        public ReadOnlySpan<char> AsSpan();

        public bool TryCopyTo(Span<char> destination, out int charsWritten);
        public void Insert(int index, char value, int count);
        public void Append(char c);
        public void Append(string s);
        public void Append(char c, int count);
        public unsafe void Append(char* value, int length);
        public void Append(ReadOnlySpan<char> value);

        // This gives you an appended span that you can write to
        public Span<char> AppendSpan(int length);

        // Returns any ArrayPool buffer that may have been rented
        public void Dispose()
    }
}

Sample Code

Here is a common pattern on an API that could theoretically be made public if ValueStringBuilder was public: (Although we would call this one GetFullUserName or something like that.)

https://github.com/dotnet/corefx/blob/050bc33738887d9d8fcc9bc5965b7d9ca65bc7f4/src/System.Runtime.Extensions/src/System/Environment.Win32.cs#L40-L56

The caller is above this method:

https://github.com/dotnet/corefx/blob/050bc33738887d9d8fcc9bc5965b7d9ca65bc7f4/src/System.Runtime.Extensions/src/System/Environment.Win32.cs#L13-L38

Usage of AppendSpan:

https://github.com/dotnet/corefx/blob/3538128fa1fb2b77a81026934d61cd370a0fd7f5/src/System.Runtime.Numerics/src/System/Numerics/BigNumber.cs#L550-L560

I'll add more usage details and possible API surface area.

Notes

JeremyKuhne commented 6 years ago

@jkotas, @stephentoub, @KrzysztofCwalina, @danmosemsft, @ahsonkhan

stephentoub commented 6 years ago

bool terminate

I didn't see that we'd added these arguments. I don't think we should have them. If you want to null terminate, you can call Append('\0').

JeremyKuhne commented 6 years ago

I didn't see that we'd added these arguments. I don't think we should have them. If you want to null terminate, you can call Append('\0').

Discussed this one at length with @jkotas in https://github.com/dotnet/coreclr/pull/16921. The problem with appending a null is that it changes the Length and breaks the convention on what Length means for strings & StringBuilder (i.e. doesn't include the terminator). It is incredibly error prone as a result (as my need for the mentioned pull demonstrates).

StringBuilder always terminates, but doing this comes at a cost. It is only really needed in interop calls that take a sz (most Win32).

stephentoub commented 6 years ago

The problem with appending a null is that it changes the Length and breaks the convention on what Length means for strings & StringBuilder (i.e. doesn't include the terminator).

The bool terminate is also changing the Length.

JeremyKuhne commented 6 years ago

The bool terminate is also changing the Length.

No, it leaves Length as is. It might increase Capacity.

stephentoub commented 6 years ago

Hmm, I see, ok.

omariom commented 6 years ago

Are they needed if there is an overload taking ROSpan?

public void Append(string s); public unsafe void Append(char* value, int length);

JeremyKuhne commented 6 years ago

Are they needed if there is an overload taking ROSpan?

No, they aren't. I'll add a proposed section.

omariom commented 6 years ago

Can we get C# to allow using ref structs that have a Dispose() in a using statement?

There is a proposal https://github.com/dotnet/csharplang/issues/93 It would be even better with this one https://github.com/dotnet/csharplang/issues/114

KrzysztofCwalina commented 6 years ago

There are several related efforts in corfxlab. We should try to unify (or make them consistent). See: https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriter.cs https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriterT_ints.cs https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriter_strings.cs And here support for non-allocating composite formats: https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Formatting/System/Text/Formatting/CompositeFormat.cs

This ValueStringBuilder is a specialized BufferWriter. And so if we add it, we need to rationalize it with BufferWriter, i.e. we should not have both BufferWriter and ValueStringBuilder in .NET.

BufferWriter (the non-generic one) is a byref struct that allows writing (and resizing) to a span buffer. Today, it writes Utf8, but you could imagine writing either Utf8 or utf16 based on some setting. After the buffer is written, tere could be ToString or something like that to produce string result.

JeremyKuhne commented 6 years ago

@KrzysztofCwalina thanks for the links! I've updated the text with design goals and will review the other efforts.

MarcoRossignoli commented 6 years ago

Should we allow you to make it non-growable? (So we can avoid TryGet above?)

// some reason it isn't, we'll grow the buffer.

In case of growable could be useful to have a boolean properties to know if VSB has grown? In a hot path could be useful avoid to add pressure to underlying ArrayPool.Shared(today) and improve performance by choosing the right size of initial buffer.

Joe4evr commented 6 years ago

Some questions, I guess...

public void Insert(int index, char value, int count = 1);
public void Insert(int index, ReadOnlySpan<char> value, int count = 1);
public void Append(char c, int count = 1);
public void Append(ReadOnlySpan<char> value);
khellang commented 6 years ago

Speaking of related efforts... There's also the InplaceStringBuilder from Microsoft.Extensions.Primitives, which is used in MVC and the HTTP abstractions. Might want to sync up with them as well.

// @pakrym @Tratcher

JeremyKuhne commented 6 years ago

In case of growable could be useful to have a boolean properties to know if VSB has grown?

Perhaps we should add an EventSource log for this?

JeremyKuhne commented 6 years ago

@Joe4evr

Are these the only methods to build with, or will there be more parity with the existing StringBuilder API (AppendLine, string parameter, et al.) and those are omitted from the post for brevity?

These were the most critical/obvious (based on current usage). I didn't want to start the discussion with too much complexity. Attacking this in priority order seemed like it would go a little quicker.

Is there a reason these don't return the this instance to enable fluent building?

Hadn't considered it is all. I'll take a look.

jkotas commented 6 years ago

fluent building

Fluent pattern adds overhead, and it does not work for structs that needs to be disposed.

Joe4evr commented 6 years ago

Fluent pattern adds overhead, and it does not work for structs that needs to be disposed.

Does it, now? Surely, it's less overhead than display class allocation or something.

ghost commented 6 years ago

Combining two of @khellang's comment https://github.com/dotnet/corefx/issues/20378#issuecomment-304400894 & https://github.com/dotnet/corefx/issues/28379#issuecomment-375975366, it would be super helpful for almost all the .NET developers if we get a comprehensive blog article or Channel 9 video on all these newly optimized string related types in .NET Core and ASP.NET Core detailing when to use which and why. During that exercise if some overlapping types are found, please consider deprecating one. With every upcoming string-related type, it is getting scarier. 😱

khellang commented 6 years ago

it would be super helpful [...] if we get a comprehensive blog article or Channel 9 video on all these newly optimized string related types [...] detailing when to use which and why.

I think the general guidance (once Span<T> and friends officially hit the streets) is to use Span<T> (or ReadOnlySpan<T>, Memory<T>, ReadOnlyMemory<T> etc.) whenever you can.

StringSegment, ArraySegment<T> and various other related types (some of which has been around for some time) are all efforts to achieve some of the same benefits as the System.Memory types. At this point, I think you could argue they should all be viewed as "obsolete" or "redundant".

jnm2 commented 6 years ago

StringSegment is obsolete? https://github.com/dotnet/corefx/issues/20378

khellang commented 6 years ago

No, but with the other new APIs (on ReadOnlySpan<char>), it's arguable how much value yet another slice-of-string type would bring vs the overhead of the type existing at all. So I would say "redundant" 😉

Joe4evr commented 6 years ago

:confused: The API proposal itself clearly states:

but spans have different type system constraints that not all consumers can -- or want to -- opt into.

khellang commented 6 years ago

I'm assuming @terrajobst was referring to the stack-only-ness of ReadOnlySpan<T>. That doesn't apply to ReadOnlyMemory<T>.

There might be valid reasons for yet another slice-of-string type, but as I mentioned in https://github.com/dotnet/corefx/issues/20378#issuecomment-304400894, the number of these types are getting a bit out of hand and it's hard to know which to use and which to support in libraries.

JeremyKuhne commented 6 years ago

The key thing we're missing that this addresses is a low allocation expandable Span<T> provider. We could go super simple with something like ref SpanProvider<T> cutting all string-like methods and perhaps just providing the functionality as extensions. I'm not fundamentally opposed to going that route, but the fact is that we do provide StringBuilder. It is a concept that people are familiar with and we need a mechanism to replace it for performance sensitive code paths. It seems less burdensome to me to juggle StringBuilder/ValueStringBuilder than StringBuilder/SpanProvider<char>, particularly as we're already moving down the road of having Value* value type equivalents of other reference types (e.g. it is an emergent pattern).

fanoI commented 6 years ago

I have a question regarding the Dispose() of this structure it should be always called? If so the posted examples are not risking to leaking memory?

JeremyKuhne commented 6 years ago

I have a question regarding the Dispose() of this structure it should be always called?

It returns any potentially rented buffers to the array pool. ToString() internally also returns the buffer, I'll call that out. On a related note, we don't usually use these with try ... finally blocks as in many cases we don't expect throws to be common enough to warrant taking the performance hit of introducing the block- we're ok with not returning the buffer to the pool in those cases (which will get it garbage collected).

JeremyKuhne commented 6 years ago

Note that I'm exploring alternative options utilizing some of the concepts @KrzysztofCwalina brought up (with @KrzysztofCwalina). We'll circle back around with more info.

ViktorHofer commented 6 years ago

@JeremyKuhne have you made progress on exploring new options? I think we should definitely get this into 3.0.

JeremyKuhne commented 6 years ago

have you made progress on exploring new options?

https://github.com/dotnet/corefxlab/issues/2177 is part of what might negate the need for this. I currently own productizing BufferReader/Writer (https://github.com/dotnet/corefxlab/issues/2166) and have just started in earnest.

I'll be experimenting with how I can provide the same functionality and taking a look at any tradeoffs. ValueStringBuilder is the backup plan if I can't provide a reasonable solution with the other new types.

I think we should definitely get this into 3.0.

I intend to get this or an equivalent in. I really want it as well as not having this sort of type is blocking adding other APIs that I'm very interested in seeing. :)

vancem commented 5 years ago
iSazonov commented 5 years ago

I'd expect that the API has a parity with StringBuilder:

stephentoub commented 5 years ago

I'd expect that the API has a parity with StringBuilder

I don't believe that should be a goal. They're meant for different uses and therefore don't need perfect symmetry. For example, ValueStringBuilder is meant for situations where you're trying to minimize allocation overhead as much as possible, and in such situations, you're not going to want to use a method that takes a params object[].

iSazonov commented 5 years ago

I think more about what users will replace StringBuilder with ValueStringBuilder in existing codes and can have difficulty due to the lack of some methods.

danmoseley commented 5 years ago

Not params methods, but Clear() is reasonable, and also Remove and Replace given it already has O(n) methods like Insert.

iSazonov commented 5 years ago

AppendFormat() can be useful too. Example https://source.dot.net/#System.Runtime.WindowsRuntime/System/Windows/UI/Color.cs,102

danmoseley commented 5 years ago

AppendFormat() does a lot of work (and equally requires a lot of code to implement). This type is more narrowly targeted at scenarios where you want to accept a bit more complexity for the most possible performance, in that scenario you are likely to handle any formatting yourself.

JeremyKuhne commented 5 years ago

@iSazonov Having format methods on this in the future is definitely something I'm looking at. I'll be putting a prototype of one in corefxlab very soon. Doing formatting in a non-allocating way is a little tricky given boxing, etc. As such it will take a little while to shake out and I don't want to hold up the core functionality here. I'll make sure to tag you on the prototype PR.

jkotas commented 5 years ago

StringBuilder is optimized for convenience and broad range of scenarios. It will continue to work better than ValueStringBuilder in many case.

ValueStringBuilder is not optimized for convenience. It is optimized for building small strings (<1kB) where the work required to format the string is small and fixed StringBuilder overhead is too high.

iSazonov commented 5 years ago

where the work required to format the string is small

Could you please clarify that do you mean? I'd expect that formatting is intended primarily for people and result string will definitely be smaller 1 Kb like in example I pointed above.

jkotas commented 5 years ago

Formatting strings and string interpolation as implemented in .NET today are convenience that you pay quite a bit for.

String formatting/interpolation that is both convenient to use and efficient as handwritten formatting would require C# compiler changes, I think.

terrajobst commented 5 years ago

Video

We believe that without a feature that forces consumers to pass value types by reference (either language feature or in-box analyzer) this will cause unreliable applications due to corruption of the array pool when accidental copies are being made.

@JeremyKuhne do you want to drive that discussion with @jaredpar?

JeremyKuhne commented 5 years ago

@JeremyKuhne do you want to drive that discussion with @jaredpar?

Yes, we need to have a way to force structs to be only passed by reference or (possibly) stack only classes. One possibility for forced by-ref structs that has been raised is ref local.

terrajobst commented 5 years ago

One possibility for forced by-ref structs that has been raised is ref local.

I'm sad it's not ref ref 😉

Joe4evr commented 5 years ago

Since the language team has expressed interest in allowing more features that might need new runtime constructs, it seems best to consider "Non-Copyable structs" (or maybe "ref Fields" instead) on a list of features that may warrant future-CLR enforcement.

Entomy commented 4 years ago

A few months back I implemented basically this, thinking it was a good idea as well. Exact implementation will cause performance to vary, so take it with a grain of salt, but I was found was roughly:

For allocations of 5 chars long, less than 12 allocations was faster to use the traditional StringBuilder (tSB). For allocations less than 5 chars long, it was almost always faster to use the tSB. For allocations more than 10 chars long, it was almost always faster to use the tSB. For allocations of 5 chars long, only between 12 to 41 allocations were faster with the value/ref StringBuilder (vSB). For allocations of 8 chars long, only between 18 to 22 allocations were faster for the vSB.

Hopefully you're getting the picture without me needing to clone the old commit and recreate the 3D graphs. The amount of situations where the vSB was actually higher performing were extremely limited, and not clearly or obviously defined.

Furthmore, in every single code base I have where this might have been useful, I worked out ways to just use an array/buffer, where even additional computations or value passing still outperformed the vSB in every single instance.

So, it might be useful, but as a heads up to whoever implements this, you're gonna need to benchmark the ever living **** out of it, and document the small hole where it's actually useful.

danmoseley commented 4 years ago

@Entomy by faster, presumably you mean allocation. There is also the garbage to consider.

Entomy commented 4 years ago

@danmosemsft Right, absolutely. That's basically the point I'm trying to make, is that this is complicated and it seems like there's only a very small, and unusual, range in which this is viable.

stephentoub commented 4 years ago

there's only a very small, and unusual

Reduction in allocation often has little-to-no measurable impact on serialized benchmark throughput. The benefit of allocation reduction when it comes to throughput is generally about the whole app running at scale.

danmoseley commented 4 years ago

This is marked blocked because it potentially requires language features. We need a way to enforce that structs are passed only by reference so that we don't double return arrays to the pool.