Open JeremyKuhne opened 6 years ago
@jkotas, @stephentoub, @KrzysztofCwalina, @danmosemsft, @ahsonkhan
bool terminate
I didn't see that we'd added these arguments. I don't think we should have them. If you want to null terminate, you can call Append('\0')
.
I didn't see that we'd added these arguments. I don't think we should have them. If you want to null terminate, you can call Append('\0').
Discussed this one at length with @jkotas in https://github.com/dotnet/coreclr/pull/16921. The problem with appending a null is that it changes the Length
and breaks the convention on what Length
means for strings
& StringBuilder
(i.e. doesn't include the terminator). It is incredibly error prone as a result (as my need for the mentioned pull demonstrates).
StringBuilder
always terminates, but doing this comes at a cost. It is only really needed in interop calls that take a sz
(most Win32).
The problem with appending a null is that it changes the Length and breaks the convention on what Length means for strings & StringBuilder (i.e. doesn't include the terminator).
The bool terminate
is also changing the Length.
The
bool terminate
is also changing the Length.
No, it leaves Length as is. It might increase Capacity.
Hmm, I see, ok.
Are they needed if there is an overload taking ROSpan?
public void Append(string s); public unsafe void Append(char* value, int length);
Are they needed if there is an overload taking ROSpan?
No, they aren't. I'll add a proposed section.
Can we get C# to allow using ref structs that have a Dispose() in a using statement?
There is a proposal https://github.com/dotnet/csharplang/issues/93 It would be even better with this one https://github.com/dotnet/csharplang/issues/114
There are several related efforts in corfxlab. We should try to unify (or make them consistent). See: https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriter.cs https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriterT_ints.cs https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.ReaderWriter/System/Buffers/Writer/BufferWriter_strings.cs And here support for non-allocating composite formats: https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Formatting/System/Text/Formatting/CompositeFormat.cs
This ValueStringBuilder is a specialized BufferWriter. And so if we add it, we need to rationalize it with BufferWriter, i.e. we should not have both BufferWriter and ValueStringBuilder in .NET.
BufferWriter (the non-generic one) is a byref struct that allows writing (and resizing) to a span buffer. Today, it writes Utf8, but you could imagine writing either Utf8 or utf16 based on some setting. After the buffer is written, tere could be ToString or something like that to produce string result.
@KrzysztofCwalina thanks for the links! I've updated the text with design goals and will review the other efforts.
Should we allow you to make it non-growable? (So we can avoid TryGet above?)
// some reason it isn't, we'll grow the buffer.
In case of growable could be useful to have a boolean properties to know if VSB has grown? In a hot path could be useful avoid to add pressure to underlying ArrayPool
Some questions, I guess...
public void Insert(int index, char value, int count = 1); public void Insert(int index, ReadOnlySpan<char> value, int count = 1); public void Append(char c, int count = 1); public void Append(ReadOnlySpan<char> value);
StringBuilder
API (AppendLine
, string
parameter, et al.) and those are omitted from the post for brevity?this
instance to enable fluent building?Speaking of related efforts... There's also the InplaceStringBuilder
from Microsoft.Extensions.Primitives, which is used in MVC and the HTTP abstractions. Might want to sync up with them as well.
// @pakrym @Tratcher
In case of growable could be useful to have a boolean properties to know if VSB has grown?
Perhaps we should add an EventSource log for this?
@Joe4evr
Are these the only methods to build with, or will there be more parity with the existing StringBuilder API (AppendLine, string parameter, et al.) and those are omitted from the post for brevity?
These were the most critical/obvious (based on current usage). I didn't want to start the discussion with too much complexity. Attacking this in priority order seemed like it would go a little quicker.
Is there a reason these don't return the this instance to enable fluent building?
Hadn't considered it is all. I'll take a look.
fluent building
Fluent pattern adds overhead, and it does not work for structs that needs to be disposed.
Fluent pattern adds overhead, and it does not work for structs that needs to be disposed.
Does it, now? Surely, it's less overhead than display class allocation or something.
Combining two of @khellang's comment https://github.com/dotnet/corefx/issues/20378#issuecomment-304400894 & https://github.com/dotnet/corefx/issues/28379#issuecomment-375975366, it would be super helpful for almost all the .NET developers if we get a comprehensive blog article or Channel 9 video on all these newly optimized string related types in .NET Core and ASP.NET Core detailing when to use which and why. During that exercise if some overlapping types are found, please consider deprecating one. With every upcoming string-related type, it is getting scarier. 😱
it would be super helpful [...] if we get a comprehensive blog article or Channel 9 video on all these newly optimized string related types [...] detailing when to use which and why.
I think the general guidance (once Span<T>
and friends officially hit the streets) is to use Span<T>
(or ReadOnlySpan<T>
, Memory<T>
, ReadOnlyMemory<T>
etc.) whenever you can.
StringSegment
, ArraySegment<T>
and various other related types (some of which has been around for some time) are all efforts to achieve some of the same benefits as the System.Memory types. At this point, I think you could argue they should all be viewed as "obsolete" or "redundant".
StringSegment
is obsolete? https://github.com/dotnet/corefx/issues/20378
No, but with the other new APIs (on ReadOnlySpan<char>
), it's arguable how much value yet another slice-of-string type would bring vs the overhead of the type existing at all. So I would say "redundant" 😉
:confused: The API proposal itself clearly states:
but spans have different type system constraints that not all consumers can -- or want to -- opt into.
I'm assuming @terrajobst was referring to the stack-only-ness of ReadOnlySpan<T>
. That doesn't apply to ReadOnlyMemory<T>
.
There might be valid reasons for yet another slice-of-string type, but as I mentioned in https://github.com/dotnet/corefx/issues/20378#issuecomment-304400894, the number of these types are getting a bit out of hand and it's hard to know which to use and which to support in libraries.
The key thing we're missing that this addresses is a low allocation expandable Span<T>
provider. We could go super simple with something like ref SpanProvider<T>
cutting all string-like methods and perhaps just providing the functionality as extensions. I'm not fundamentally opposed to going that route, but the fact is that we do provide StringBuilder
. It is a concept that people are familiar with and we need a mechanism to replace it for performance sensitive code paths. It seems less burdensome to me to juggle StringBuilder
/ValueStringBuilder
than StringBuilder
/SpanProvider<char>
, particularly as we're already moving down the road of having Value*
value type equivalents of other reference types (e.g. it is an emergent pattern).
I have a question regarding the Dispose() of this structure it should be always called? If so the posted examples are not risking to leaking memory?
I have a question regarding the Dispose() of this structure it should be always called?
It returns any potentially rented buffers to the array pool. ToString()
internally also returns the buffer, I'll call that out. On a related note, we don't usually use these with try ... finally
blocks as in many cases we don't expect throws to be common enough to warrant taking the performance hit of introducing the block- we're ok with not returning the buffer to the pool in those cases (which will get it garbage collected).
Note that I'm exploring alternative options utilizing some of the concepts @KrzysztofCwalina brought up (with @KrzysztofCwalina). We'll circle back around with more info.
@JeremyKuhne have you made progress on exploring new options? I think we should definitely get this into 3.0.
have you made progress on exploring new options?
https://github.com/dotnet/corefxlab/issues/2177 is part of what might negate the need for this. I currently own productizing BufferReader/Writer (https://github.com/dotnet/corefxlab/issues/2166) and have just started in earnest.
I'll be experimenting with how I can provide the same functionality and taking a look at any tradeoffs. ValueStringBuilder is the backup plan if I can't provide a reasonable solution with the other new types.
I think we should definitely get this into 3.0.
I intend to get this or an equivalent in. I really want it as well as not having this sort of type is blocking adding other APIs that I'm very interested in seeing. :)
I'd expect that the API has a parity with StringBuilder:
I'd expect that the API has a parity with StringBuilder
I don't believe that should be a goal. They're meant for different uses and therefore don't need perfect symmetry. For example, ValueStringBuilder is meant for situations where you're trying to minimize allocation overhead as much as possible, and in such situations, you're not going to want to use a method that takes a params object[].
I think more about what users will replace StringBuilder with ValueStringBuilder in existing codes and can have difficulty due to the lack of some methods.
Not params methods, but Clear() is reasonable, and also Remove and Replace given it already has O(n) methods like Insert.
AppendFormat() can be useful too. Example https://source.dot.net/#System.Runtime.WindowsRuntime/System/Windows/UI/Color.cs,102
AppendFormat() does a lot of work (and equally requires a lot of code to implement). This type is more narrowly targeted at scenarios where you want to accept a bit more complexity for the most possible performance, in that scenario you are likely to handle any formatting yourself.
@iSazonov Having format methods on this in the future is definitely something I'm looking at. I'll be putting a prototype of one in corefxlab very soon. Doing formatting in a non-allocating way is a little tricky given boxing, etc. As such it will take a little while to shake out and I don't want to hold up the core functionality here. I'll make sure to tag you on the prototype PR.
StringBuilder
is optimized for convenience and broad range of scenarios. It will continue to work better than ValueStringBuilder
in many case.
ValueStringBuilder
is not optimized for convenience. It is optimized for building small strings (<1kB) where the work required to format the string is small and fixed StringBuilder
overhead is too high.
where the work required to format the string is small
Could you please clarify that do you mean? I'd expect that formatting is intended primarily for people and result string will definitely be smaller 1 Kb like in example I pointed above.
Formatting strings and string interpolation as implemented in .NET today are convenience that you pay quite a bit for.
String formatting/interpolation that is both convenient to use and efficient as handwritten formatting would require C# compiler changes, I think.
We believe that without a feature that forces consumers to pass value types by reference (either language feature or in-box analyzer) this will cause unreliable applications due to corruption of the array pool when accidental copies are being made.
@JeremyKuhne do you want to drive that discussion with @jaredpar?
@JeremyKuhne do you want to drive that discussion with @jaredpar?
Yes, we need to have a way to force structs to be only passed by reference or (possibly) stack only classes. One possibility for forced by-ref structs that has been raised is ref local
.
One possibility for forced by-ref structs that has been raised is
ref local
.
I'm sad it's not ref ref
😉
Since the language team has expressed interest in allowing more features that might need new runtime constructs, it seems best to consider "Non-Copyable structs" (or maybe "ref
Fields" instead) on a list of features that may warrant future-CLR enforcement.
A few months back I implemented basically this, thinking it was a good idea as well. Exact implementation will cause performance to vary, so take it with a grain of salt, but I was found was roughly:
For allocations of 5 chars long, less than 12 allocations was faster to use the traditional StringBuilder
(tSB). For allocations less than 5 chars long, it was almost always faster to use the tSB. For allocations more than 10 chars long, it was almost always faster to use the tSB. For allocations of 5 chars long, only between 12 to 41 allocations were faster with the value/ref StringBuilder
(vSB). For allocations of 8 chars long, only between 18 to 22 allocations were faster for the vSB.
Hopefully you're getting the picture without me needing to clone the old commit and recreate the 3D graphs. The amount of situations where the vSB was actually higher performing were extremely limited, and not clearly or obviously defined.
Furthmore, in every single code base I have where this might have been useful, I worked out ways to just use an array/buffer, where even additional computations or value passing still outperformed the vSB in every single instance.
So, it might be useful, but as a heads up to whoever implements this, you're gonna need to benchmark the ever living **** out of it, and document the small hole where it's actually useful.
@Entomy by faster, presumably you mean allocation. There is also the garbage to consider.
@danmosemsft Right, absolutely. That's basically the point I'm trying to make, is that this is complicated and it seems like there's only a very small, and unusual, range in which this is viable.
there's only a very small, and unusual
Reduction in allocation often has little-to-no measurable impact on serialized benchmark throughput. The benefit of allocation reduction when it comes to throughput is generally about the whole app running at scale.
This is marked blocked because it potentially requires language features. We need a way to enforce that structs are passed only by reference so that we don't double return arrays to the pool.
We should consider making a value based StringBuilder to allow low allocation building of strings. While Span allows you to provide a writable buffer, in many scenarios we have a need to get or build strings and we don't know precisely how much space will be needed ahead of time. Having an abstraction that can grow beyond a given initial buffer is particularly useful as it doesn't require looping with
Try*
APIs- which can be both complicated and have negative performance implications.We currently use if needed.
ValueStringBuilder
for this purpose internally. It starts with an optional initial buffer (which we often stackalloc) and will grow using ArrayPoolDesign Goals
char* szValue
)API
Here is the proposed API:
This is the current shape of our internal
ValueStringBuilder
:Sample Code
Here is a common pattern on an API that could theoretically be made public if ValueStringBuilder was public: (Although we would call this one GetFullUserName or something like that.)
https://github.com/dotnet/corefx/blob/050bc33738887d9d8fcc9bc5965b7d9ca65bc7f4/src/System.Runtime.Extensions/src/System/Environment.Win32.cs#L40-L56
The caller is above this method:
https://github.com/dotnet/corefx/blob/050bc33738887d9d8fcc9bc5965b7d9ca65bc7f4/src/System.Runtime.Extensions/src/System/Environment.Win32.cs#L13-L38
Usage of AppendSpan:
https://github.com/dotnet/corefx/blob/3538128fa1fb2b77a81026934d61cd370a0fd7f5/src/System.Runtime.Numerics/src/System/Numerics/BigNumber.cs#L550-L560
I'll add more usage details and possible API surface area.
Notes
bool TryGet*(Span)
overload? (see https://github.com/dotnet/coreclr/pull/17097/files#r176560435)Dispose()
in ausing
statement? (https://github.com/dotnet/csharplang/issues/93)AppendFormat
overloads?AppendSpan()
is a little tricky to grok- is there a better term/pattern?