Closed stephentoub closed 1 year ago
Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Runtime` |
Milestone: | 8.0.0 |
Looks very convenient! Thanks!
Please clarify, if we stackalloc N+1 will last element contains full tail?
Based on "the consumer knows how many split values are expected, wants to extract the Nth value, etc." and examples above maybe return an enum for more readability? E.g. return "Success", "Overflow", ... (and result count in out parameter if needed)?
@stephentoub The API shown above is close to what we have developed internally. But it's missing a variant which has proven popular, which is a callback approach. The callback eliminates the need to allocate any span and lets the code process things on-the-fly in a very efficient way.
Here's the full API surface we implemented:
public static partial class StringSplitExtensions
{
// core overloads, matching in general combination of features what String.Split offers
public static bool TrySplit(this ReadOnlySpan<char> input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this ReadOnlySpan<char> input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
// simple wrappers for the above intended to help IntelliSense discovery
public static bool TrySplit(this string input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static bool TrySplit(this string input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
// a callback invoked as ranges are discovered in the input
public delegate void SplitVisitor<T>(ReadOnlySpan<char> split, int rangeNum, T context);
// similar set of overloads as above, but this time with callback semantics
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
// and once again for strings, to help IntelliSense discovery
public static void VisitSplits<TContext>(this string input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this string input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>(this string input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>( this string input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
public static void VisitSplits<TContext>( this string input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
}
// makes working with the above cleaner for app code
public static class RangeExtensions
{
public static ReadOnlySpan<char> AsSpan(this string value, Range range) => value.AsSpan(range.Index, range.Count);
public static string Substring(this string value, Range range) => value.AsSpan(range).ToString();
public static ReadOnlySpan<char> Slice(this ReadOnlySpan<char> value, Range range) => value.Slice(range.Index, range.Count);
}
But it's missing a variant which has proven popular, which is a callback approach.
Why is a callback approach better than an enumeration approach ala https://github.com/dotnet/runtime/issues/934#issuecomment-1165864043? With enumeration you're not constantly invoking delegates, you don't have to worry about complicated mechanisms to thread state into the delegate (or incur allocation from state captured implicitly), you have normal language semantics available around breaking out of the loop early if desired, etc.
True, I initially didn't consider the enumeration pattern. That does look better.
Please clarify, if we stackalloc N+1 will last element contains full tail?
Let's say you have the input "a,b,c"
and a separator ,
. My intent with the API is you'd get the following...
Calling this with dest.Length == 1, you'd get: dest[0] == 0..5 returning 1.
Calling this with dest.Length == 2, you'd get: dest[0] == 0..1 dest[1] == 2..5 returning 2.
Calling this with dest.Length == 3, you'd get: dest[0] == 0..1 dest[1] == 2..3 dest[2] == 4..5 returning 3.
Calling this with dest.Length == 4, you'd get: dest[0] == 0..1 dest[1] == 2..3 dest[2] == 4..5 dest[3] == unwritten returning 3.
Basically exactly the same semantics as string.Split where count == dest.Length. This should help make it easier for code to switch from string.Split.
Thanks for clarify! Look very good. This immediately triggers the need for #934 in cases where we do not know how many split values are expected.
I have concerns about the examples called out in the original message. In both example (1) and example (2), I find the "after" code to be less readable than the "before" code.
(And in fact the "after" code for example (1) could stack overflow the process unless the stackalloc is moved to the top of the method, which means the declaration is significantly separated from its usage.)
This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.
Do we need an analyzer or something else here to help callers avoid bugs?
This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.
This ends up mapping exactly to the semantics of the existing string.Split when you provide a count. In fact, in my local branch for this, I just augmented all of the existing string.Split tests to also test the new method. So anywhere you provide a count to string.Split today, if you pass in new Range[count]
, that array will be filled in with ranges for exactly the same subsets as are returned in the new string[]
array with the method today. If you pass in N+1 to string.Split today, you'll get back a string[] with Length == N, and with the new method, count == N.
I find the "after" code to be less readable than the "before" code.
It may be, but it also allocates much less. What would you recommend instead? This is as close as I could come to what we have today with string.Split while still being allocation-free and achieving all of the same scenarios where the expected number of parts is known in advance (for cases where there's no upper bound known on the number of parts, we can use an enumeration model as outlined in the other issue, but that's very difficult to work with for the common scenarios where you do know the number of parts in advance, want to validate the number of parts parsed, etc.)
What would you recommend instead?
I don't have a general-purpose recommendation appropriate for BCL, unfortunately. Like many others, I wrote my own helpers for use within my own projects, but they're opinionated. For example, here's a sketch of a helper I have which allows writing (var a, var b, ...) = theSpan.Split(',');
and matching exactly the number of elements you request.
static SplitHelper MemoryExtensions.Split(this ROS<char> text, char separator);
ref struct SplitHelper
{
[EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b);
[EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c);
[EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c, out ROS<char> d);
// ...
}
Since you're trying to support considerably more scenarios (variable number of matches, various failure conditions, etc.) these opinionated helpers won't really help.
But back to the issue at hand, we do have several APIs which take a destination buffer as an argument and populate that buffer to the best of its ability, returning an error if the total number of elements would overflow the buffer. Many of the APIs on MemoryExtensions
do just that. My concern is that people might say to themselves "oh, this takes a destination buffer as a parameter, and I'm familiar with that existing pattern!" - expecting the API to fail on overrun, since that's the pattern most closely matched by the proposed API surface.
But the proposed API actually matches the behavior of string.Split
, where tail elements are collapsed into the final returned element. This behavior makes sense for people accustomed to string.Split
, but not for people accustomed to all the rest of the MemoryExtensions
surface area.
This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps. But unfortunately I don't have any good suggestions for you. Maybe the best answer really is just to ship it and ensure the docs are air-tight.
Thanks.
Maybe the best answer really is just to ship it and ensure the docs are air-tight.
π
This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps.
My first thoughts were about Try...()
patterns too but description of the proposal explicitly explains that it is overhead.
To reduce the chance of misunderstanding we could (1) make this completely consistent with #934, (2) by finding more informative names for methods and parameters.
For example, using "Range" in "SplitAsRange" looks superfluous since this information is already in the result type. Perhaps there is a way to briefly say that we are getting exactly the split we are asking for? Maybe SplitExactly
?
AsRanges()
suffixnamespace System;
public static class MemoryExtensions
{
public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}
Is there any reason we didn't consider byte overloads? Our utf8 plans have shifted from a separate type to just embracing byte in more APIs, feels like we should also add ReadOnlySpan<byte>
overloads as well?
We discussed it. General conclusion was we'd be fine adding such overloads subsequently if ASP.NET or others have demonstrated places where it'd be used and we'd include something in the signature (method name, parameter names, Encoding parameter, whatever) that indicated we're dealing with UTF8 text (as opposed to, for example, just having a generic overload that happened to work with bytes).
Make sense to consider using IndexOfAnyValues in the APIs?
Make sense to consider using IndexOfAnyValues in the APIs?
I don't see how, but there's already a PR up; feel free to comment on if you think there's a missed opportunity.
Hmm, if classic Split now uses IndexOfAnyValues to cache probabilistic map why the API cannot?
if classic Split now uses IndexOfAnyValues to cache probabilistic map
It doesn't. And in the PR you'll see the APIs are calling into the exact same functions that string.Split does to find the separators.
It would be nice if the API made it easy to tell if there is a possibility of more ranges or the API walked to the end of the source. When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not. i.e. should the API return OperationStatus?
When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not.
This is why for that use case the intent is to pass in a size one larger than the number of ranges you care about. That way, if it returns the desired number of ranges, you know there wasn't anything extraneous. You can see several examples of this in the PR providing the implementation here: https://github.com/dotnet/runtime/pull/79048/files
I am talking about a situation when I don't know how many ranges I care about. I want to process all segments and I don't know how many segments there are. With the current APIs, I have to write something like the following
while(true) {
var count = source.Split(destination, ',');
for (int i=0; i < count; i++)
{
Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
}
if (count != size) {
break;
}
source = source.Slice(destination[size - 1].End.Value + 1);
}
with OperationStatus-returning APIs:
while(true) {
var status = source.Split(destination, ',', out int written, out int read);
for (int i = 0; i < written; i++)
{
Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
}
if (status == OperationStatus.Done) break;
source = source.Slice(read);
}
Having said that, I think the main help I had in mind is in the read
out parameter. The source.Slice(destination[size - 1].End.Value + 1)
slice statement took me some time to figure out.
I am talking about a situation when I don't know how many ranges I care about.
This API isn't for that. https://github.com/dotnet/runtime/issues/934 is for that. Your example with the proposed API there then would be more like:
foreach (Range r in source.Split(','))
{
Console.WriteLine(Encoding.UTF8.GetString(source[r]));
}
Do we have an API to fix @KrzysztofCwalina bracing style? π
@terrajobst, I am just following the guidelines in FDG ed 2 page 364 section A.1.1. Should I not follow these? I not, then how do I [cherry] pick which guidelines to follow? :-)
Do those guidelines use different bracing styles for for-loops and while-loops? π±π³
Quoting the last guideline form page 364: "Do place the closing brace on its own line unless followed by an else, else if, or while statement." ... so the answer is yes.
Background and motivation
https://github.com/dotnet/runtime/issues/934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.
Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over
ReadOnlySpan<T>
in a way and stores the resulting ranges into a provided location, that then also works withSpan<T>
,{ReadOnly}Memory<T>
, andString
, that doesn't allocate, that automates the retrieval of N values, etc.API Proposal
Spit{Any}
, and have the "ranges" aspect of it be implicit in taking aSpan<Range>
parameter.destination, separator
vsseparator, destination
? I went withdestination, separator
so that the configuration-related data (separator and options) are next to each other, but that then does differ from string.Split, where the separator is first.System.Range
values written into destination. Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of NRange
instance. Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1Range
instances, and validate that the returned count was N.System.Range
isunmanaged
and can be stackalloc'd.Range
instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.API Usage
Examples...
Instead of:
this code could be:
Instead of:
this could be:
Instead of:
this could be:
Alternative Designs
https://github.com/dotnet/runtime/issues/934#issuecomment-1165864043
Risks
No response