[API Proposal]: Alternative span/memory/string splitting API

stephentoub commented 2 years ago

Background and motivation

https://github.com/dotnet/runtime/issues/934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.

Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over ReadOnlySpan<T> in a way and stores the resulting ranges into a provided location, that then also works with Span<T>, {ReadOnly}Memory<T>, and String, that doesn't allocate, that automates the retrieval of N values, etc.

API Proposal

namespace System;

public static class MemoryExtensions
{
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}

Naming: we could just call these Spit{Any}, and have the "ranges" aspect of it be implicit in taking a Span<Range> parameter.
Argument ordering: destination, separator vs separator, destination? I went with destination, separator so that the configuration-related data (separator and options) are next to each other, but that then does differ from string.Split, where the separator is first.
The methods all return the number of System.Range values written into destination. Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N Range instance. Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 Range instances, and validate that the returned count was N.
System.Range is unmanaged and can be stackalloc'd.
The stored Range instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.

API Usage

Examples...

https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169

Instead of:

                        string[] dependentParts = dependent.Split(',');

                        if (dependentParts.Length != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]);

this code could be:

                        Span<Range> dependentParts = stackalloc Range[4];
                        ...
                        if (dependent.AsSpan().SplitAsRanges(dependentParts, ',') != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]);

https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105

Instead of:

            var strLatLong = Uri.Substring(4).Split(',');
            if (strLatLong.Length != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
                _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

this could be:

            Span<Range> strLatLong = stackalloc Range[3];
            ReadOnlySpan<char> span = Uri.AsSpan(4);
            if (span.Split(strLatLong, ',') != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = double.Parse(span[strLatLong[0]], provider: CultureInfo.InvariantCulture);
                _longitude = double.Parse(span[strLatLong[1]], provider: CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

https://github.com/dotnet/runtime/blob/b0920912512aa7e205363b046a1911e6a97df31f/src/libraries/System.Private.CoreLib/src/System/TimeZoneInfo.Unix.NonAndroid.cs#L86-L118

Instead of:

                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"

                            int firstTabIndex = zoneTabFileLine.IndexOf('\t');
                            if (firstTabIndex >= 0)
                            {
                                int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
                                if (secondTabIndex >= 0)
                                {
                                    string timeZoneId;
                                    int startIndex = secondTabIndex + 1;
                                    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
                                    if (thirdTabIndex >= 0)
                                    {
                                        int length = thirdTabIndex - startIndex;
                                        timeZoneId = zoneTabFileLine.Substring(startIndex, length);
                                    }
                                    else
                                    {
                                        timeZoneId = zoneTabFileLine.Substring(startIndex);
                                    }

                                    if (!string.IsNullOrEmpty(timeZoneId))
                                    {
                                        timeZoneIds.Add(timeZoneId);
                                    }
                                }
                            }
                        }
                    }

this could be:

                    Span<Range> ranges = stackalloc Range[4];
                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (zoneTabFileLine.StartsWith('#'))
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
                            int found = zoneTabFileLine.SplitAsRanges(ranges, '\t');
                            if (found >= 3)
                            {
                                timeZoneId = zoneTabFileLine[ranges[3]];
                                if (timeZoneId.Length != 0)
                                {
                                    timeZoneIds.Add(timeZoneId);
                                }
                            }
                        }
                    }

Alternative Designs

https://github.com/dotnet/runtime/issues/934#issuecomment-1165864043

Risks

No response

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details

### Background and motivation https://github.com/dotnet/runtime/issues/934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc. Either instead of or in addition to (if we also want an enumerator syntax), we can offer a SplitAsRanges method that operates over `ReadOnlySpan` in a way and stores the resulting ranges into a provided location, that then also works with `Span`, `{ReadOnly}Memory`, and `String`, that doesn't allocate, that automates the retrieval of N values, etc. ### API Proposal ```diff namespace System; public static class MemoryExtensions { + public static int SplitAsRanges(this ReadOnlySpan source, char separator, StringSplitOptions options, Span destination) where T : IEquatable?; + public static int SplitAsRanges(this ReadOnlySpan source, ReadOnlySpan separator, StringSplitOptions options, Span destination) where T : IEquatable?; + public static int SplitAnyAsRanges(this ReadOnlySpan source, ReadOnlySpan separators, StringSplitOptions options, Span destination) where T : IEquatable?; + public static int SplitAsRanges(this ReadOnlySpan source, T separator, Span destination) where T : IEquatable?; + public static int SplitAsRanges(this ReadOnlySpan source, ReadOnlySpan separator, Span destination) where T : IEquatable?; + public static int SplitAnyAsRanges(this ReadOnlySpan source, ReadOnlySpan separators, Span destination) where T : IEquatable?; // I'm not sure it's worth having the following (just to enable extension syntax on `Span`), but we could for consistency. // My expection is the vast majority use will be with `ReadOnlySpan` due to `string.AsSpan()` producing that. + public static int SplitAsRanges(this Span source, T separator, Span destination) where T : IEquatable?; + public static int SplitAsRanges(this Span source, ReadOnlySpan separator, Span destination) where T : IEquatable?; + public static int SplitAnyAsRanges(this Span source, ReadOnlySpan separators, Span destination) where T : IEquatable?; } ``` - The methods all return the number of `System.Range` values written into destination. - `System.Range` is `unmanaged` and can be stackalloc'd. - Use that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N `Range` instance. - Use that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 `Range` instances, and validate that the returned count was N. - The stored `Range` instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner. ### API Usage Examples... 1. https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169 Instead of: ```C# string[] dependentParts = dependent.Split(','); if (dependentParts.Length != 3) { Log?.LogMessage($"Skipping dependent: {dependent}"); continue; } try { SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]); ``` this code could be: ```C# Span dependentParts = stackalloc Range[4]; ... if (dependent.AsSpan().SplitAsRanges(',', dependentParts) != 3) { Log?.LogMessage($"Skipping dependent: {dependent}"); continue; } try { SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]); ``` 2. https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105 Instead of: ```C# var strLatLong = Uri.Substring(4).Split(','); if (strLatLong.Length != 2) { throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload"); } try { _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture); _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture); } catch (Exception ex) when (ex is FormatException || ex is OverflowException) { throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload"); } ``` this could be: ```C# Span strLatLong = stackalloc Range[3]; ReadOnlySpan span = Uri.AsSpan(4); if (span.Split(',', strLatLong) != 2) { throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload"); } try { _latitude = double.Parse(span[strLatLong[0], provider: CultureInfo.InvariantCulture); _longitude = double.Parse(span[strLatLong[1], provider: CultureInfo.InvariantCulture); } catch (Exception ex) when (ex is FormatException || ex is OverflowException) { throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload"); } ``` 3. https://github.com/dotnet/runtime/blob/b0920912512aa7e205363b046a1911e6a97df31f/src/libraries/System.Private.CoreLib/src/System/TimeZoneInfo.Unix.NonAndroid.cs#L86-L118 Instead of: ```C# while ((zoneTabFileLine = sr.ReadLine()) != null) { if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#') { // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments" int firstTabIndex = zoneTabFileLine.IndexOf('\t'); if (firstTabIndex >= 0) { int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1); if (secondTabIndex >= 0) { string timeZoneId; int startIndex = secondTabIndex + 1; int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex); if (thirdTabIndex >= 0) { int length = thirdTabIndex - startIndex; timeZoneId = zoneTabFileLine.Substring(startIndex, length); } else { timeZoneId = zoneTabFileLine.Substring(startIndex); } if (!string.IsNullOrEmpty(timeZoneId)) { timeZoneIds.Add(timeZoneId); } } } } } ``` this could be: ```C# Span ranges = stackalloc Range[4]; while ((zoneTabFileLine = sr.ReadLine()) != null) { if (zoneTabFileLine.StartsWith('#')) { // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments" int found = zoneTabFileLine.SplitAsRanges('\t', ranges); if (found >= 3) { timeZoneId = zoneTabFileLine[ranges[3]]; if (timeZoneId.Length != 0) { timeZoneIds.Add(timeZoneId); } } } } ``` ### Alternative Designs https://github.com/dotnet/runtime/issues/934#issuecomment-1165864043 ### Risks _No response_

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Runtime`
Milestone:	8.0.0

iSazonov commented 2 years ago

Looks very convenient! Thanks!

Please clarify, if we stackalloc N+1 will last element contains full tail?

Based on "the consumer knows how many split values are expected, wants to extract the Nth value, etc." and examples above maybe return an enum for more readability? E.g. return "Success", "Overflow", ... (and result count in out parameter if needed)?

geeknoid commented 1 year ago

@stephentoub The API shown above is close to what we have developed internally. But it's missing a variant which has proven popular, which is a callback approach. The callback eliminates the need to allocate any span and lets the code process things on-the-fly in a very efficient way.

Here's the full API surface we implemented:

public static partial class StringSplitExtensions
{
    // core overloads, matching in general combination of features what String.Split offers
    public static bool TrySplit(this ReadOnlySpan<char> input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this ReadOnlySpan<char> input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);

    // simple wrappers for the above intended to help IntelliSense discovery
    public static bool TrySplit(this string input, char separator, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, ReadOnlySpan<char> separators, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, string[] separators, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, string separator, Span<Range> result, out int numRanges, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static bool TrySplit(this string input, Span<Range> result, out int numRanges, StringSplitOptions options = StringSplitOptions.None);

    // a callback invoked as ranges are discovered in the input
    public delegate void SplitVisitor<T>(ReadOnlySpan<char> split, int rangeNum, T context);

    // similar set of overloads as above, but this time with callback semantics
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this ReadOnlySpan<char> input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);

    // and once again for strings, to help IntelliSense discovery
    public static void VisitSplits<TContext>(this string input, char separator, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this string input, ReadOnlySpan<char> separators, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>(this string input, string[] separators, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>( this string input, string separator, TContext context, SplitVisitor<TContext> visitor, StringComparison comparison = StringComparison.Ordinal, StringSplitOptions options = StringSplitOptions.None);
    public static void VisitSplits<TContext>( this string input, TContext context, SplitVisitor<TContext> visitor, StringSplitOptions options = StringSplitOptions.None);
}

// makes working with the above cleaner for app code
public static class RangeExtensions
{
    public static ReadOnlySpan<char> AsSpan(this string value, Range range) => value.AsSpan(range.Index, range.Count);
    public static string Substring(this string value, Range range) => value.AsSpan(range).ToString();
    public static ReadOnlySpan<char> Slice(this ReadOnlySpan<char> value, Range range) => value.Slice(range.Index, range.Count);
}

stephentoub commented 1 year ago

But it's missing a variant which has proven popular, which is a callback approach.

Why is a callback approach better than an enumeration approach ala https://github.com/dotnet/runtime/issues/934#issuecomment-1165864043? With enumeration you're not constantly invoking delegates, you don't have to worry about complicated mechanisms to thread state into the delegate (or incur allocation from state captured implicitly), you have normal language semantics available around breaking out of the loop early if desired, etc.

geeknoid commented 1 year ago

True, I initially didn't consider the enumeration pattern. That does look better.

stephentoub commented 1 year ago

Please clarify, if we stackalloc N+1 will last element contains full tail?

Let's say you have the input "a,b,c" and a separator ,. My intent with the API is you'd get the following...

Calling this with dest.Length == 1, you'd get: dest[0] == 0..5 returning 1.

Calling this with dest.Length == 2, you'd get: dest[0] == 0..1 dest[1] == 2..5 returning 2.

Calling this with dest.Length == 3, you'd get: dest[0] == 0..1 dest[1] == 2..3 dest[2] == 4..5 returning 3.

Calling this with dest.Length == 4, you'd get: dest[0] == 0..1 dest[1] == 2..3 dest[2] == 4..5 dest[3] == unwritten returning 3.

Basically exactly the same semantics as string.Split where count == dest.Length. This should help make it easier for code to switch from string.Split.

iSazonov commented 1 year ago

Thanks for clarify! Look very good. This immediately triggers the need for #934 in cases where we do not know how many split values are expected.

GrabYourPitchforks commented 1 year ago

I have concerns about the examples called out in the original message. In both example (1) and example (2), I find the "after" code to be less readable than the "before" code.

(And in fact the "after" code for example (1) could stack overflow the process unless the stackalloc is moved to the top of the method, which means the declaration is significantly separated from its usage.)

This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.

Do we need an analyzer or something else here to help callers avoid bugs?

stephentoub commented 1 year ago

This might just be due to the mental shift from needing to stackalloc N + 1 entries if you're trying to match exactly N elements. This pattern doesn't really match anything else in .NET and seems like it presents opportunity for callers to introduce bugs into their own code.

This ends up mapping exactly to the semantics of the existing string.Split when you provide a count. In fact, in my local branch for this, I just augmented all of the existing string.Split tests to also test the new method. So anywhere you provide a count to string.Split today, if you pass in new Range[count], that array will be filled in with ranges for exactly the same subsets as are returned in the new string[] array with the method today. If you pass in N+1 to string.Split today, you'll get back a string[] with Length == N, and with the new method, count == N.

I find the "after" code to be less readable than the "before" code.

It may be, but it also allocates much less. What would you recommend instead? This is as close as I could come to what we have today with string.Split while still being allocation-free and achieving all of the same scenarios where the expected number of parts is known in advance (for cases where there's no upper bound known on the number of parts, we can use an enumeration model as outlined in the other issue, but that's very difficult to work with for the common scenarios where you do know the number of parts in advance, want to validate the number of parts parsed, etc.)

GrabYourPitchforks commented 1 year ago

What would you recommend instead?

I don't have a general-purpose recommendation appropriate for BCL, unfortunately. Like many others, I wrote my own helpers for use within my own projects, but they're opinionated. For example, here's a sketch of a helper I have which allows writing (var a, var b, ...) = theSpan.Split(','); and matching exactly the number of elements you request.

static SplitHelper MemoryExtensions.Split(this ROS<char> text, char separator);
ref struct SplitHelper
{
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b);
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c);
    [EB(Never)] void Deconstruct(out ROS<char> a, out ROS<char> b, out ROS<char> c, out ROS<char> d);
    // ...
}

Since you're trying to support considerably more scenarios (variable number of matches, various failure conditions, etc.) these opinionated helpers won't really help.

But back to the issue at hand, we do have several APIs which take a destination buffer as an argument and populate that buffer to the best of its ability, returning an error if the total number of elements would overflow the buffer. Many of the APIs on MemoryExtensions do just that. My concern is that people might say to themselves "oh, this takes a destination buffer as a parameter, and I'm familiar with that existing pattern!" - expecting the API to fail on overrun, since that's the pattern most closely matched by the proposed API surface.

But the proposed API actually matches the behavior of string.Split, where tail elements are collapsed into the final returned element. This behavior makes sense for people accustomed to string.Split, but not for people accustomed to all the rest of the MemoryExtensions surface area.

This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps. But unfortunately I don't have any good suggestions for you. Maybe the best answer really is just to ship it and ensure the docs are air-tight.

stephentoub commented 1 year ago

Thanks.

Maybe the best answer really is just to ship it and ensure the docs are air-tight.

👍

iSazonov commented 1 year ago

This mismatch strikes me as something that will confuse our consumers and introduce bugs into their apps.

My first thoughts were about Try...() patterns too but description of the proposal explicitly explains that it is overhead.

To reduce the chance of misunderstanding we could (1) make this completely consistent with #934, (2) by finding more informative names for methods and parameters.

For example, using "Range" in "SplitAsRange" looks superfluous since this information is already in the result type. Perhaps there is a way to briefly say that we are getting exactly the split we are asking for? Maybe SplitExactly?

terrajobst commented 1 year ago

Video

Let's drop the AsRanges() suffix

namespace System;

public static class MemoryExtensions
{
    public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
    public static int Split(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
    public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
    public static int SplitAny(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}

davidfowl commented 1 year ago

Is there any reason we didn't consider byte overloads? Our utf8 plans have shifted from a separate type to just embracing byte in more APIs, feels like we should also add ReadOnlySpan<byte> overloads as well?

stephentoub commented 1 year ago

We discussed it. General conclusion was we'd be fine adding such overloads subsequently if ASP.NET or others have demonstrated places where it'd be used and we'd include something in the signature (method name, parameter names, Encoding parameter, whatever) that indicated we're dealing with UTF8 text (as opposed to, for example, just having a generic overload that happened to work with bytes).

iSazonov commented 1 year ago

Make sense to consider using IndexOfAnyValues in the APIs?

stephentoub commented 1 year ago

Make sense to consider using IndexOfAnyValues in the APIs?

I don't see how, but there's already a PR up; feel free to comment on if you think there's a missed opportunity.

iSazonov commented 1 year ago

~~Hmm, if classic Split now uses IndexOfAnyValues to cache probabilistic map why the API cannot?~~

stephentoub commented 1 year ago

if classic Split now uses IndexOfAnyValues to cache probabilistic map

It doesn't. And in the PR you'll see the APIs are calling into the exact same functions that string.Split does to find the separators.

KrzysztofCwalina commented 1 year ago

It would be nice if the API made it easy to tell if there is a possibility of more ranges or the API walked to the end of the source. When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not. i.e. should the API return OperationStatus?

stephentoub commented 1 year ago

When the API returns destination.Length, it seems to be difficult to tell whether the walk is done or not.

This is why for that use case the intent is to pass in a size one larger than the number of ranges you care about. That way, if it returns the desired number of ranges, you know there wasn't anything extraneous. You can see several examples of this in the PR providing the implementation here: https://github.com/dotnet/runtime/pull/79048/files

KrzysztofCwalina commented 1 year ago

I am talking about a situation when I don't know how many ranges I care about. I want to process all segments and I don't know how many segments there are. With the current APIs, I have to write something like the following

while(true) { 
    var count = source.Split(destination, ',');
    for (int i=0; i < count; i++)
    {
        Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
    }
    if (count != size) {
        break;
    }
    source = source.Slice(destination[size - 1].End.Value + 1);
}

with OperationStatus-returning APIs:

while(true) { 
    var status = source.Split(destination, ',', out int written, out int read);
    for (int i = 0; i < written; i++)
    {
        Console.WriteLine(Encoding.UTF8.GetString(source[destination[i]]));
    }
    if (status == OperationStatus.Done) break;
    source = source.Slice(read);
}

Having said that, I think the main help I had in mind is in the read out parameter. The source.Slice(destination[size - 1].End.Value + 1) slice statement took me some time to figure out.

stephentoub commented 1 year ago

I am talking about a situation when I don't know how many ranges I care about.

This API isn't for that. https://github.com/dotnet/runtime/issues/934 is for that. Your example with the proposed API there then would be more like:

foreach (Range r in source.Split(','))
{
    Console.WriteLine(Encoding.UTF8.GetString(source[r]));
}

terrajobst commented 1 year ago

Do we have an API to fix @KrzysztofCwalina bracing style? 😏

KrzysztofCwalina commented 1 year ago

@terrajobst, I am just following the guidelines in FDG ed 2 page 364 section A.1.1. Should I not follow these? I not, then how do I [cherry] pick which guidelines to follow? :-)

terrajobst commented 1 year ago

Do those guidelines use different bracing styles for for-loops and while-loops? 😱😳

KrzysztofCwalina commented 1 year ago

Quoting the last guideline form page 364: "Do place the closing brace on its own line unless followed by an else, else if, or while statement." ... so the answer is yes.

dotnet / runtime