dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

Add 'split' support for ReadOnlySpan<char> similar to string #934

Closed ahsonkhan closed 3 months ago

ahsonkhan commented 6 years ago

Edited by @stephentoub on 6/26/2024:

public static class MemoryExtensions
{
    // Alternative name: EnumerateSplits, but not sure what SplitAny would be called
+   public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, T separator);
+   public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> source, ReadOnlySpan<T> separator);
+   public static SpanSplitEnumerator<T> SplitAny<T>(this ReadOnlySpan<T> source, params ReadOnlySpan<T> separators);

    // Optional:
+   public static SpanSplitEnumerator<char> SplitAny(this ReadOnlySpan<char> source, params ReadOnlySpan<string> separators);

+   public ref struct SpanSplitEnumerator<T>
+   {
+       public StringSplitEnumerator<T> GetEnumerator();
+       public bool MoveNext();
+       public Range Current { get; }
+   }
}

Older proposals:

partial class MemoryExtensions
{
    public static SpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> span, T separator,
        StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

    public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
    {
        public SpanSplitEnumerator<T> GetEnumerator() { return this;  }
        public bool MoveNext();
        public Range Current { get; }
    }
}

Previously approved API Proposal

public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator GetEnumerator() { return this;  }
    public bool MoveNext();
    public ReadOnlySpan<T> Current { get; }
}

Split off from https://github.com/dotnet/corefx/issues/21395#issuecomment-359342832

From @Joe4evr on January 21, 2018 23:16

Can I throw in another suggestion? I'd really like to see some ability to split a ReadOnlySpan<char>. Obviously, you can't return a collection of Spans directly, but isn't that what Memory<T> is for?

// equivalent to the overloads of 'String.Split()'
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator, int count);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator, int count, StringSplitOptions options);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, char[] seperator, StringSplitOptions options);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, string[] seperator, int count, StringSplitOptions options);
public static IReadOnlyList<ReadOnlyMemory<char>> Split(this ReadOnlySpan<char> span, string[] seperator, StringSplitOptions options);

The reason for choosing IReadOnlyList<T> is to make the resulting collection indexable, just like string[]. It would be nice if the implementation is ImmutableArray<T>, but I'm not sure if that's a concern for distribution and such.

From @ahsonkhan on January 22, 2018 01:36

@Joe4evr, out of curiosity, do you have a scenario atm where these APIs would be useful? If so, can you please show the code sample?

I would replace the char[] overloads with ReadOnlySpan\<char>. However, I am not sure about adding the split APIs, in general, given they have to allocate. Is there a way to avoid allocating? Also, given these are string-like APIs for span, it is strange to have an overload that takes a string[]. Maybe all these would fit better on ReadOnlyMemory instead, especially given the return type.

From @Joe4evr on January 22, 2018 08:35

My scenario is taking a relatively big string of user input and then parsing that to populate a more complex object. So rather than take the whole string at once, I'd like to parse it in pieces at a time. It'd be pretty nice if this can be facilitated by the Span/Memory<T> APIs so that this code won't have to allocate an extra 30-40 tiny strings whenever it runs.

Admittedly, I only started on this particular case earlier today, mostly to experiment and find out how much I could get out of the Span APIs at this time.

Maybe it was a bit naive of me to expect a collection like I did, but I'd at least like to see some API to deal with this scenario a little easier, because I'll probably not be the only one looking to split a span up into smaller chunks like this.

From @stephentoub on January 22, 2018 08:41

Splitting support would be good, but I don't think it would look like the proposed methods; as @ahsonkhan points out, that would result in a lot of allocation (including needing to copy the whole input string to the heap, since you can't store the span into a returned interface implementation).

I would instead expect a design more like an iterator implemented as a ref struct, e.g.

public ref struct CharSpanSplitter
{
public CharSpanSplitter(ReadOnlySpan<char> value, char separator, StringSplitOptions options);
public bool TryMoveNext(out ReadOnlySpan<char> result);
}

cc @KrzysztofCwalina, @stephentoub, @Joe4evr

Joe4evr commented 6 years ago

Presumed usage would be like this?

var splitter = span.Split(',', StringSplitOptions.RemoveEmptyEntries);
while (splitter.TryMoveNext(out var slice))
{
    //....
}

Yeah, looks good.

svick commented 6 years ago

Would it be possible to implement the enumerable pattern, even if it's not possible to implement IEnumerable<T>?

I think the interface would look something like:

public static SpanSplitEnumerable Split(this ReadOnlySpan<char> span, char seperator);

public ref struct SpanSplitEnumerable
{
    public SpanSplitEnumerator GetEnumerator();
}

public ref struct SpanSplitEnumerator
{
    public bool MoveNext();
    public ReadOnlySpan<char> Current { get; }
}

That way, the usage could be:

foreach (var slice in span.Split(','))
{
    …
}

If indexing is important, something like ref struct SpanSplitList would probably work, but I think it couldn't guarantee zero allocations.

khellang commented 6 years ago

Sorry, didn't see this issue before posting https://github.com/dotnet/corefx/issues/21395#issuecomment-359802015.

As mentioned, I'd love to see something like Google Guava's CharMatcher for this. It could be useful for other scenarios as well.

grahamehorner commented 6 years ago

It would be great if the split worked also with Span and returned ReadOnlyMemory<ReadOnlyMemory>, ReadOnlySpan<ReadOnlyMemory>

ahsonkhan commented 6 years ago

It would be great if the split worked also with Span and returned ReadOnlyMemory<ReadOnlyMemory>, ReadOnlySpan<ReadOnlyMemory>

Clarifying the comment so the generic types show up: It would be great if the split worked also with Span<byte> and returned ReadOnlyMemory<ReadOnlyMemory<T>>, ReadOnlySpan<ReadOnlyMemory<T>>

LordJZ commented 6 years ago

Simple implementation for SpanSplitEnumerable<T> Split<T>(this ReadOnlySpan<T> span, T separator) interface where the return type is foreachable. Unit tests included.

https://gist.github.com/LordJZ/92b7decebe52178a445a0b82f63e585a

The sentinel is a "clever trick" to avoid having a boolean field, can be replaced with a backing field.

dougbu commented 6 years ago

We needed features like this on top of ASP.NET's StringTokenizer type in the aspnet/WebHooks repo. See that repo's TrimmingTokenizer and have seen some interest (aspnet/WebHooks#324) in making that more broadly available.

It would be great to point customers to something in the BCL instead of (say) adding TrimmingTokenizer features to StringTokenizer. What is the expected timeframe for the "Future" milestone that would include a dotnet/corefx#26528 fix?

/cc @davidfowl

danmoseley commented 6 years ago

@ahsonkhan if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it?

danmoseley commented 6 years ago

cc @JeremyKuhne who has been doing thinking about low allocation string operations.

JeremyKuhne commented 6 years ago

StringBuilder added an enumerator for the next version that we should consider when deciding on a pattern for enumerating spans. https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs#L587

ahsonkhan commented 6 years ago

if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it?

Let me make sure the API shape is clear. Incorporating the recent feedback (to get foreach support, make it generic), here is the API:

public static SpanSplitEnumerable<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerable<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator<T> GetEnumerator();
}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public bool MoveNext();
    public ReadOnlySpan<T> Current { get; }
}
JeremyKuhne commented 6 years ago

@KrzysztofCwalina this is the enumerating spans issue we discussed yesterday.

khellang commented 6 years ago

Does StringSplitOptions really make sense for a generic T?

candoumbe commented 6 years ago

Does StringSplitOptions really make sense for a generic T?

I suppose a SpanSplitOptions which would mimic StringSplitOptions makes more sense here.

KrzysztofCwalina commented 6 years ago

Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types?

svick commented 6 years ago

@KrzysztofCwalina I think it makes the API cleaner and easier to understand. The foreach pattern requires an enumerable and an enumerator, and that's exactly what the proposed API provides.

It there any advantage in combining them, apart from decreasing the size of the API surface?

terrajobst commented 6 years ago

Video

We don't think we want these APIs:

  1. They allocate
  2. They aren't as convenient as String.Split

If you care about allocations, then you want a different API. And if you don't care about allocations, well, then you can just use String.Split.

So what API would we like to see? It could be a struct-based enumerator that allows the consumer to foreach the individual spans without allocations. Alternatively (or additionally) we could have a method that allows the consumer to pass in a buffer with the locations, which, for the most part, could be stackallocated.

schungx commented 6 years ago

We don't think we want these APIs:

1. They allocate

I readily concur with this. If I don't care about allocations, I'll simply be using string.Split. If I care enough to want to avoid allocating substrings, then obviously I would like to avoid any allocations whatsoever.

Suggest ReadOnlySpan<T>.Split returns ReadOnlySpan<ReadOnlySpan<T>> on the stack so there is no more allocation.

If I want to use foreach, which allocations, then I can still enumerate over the ReadOnlySpan<ReadOnlySpan<T>>. Otherwise, I'll use a simple for loop to have zero allocation.

khellang commented 6 years ago

Can you use ReadOnlySpan<T> as a generic argument?

svick commented 6 years ago

@schungx

Suggest ReadOnlySpan<T>.Split returns ReadOnlySpan<ReadOnlySpan<T>> on the stack so there is no more allocation.

Split can't return a Span that was stack allocated by itself. It could take that Span as an additional parameter, but then you have to somehow handle the case where that Span is not big enough. I believe this is what @terrajobst meant by "we could have a method that allows the consumer to pass in a buffer with the locations".

KrzysztofCwalina commented 6 years ago

@terrajobst, the API that @svick proposed above does not seem to allocate. Am I missing something? Or are you saying that we want both convenience and no allocations?

@svick, I wonder how important it is for this enumerator to be easy to understand. The classic enumerator pattern has two types for many reasons that are less and less applicable once you deal with by ref structs that cannot implement enumerator interfaces and get copied when passed around. I do agree we should discuss pros and cons and maybe indeed it's better to have two types, but I wanted to open the discussion as it seems such an overkill to add two by ref types just so we can split.

svick commented 6 years ago

@KrzysztofCwalina I agree that my argument of "this version of the API is slightly easier to understand" is fairly weak in this case. But in my opinion, the argument of "this version of the API is slightly smaller" is even weaker, which is why I prefer to have two types.

But ultimately it doesn't make that much of a difference, as long as I'll be able to have foreach (var slice in span.Split(',')) without allocations, I'll be happy.

schungx commented 6 years ago

@svick I'm suspect a normal foreach will always allocate because the current state must be kept somewhere, even if you somehow come up with value-based iterators...

And what is the purpose of this again? That's to prevent allocations in parsing code in tight loops, where all we're doing is manipulating streams of text without ever changing them.

svick commented 6 years ago

@schungx There's nothing really new to come up with. For example, foreach over a List<T> already does not allocate on the heap, because the current state is kept in the enumerator struct on the stack. A very similar approach can be used here.

ahsonkhan commented 5 years ago
  1. They allocate

It could be a struct-based enumerator that allows the consumer to foreach the individual spans without allocations.

Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types?

The APIs, as suggested here, don't allocate and are essentially what was recommended. Maybe it was missed since it wasn't at the top. Will copy it over to the top post.

Making it into a single type:

public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator GetEnumerator() { return this;  }
    public bool MoveNext();
    public ReadOnlySpan<T> Current { get; }
}
terrajobst commented 5 years ago
Gnbrkm41 commented 5 years ago

Are we going to have an overload that takes ReadOnlySpan as the delimiter as well?

Gnbrkm41 commented 5 years ago

I tried implementing the APIs above, and additionally a two more split methods in my personal repository. The types and methods themselves can be found in ConsoleApp4 (lol) folder, and the tests can be found in SplitTests folder.

public static ReadOnlySpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> span, T delimiter,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitBySequenceEnumerator<T> Split<T>(this ReadOnlySpan<T> span,
    ReadOnlySpan<T> delimiter, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitByAnyEnumerator<T> SplitByAny<T>(this ReadOnlySpan<T> span,
    ReadOnlySpan<T> delimiters, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

The first one replaces string.Split(char), the second one replaces string.Split(string), and the third replaces string.Split(char[]). The last two are not really included in the original discussion (and would take a bit of time to go through approval process even if it does), but I personally think it is worth including to have the complementary alternatives as well.

It's not a great code perhaps, but I'd love to open a PR for this issue (with the code above). Although the last two methods are not approved, at least I can open one for the regular T version.

grant-d commented 5 years ago

A slightly different take on this. Considering that Range is now available, and the non-zero cost of Slice on potentially-many results, what if the enumerable returned Ranges instead of sliced RoS. In other words, a list of delineations. The consumer could then decide when/if to Slice(range)

foreach (var range in span.Split(','))
{
    var x = span.Slice(range);
}

Perhaps that's too abstract for a public api.

bbartels commented 5 years ago

Did some experimenting with this as well. Not sure if it's an optimal solution though. https://github.com/bbartels/coreclr/blob/master/src/System.Private.CoreLib/shared/System/MemoryExtensions.Split.cs

davidfowl commented 5 years ago

@grant-d I like that API

stephentoub commented 5 years ago

That seems reasonable to me.

bbartels commented 5 years ago

Only problem is that it would conflict with the approved api, as only the return type differs.

grant-d commented 5 years ago

Correct; it would need to go back into api triage

stephentoub commented 5 years ago

Returning ranges has some real advantages, namely we can operate on a ReadOnlySpan<T> but then apply the results to other types, in particular Span<T>, Memory<T>, and ReadOnlyMemory<T>, e.g.

foreach (Range r in memory.Span.Split(':')) Process(memory[r]);
ahsonkhan commented 5 years ago

Correct; it would need to go back into api triage

To clarify, can you share the proposed range based API shape and then we can reconcile it/review it? I can update the original post if you provide the precise shape, and we can re-review.

bbartels commented 5 years ago

I suppose it's just substituting Range for ReadOnlySpan\<T> in the Current property. I don't think overloading Slice would work as stephen suggested (Probably just a typo). A ReadOnlySpan.Slice call would be ambiguous, as it already has an overload that takes an int.

So something like:

public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}
stephentoub commented 5 years ago

I don't think overloading Slice would work as stephen suggested (Probably just a typo)

I didn't suggest overloading Slice. If a type has a Slice(int, int) method, the C# 8 compiler now supports passing a Range as an indexer on the type and the compiler expands it into a call to the Slice(int, int) method.

ahsonkhan commented 5 years ago

I didn't suggest overloading Slice.

memory.Span.Slice(':')

I believe this is what's causing the confusion.

bbartels commented 5 years ago

Returning ranges has some real advantages, namely we can operate on a ReadOnlySpan<T> but then apply the results to other types, in particular Span<T>, Memory<T>, and ReadOnlyMemory<T>, e.g.

foreach (Range r in memory.Span.Slice(':')) Process(memory[r]);

I meant in the foreach, you used memory.Span.Slice(':').

stephentoub commented 5 years ago

Ah, sorry, I meant Split, not Slice.

bbartels commented 5 years ago

Maybe it would make sense to add another enumerator, which would allow for splitting by a sequence. Just like string.Split("??") is possible.

public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public static SpanSplitSequenceEnumerator<T> Split(this ReadOnlySpan<T> span, ReadOnlySpan<T> seperators,
    StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}

public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}

public ref struct SpanSplitSequenceEnumerator<T> where T : IEquatable<T>
{
    public SpanSplitSequenceEnumerator<T> GetEnumerator() { return this;  }
    public bool MoveNext();
    public Range Current { get; }
}

Could also be merged into one enumerator that just holds an ReadOnlySpan for separators. Would be a little more space efficient when T is a large struct (> 12 bytes on 64 bit systems), but also a little less space efficient on smaller structures (which I reckon is the most common use-case with char). Maybe something to debate in triage.

NinoFloris commented 5 years ago

A split api that returns ranges is pure genius @grant-d. That would be a really nice api to aim for.

scalablecory commented 5 years ago

Sad to see we haven't made progress on this!

scalablecory commented 5 years ago

Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan<T> span)?

This will enable splitting on surrogate pairs etc.

Gnbrkm41 commented 5 years ago

@scalablecory, how about Rune? (and ReadOnlySpan of Rune)

scalablecory commented 5 years ago

Rune doesn't solve grapheme clusters, and dealing with normalization would become unwieldy. A predicate would handle all cases.

bbartels commented 5 years ago

Isn't one of the main points within the Span ecosystem to be allocation free? I feel like bringing in a delegate would just go against that mentality. Wouldn't allowing to pass a ReadOnlySpan as a separator solve the graphmere cluster problem just as well, without an allocation?

stephentoub commented 5 years ago

Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan span)?

And turn split into an O(N^2) algorithm?

I think this should be kept relatively simple, making the common case easy.

scalablecory commented 5 years ago

Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan span)?

And turn split into an O(N^2) algorithm?

I think this should be kept relatively simple, making the common case easy.

I don't see how this would be any different than String.Split(string[])