Closed ahsonkhan closed 3 months ago
Presumed usage would be like this?
var splitter = span.Split(',', StringSplitOptions.RemoveEmptyEntries);
while (splitter.TryMoveNext(out var slice))
{
//....
}
Yeah, looks good.
Would it be possible to implement the enumerable pattern, even if it's not possible to implement IEnumerable<T>
?
I think the interface would look something like:
public static SpanSplitEnumerable Split(this ReadOnlySpan<char> span, char seperator);
public ref struct SpanSplitEnumerable
{
public SpanSplitEnumerator GetEnumerator();
}
public ref struct SpanSplitEnumerator
{
public bool MoveNext();
public ReadOnlySpan<char> Current { get; }
}
That way, the usage could be:
foreach (var slice in span.Split(','))
{
…
}
If indexing is important, something like ref struct SpanSplitList
would probably work, but I think it couldn't guarantee zero allocations.
Sorry, didn't see this issue before posting https://github.com/dotnet/corefx/issues/21395#issuecomment-359802015.
As mentioned, I'd love to see something like Google Guava's CharMatcher
for this. It could be useful for other scenarios as well.
It would be great if the split worked also with Span
It would be great if the split worked also with Span
and returned ReadOnlyMemory<ReadOnlyMemory >, ReadOnlySpan<ReadOnlyMemory >
Clarifying the comment so the generic types show up:
It would be great if the split worked also with Span<byte>
and returned ReadOnlyMemory<ReadOnlyMemory<T>>
, ReadOnlySpan<ReadOnlyMemory<T>>
Simple implementation for SpanSplitEnumerable<T> Split<T>(this ReadOnlySpan<T> span, T separator)
interface where the return type is foreachable. Unit tests included.
https://gist.github.com/LordJZ/92b7decebe52178a445a0b82f63e585a
The sentinel is a "clever trick" to avoid having a boolean field, can be replaced with a backing field.
We needed features like this on top of ASP.NET's StringTokenizer
type in the aspnet/WebHooks repo. See that repo's TrimmingTokenizer
and have seen some interest (aspnet/WebHooks#324) in making that more broadly available.
It would be great to point customers to something in the BCL instead of (say) adding TrimmingTokenizer
features to StringTokenizer
. What is the expected timeframe for the "Future" milestone that would include a dotnet/corefx#26528 fix?
/cc @davidfowl
@ahsonkhan if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it?
cc @JeremyKuhne who has been doing thinking about low allocation string operations.
StringBuilder added an enumerator for the next version that we should consider when deciding on a pattern for enumerating spans. https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Text/StringBuilder.cs#L587
if this proposal makes sense to you, perhaps you could help shepherd it to review, as it seems there would be volunteers to implement it?
Let me make sure the API shape is clear. Incorporating the recent feedback (to get foreach support, make it generic), here is the API:
public static SpanSplitEnumerable<T> Split(this ReadOnlySpan<T> span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public ref struct SpanSplitEnumerable<T> where T : IEquatable<T>
{
public SpanSplitEnumerator<T> GetEnumerator();
}
public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
public bool MoveNext();
public ReadOnlySpan<T> Current { get; }
}
@KrzysztofCwalina this is the enumerating spans issue we discussed yesterday.
Does StringSplitOptions
really make sense for a generic T
?
Does
StringSplitOptions
really make sense for a genericT
?
I suppose a SpanSplitOptions
which would mimic StringSplitOptions
makes more sense here.
Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types?
@KrzysztofCwalina I think it makes the API cleaner and easier to understand. The foreach
pattern requires an enumerable and an enumerator, and that's exactly what the proposed API provides.
It there any advantage in combining them, apart from decreasing the size of the API surface?
We don't think we want these APIs:
String.Split
If you care about allocations, then you want a different API. And if you don't care about allocations, well, then you can just use String.Split
.
So what API would we like to see? It could be a struct-based enumerator that allows the consumer to foreach
the individual spans without allocations. Alternatively (or additionally) we could have a method that allows the consumer to pass in a buffer with the locations, which, for the most part, could be stackalloc
ated.
We don't think we want these APIs:
1. They allocate
I readily concur with this. If I don't care about allocations, I'll simply be using string.Split
. If I care enough to want to avoid allocating substrings, then obviously I would like to avoid any allocations whatsoever.
Suggest ReadOnlySpan<T>.Split
returns ReadOnlySpan<ReadOnlySpan<T>>
on the stack so there is no more allocation.
If I want to use foreach
, which allocations, then I can still enumerate over the ReadOnlySpan<ReadOnlySpan<T>>
. Otherwise, I'll use a simple for
loop to have zero allocation.
Can you use ReadOnlySpan<T>
as a generic argument?
@schungx
Suggest
ReadOnlySpan<T>.Split
returnsReadOnlySpan<ReadOnlySpan<T>>
on the stack so there is no more allocation.
Split
can't return a Span
that was stack allocated by itself. It could take that Span
as an additional parameter, but then you have to somehow handle the case where that Span
is not big enough. I believe this is what @terrajobst meant by "we could have a method that allows the consumer to pass in a buffer with the locations".
@terrajobst, the API that @svick proposed above does not seem to allocate. Am I missing something? Or are you saying that we want both convenience and no allocations?
@svick, I wonder how important it is for this enumerator to be easy to understand. The classic enumerator pattern has two types for many reasons that are less and less applicable once you deal with by ref structs that cannot implement enumerator interfaces and get copied when passed around. I do agree we should discuss pros and cons and maybe indeed it's better to have two types, but I wanted to open the discussion as it seems such an overkill to add two by ref types just so we can split.
@KrzysztofCwalina I agree that my argument of "this version of the API is slightly easier to understand" is fairly weak in this case. But in my opinion, the argument of "this version of the API is slightly smaller" is even weaker, which is why I prefer to have two types.
But ultimately it doesn't make that much of a difference, as long as I'll be able to have foreach (var slice in span.Split(','))
without allocations, I'll be happy.
@svick I'm suspect a normal foreach
will always allocate because the current state must be kept somewhere, even if you somehow come up with value-based iterators...
And what is the purpose of this again? That's to prevent allocations in parsing code in tight loops, where all we're doing is manipulating streams of text without ever changing them.
@schungx There's nothing really new to come up with. For example, foreach
over a List<T>
already does not allocate on the heap, because the current state is kept in the enumerator struct
on the stack. A very similar approach can be used here.
- They allocate
It could be a struct-based enumerator that allows the consumer to
foreach
the individual spans without allocations.Should the enumerator and the enumerable be combined? i.e. what's the reason to have two types?
The APIs, as suggested here, don't allocate and are essentially what was recommended. Maybe it was missed since it wasn't at the top. Will copy it over to the top post.
Making it into a single type:
public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
public SpanSplitEnumerator GetEnumerator() { return this; }
public bool MoveNext();
public ReadOnlySpan<T> Current { get; }
}
char
, so leaving it as T
is fine.Span<T>
, we should make sure we can
SpanSplitEnumerator
to ReadOnlySpanSplitEnumerator
to MemoryExtensions
StringSplitOptions
might feel odd, but (1) we've never extended it and (2) and the operations are about how the split is performed, which applies to this API tochar[]
(matches any) we'd add a new method called SplitAny
so that we can also have the equivalent of string
(matches the sequence)Are we going to have an overload that takes ReadOnlySpan
I tried implementing the APIs above, and additionally a two more split methods in my personal repository. The types and methods themselves can be found in ConsoleApp4
(lol) folder, and the tests can be found in SplitTests
folder.
public static ReadOnlySpanSplitEnumerator<T> Split<T>(this ReadOnlySpan<T> span, T delimiter,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitBySequenceEnumerator<T> Split<T>(this ReadOnlySpan<T> span,
ReadOnlySpan<T> delimiter, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static ReadOnlySpanSplitByAnyEnumerator<T> SplitByAny<T>(this ReadOnlySpan<T> span,
ReadOnlySpan<T> delimiters, StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
The first one replaces string.Split(char)
, the second one replaces string.Split(string)
, and the third replaces string.Split(char[])
. The last two are not really included in the original discussion (and would take a bit of time to go through approval process even if it does), but I personally think it is worth including to have the complementary alternatives as well.
It's not a great code perhaps, but I'd love to open a PR for this issue (with the code above). Although the last two methods are not approved, at least I can open one for the regular T
version.
A slightly different take on this. Considering that Range
is now available, and the non-zero cost of Slice
on potentially-many results, what if the enumerable
returned Range
s instead of sliced RoS
. In other words, a list of delineations. The consumer could then decide when/if to Slice(range)
foreach (var range in span.Split(','))
{
var x = span.Slice(range);
}
Perhaps that's too abstract for a public api.
Did some experimenting with this as well. Not sure if it's an optimal solution though. https://github.com/bbartels/coreclr/blob/master/src/System.Private.CoreLib/shared/System/MemoryExtensions.Split.cs
@grant-d I like that API
That seems reasonable to me.
Only problem is that it would conflict with the approved api, as only the return type differs.
Correct; it would need to go back into api triage
Returning ranges has some real advantages, namely we can operate on a ReadOnlySpan<T>
but then apply the results to other types, in particular Span<T>
, Memory<T>
, and ReadOnlyMemory<T>
, e.g.
foreach (Range r in memory.Span.Split(':')) Process(memory[r]);
Correct; it would need to go back into api triage
To clarify, can you share the proposed range based API shape and then we can reconcile it/review it? I can update the original post if you provide the precise shape, and we can re-review.
I suppose it's just substituting Range for ReadOnlySpan\<T> in the Current property.
I don't think overloading Slice would work as stephen suggested (Probably just a typo). A ReadOnlySpan
So something like:
public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
public SpanSplitEnumerator<T> GetEnumerator() { return this; }
public bool MoveNext();
public Range Current { get; }
}
I don't think overloading Slice would work as stephen suggested (Probably just a typo)
I didn't suggest overloading Slice. If a type has a Slice(int, int)
method, the C# 8 compiler now supports passing a Range
as an indexer on the type and the compiler expands it into a call to the Slice(int, int)
method.
I didn't suggest overloading Slice.
memory.Span.Slice(':')
I believe this is what's causing the confusion.
Returning ranges has some real advantages, namely we can operate on a
ReadOnlySpan<T>
but then apply the results to other types, in particularSpan<T>
,Memory<T>
, andReadOnlyMemory<T>
, e.g.foreach (Range r in memory.Span.Slice(':')) Process(memory[r]);
I meant in the foreach, you used memory.Span.Slice(':').
Ah, sorry, I meant Split, not Slice.
Maybe it would make sense to add another enumerator, which would allow for splitting by a sequence. Just like string.Split("??") is possible.
public static SpanSplitEnumerator<T> Split(this ReadOnlySpan<T> span, T seperator,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public static SpanSplitSequenceEnumerator<T> Split(this ReadOnlySpan<T> span, ReadOnlySpan<T> seperators,
StringSplitOptions options = StringSplitOptions.None) where T : IEquatable<T> {}
public ref struct SpanSplitEnumerator<T> where T : IEquatable<T>
{
public SpanSplitEnumerator<T> GetEnumerator() { return this; }
public bool MoveNext();
public Range Current { get; }
}
public ref struct SpanSplitSequenceEnumerator<T> where T : IEquatable<T>
{
public SpanSplitSequenceEnumerator<T> GetEnumerator() { return this; }
public bool MoveNext();
public Range Current { get; }
}
Could also be merged into one enumerator that just holds an ReadOnlySpan
A split api that returns ranges is pure genius @grant-d. That would be a really nice api to aim for.
Sad to see we haven't made progress on this!
Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan<T> span)
?
This will enable splitting on surrogate pairs etc.
@scalablecory, how about Rune
? (and ReadOnlySpan of Rune)
Rune
doesn't solve grapheme clusters, and dealing with normalization would become unwieldy. A predicate would handle all cases.
Isn't one of the main points within the Span
Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan
span)?
And turn split into an O(N^2) algorithm?
I think this should be kept relatively simple, making the common case easy.
Can I suggest allowing to pass a delegate bool IsSeparatorPredicate(ReadOnlySpan span)?
And turn split into an O(N^2) algorithm?
I think this should be kept relatively simple, making the common case easy.
I don't see how this would be any different than String.Split(string[])
Edited by @stephentoub on 6/26/2024:
Older proposals:
Previously approved API Proposal
Split off from https://github.com/dotnet/corefx/issues/21395#issuecomment-359342832
From @Joe4evr on January 21, 2018 23:16
From @ahsonkhan on January 22, 2018 01:36
From @Joe4evr on January 22, 2018 08:35
From @stephentoub on January 22, 2018 08:41
cc @KrzysztofCwalina, @stephentoub, @Joe4evr