Open stephentoub opened 1 year ago
Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Memory` |
Milestone: | 8.0.0 |
The proposal LGTM. I assume we are going to provide a vectorized implementation for the most frequent cases (Latin1
)?
I assume we are going to provide a vectorized implementation for the most frequent cases (Latin1)?
At least experiment with one. I think the methods would be worth adding even if not vectorized; I listed some examples of places code is doing this same operation, but there are many more... it's surprisingly common. That said, I think it'd be worth trying to vectorize it and see what the perf looks like, in particular when there's non-ASCII input. My initial thought was we'd basically vectorize the search for all the ASCII whitespace chars (0x9 through 0xD and 0x20) as well as anything > 0x7F, and if we find anything > 0x7F, then fall back to a scalar loop just using char.IsWhiteSpace (I don't know if there'd be an efficient way to vectorize the search for the Unicode whitespace characters). My main concern with this would be that if the input wasn't ASCII, we'd likely frequently be paying the overhead of an initial vector on top of then taking the scalar path.
IndexOfAnyValues
(at least for non-except overloads), but if we expect matches to commonly occur at the start of the input (e.g. Trim), we may not want to use it anyway due to the added overhead.Looks good as proposed.
When looking for consistency, we saw that the ReadOnlySpan<char>
methods on MemoryExtensions are not entirely consistent with whether there's also a Span<char>
overload. In the event that the Span<char>
members are desired, they're approved by pattern-matching.
Notably, MemoryExtensions.IsWhiteSpace does not have a Span<char>
overload.
namespace System;
public static partial class MemoryExtensions
{
public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
public static int LastIndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
public static int LastIndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
// Edit: APIs approved as part of #86528
public static bool ContainsAnyWhiteSpace(this ReadOnlySpan<char> span);
public static bool ContainsAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
}
As of #83992, the regex source generator will now emit its own variants of these, e.g. if it needs to search for \s
it'll emit:
/// <summary>Finds the next index of any character that matches a whitespace character.</summary>
internal static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span)
{
int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptWhiteSpace);
if ((uint)i < (uint)span.Length)
{
if (span[i] <= 0x7f)
{
return i;
}
do
{
if (char.IsWhiteSpace(span[i]))
{
return i;
}
i++;
}
while ((uint)i < (uint)span.Length);
}
return -1;
}
/// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f".</summary>
internal static readonly IndexOfAnyValues<char> s_asciiExceptWhiteSpace = IndexOfAnyValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f");
and for \S
:
/// <summary>Finds the next index of any character that matches any character other than a space character.</summary>
internal static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span)
{
int i = span.IndexOfAnyExcept(Utilities.s_asciiWhiteSpace);
if ((uint)i < (uint)span.Length)
{
if (span[i] <= 0x7f)
{
return i;
}
do
{
if (!char.IsWhiteSpace(span[i]))
{
return i;
}
i++;
}
while ((uint)i < (uint)span.Length);
}
return -1;
}
/// <summary>Supports searching for characters in or not in "\t\n\v\f\r ".</summary>
internal static readonly IndexOfAnyValues<char> s_asciiWhiteSpace = IndexOfAnyValues.Create("\t\n\v\f\r ");
(It's capable now of emitting many variations of these... it just chooses good names for the methods for these known sets, but will generate a similarly shaped method for any set it doesn't have a better strategy for.)
I suggest we put this issue on hold until we get some more experience with those implementations and how they fair. Once we're happy with them, we can effectively promote them up to MemoryExtensions per this issue, and the regex source generator / compiler can be changed to use those instead.
A nit - the names IndexOfAnyWhiteSpace
and IndexOfAnyExceptWhiteSpace
read very strangely to me. I know we're establishing a pattern around Any and AnyExcept, but is it possible to tweak these names to make them a little more human? Breaking the naming pattern in this instance might help improve readability and understandability.
I'm assuming the respective definitions are "return the index of the first whitespace char you see; or -1 if you find no whitespace chars or the string is empty" and "return the index of the first non-whitespace char you see; or -1 if the string contains only whitespace chars or is empty."
Some proposals:
IndexOfWhiteSpace
and IndexOfNonWhiteSpace
(remove the "any", since the term whitespace already implies a set of possible matching characters rather than a single possible char match)IndexOfAnyWhiteSpace
and IndexOfAnyNonWhiteSpace
(keep the "any"; change the "except" to "non")The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue.
The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue.
I don't disagree, but we have char.IsWhiteSpace
, Rune.IsWhiteSpace
, string.IsNullOrWhiteSpace
, MemoryExtensions.IsWhiteSpace
, ArgumentException.ThrowIfNullOrWhiteSpace
, FromBase64TransformMode.IgnoreWhiteSpaces
, ... outside of System.Xml and one enum value in regex, we've pretty consistently used "WhiteSpace" rather than "Whitespace" and shouldn't change now.
It was contentious enough that Krzysztof wrote it in the table of correct spellings in Framework Design Guidelines 2nd edition (or maybe even first).
Pascal: WhiteSpace, Camel: whiteSpace, Not: Whitespace
Which is in opposition to Timestamp:
Pascal: Timestamp, Camel: timestamp, Not: TimeStamp
Background and motivation
It turns out it's quite common to want to search for whitespace (
char.IsWhiteSpace
) or things other than whitespace (!char.IsWhiteSpace
). This is true not only in regex (\s
and\S
) (according to our nuget regex database there are ~13,000 occurrences of a regex that's simply\s
) but also in many open-coded loops, e.g.Etc. We should expose these as dedicated helpers, whether or not we're able to improve performance over a simple loop (we might be able to, for at least some kinds of input).
API Proposal
ReadOnlySpan<char>
and not alsoSpan<char>
, since the most common case by far is expected to be spans derived from strings. The existing MemoryExtensions.IsWhiteSpace is also only exposed forReadOnlySpan<char>
.API Usage
e.g. MemoryExtensions.IsWhiteSpace could be rewritten as simply:
Alternative Designs
If we want to expose these but don't want them to be so prominent, once https://github.com/dotnet/runtime/issues/68328 is implemented (assuming it sticks with the proposed design), this could instead be exposed as a static property on
IndexOfAnyValues
:in which case the same functionality could be achieved with:
The WhiteSpace property would cache a specialized concrete implementation that does what the proposed IndexOfAnyWhiteSpace would do.
Risks
No response