stephentoub commented 1 year ago

Background and motivation

It turns out it's quite common to want to search for whitespace (char.IsWhiteSpace) or things other than whitespace (!char.IsWhiteSpace). This is true not only in regex (\s and \S) (according to our nuget regex database there are ~13,000 occurrences of a regex that's simply \s) but also in many open-coded loops, e.g.

Etc. We should expose these as dedicated helpers, whether or not we're able to improve performance over a simple loop (we might be able to, for at least some kinds of input).

API Proposal

namespace System;

public static class MemoryExtensions
{
+   public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
+   public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
+   public static int LastIndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
+   public static int LastIndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
}

This is only proposed for ReadOnlySpan<char> and not also Span<char>, since the most common case by far is expected to be spans derived from strings. The existing MemoryExtensions.IsWhiteSpace is also only exposed for ReadOnlySpan<char>.

API Usage

e.g. MemoryExtensions.IsWhiteSpace could be rewritten as simply:

public static bool IsWhiteSpace(this ReadOnlySpan<char> span) => span.IndexOfAnyExceptWhiteSpace() < 0;

Alternative Designs

If we want to expose these but don't want them to be so prominent, once https://github.com/dotnet/runtime/issues/68328 is implemented (assuming it sticks with the proposed design), this could instead be exposed as a static property on IndexOfAnyValues:

public static class IndexOfAnyValues
{
+   public static IndexOfAnyValues<char> WhiteSpace { get; }
}

in which case the same functionality could be achieved with:

int wsIndex = span.IndexOfAny(IndexOfAnyValues.WhiteSpace); // or IndexOfAnyExcept

The WhiteSpace property would cache a specialized concrete implementation that does what the proposed IndexOfAnyWhiteSpace would do.

Risks

No response

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details

### Background and motivation It turns out it's quite common to want to search for whitespace (`char.IsWhiteSpace`) or things other than whitespace (`!char.IsWhiteSpace`). This is true not only in regex (`\s` and `\S`) (according to our [nuget regex database](https://github.com/dotnet/runtime-assets/blob/main/src/System.Text.RegularExpressions.TestData/Regex_RealWorldPatterns.json) there are ~13,000 occurrences of a regex that's simply `\s`) but also in many open-coded loops, e.g. - https://github.com/dotnet/sdk/blob/4a27d47440dd3eaefbdb31135ef8867f6758161f/src/Tasks/Microsoft.NET.Build.Tasks/LockFileExtensions.cs#L127-L138 - https://github.com/JimmyCushnie/JimmysUnityUtilities/blob/834059548b2b392d692ddcf28194692e3ae7b2c1/Scripts/Extensions/Csharp%20types/StringBuilderExtensions.cs#L68-L77 - https://github.com/cake-build/cake/blob/a0298c0b5f76f819f0cc0d16ac9ef55d8b26adf9/src/Cake.Core/Configuration/Parser/ConfigurationParser.cs#L107-L117 - https://github.com/rubberduck-vba/Rubberduck/blob/3a9b233cf6ab519773d188e77d09ee8d8111bf49/Rubberduck.Core/UI/Refactorings/AnnotateDeclaration/AnnotationArgumentViewModel.cs#L195-L198 - https://github.com/aspnet/Razor/blob/5439cfe540084edd673b7ed626f2ec9cf3f13b18/src/Microsoft.AspNetCore.Razor.Language/DirectiveTokenEditHandler.cs#L36-L47 - https://github.com/OmniSharp/omnisharp-roslyn/blob/3ae5c8acd7ea3f03ab9e24c28280a320c573721a/src/OmniSharp.Cake/Configuration/Parser/ConfigurationParser.cs#L94-L104 - https://github.com/Unity-Technologies/UnityCsReference/blob/332310b494c5416cdae6c1209dbae7cfa6847c8d/Editor/Mono/Scripting/Compilers/MicrosoftResponseFileParser.cs#L195-L202 - https://github.com/PowerShell/PowerShell/blob/a2ee05400f8cb4a44cd87742f95ebc2c3472e649/src/System.Management.Automation/engine/parser/DebugViewWriter.cs#L1197-L1205 - https://github.com/mono/mono/blob/e2c5f4b0ad1a6b21ca0735f0b35b8611d4ad87b3/mcs/class/referencesource/System.Core/Microsoft/Scripting/Ast/DebugViewWriter.cs#L1155-L1162 - https://github.com/stripe/stripe-dotnet/blob/42dbc8371c5a4ee36df8933e6d72c2c2e3e41d2e/src/Stripe.net/Infrastructure/StringUtils.cs#L9 - https://github.com/mono/mono/blob/e2c5f4b0ad1a6b21ca0735f0b35b8611d4ad87b3/mcs/class/referencesource/System.Web/UI/Util.cs#L995-L1002 - https://github.com/VahidN/EFSecondLevelCache.Core/blob/1de038417ba22c40d9ebe411b67c9e1a7e4ad838/src/EFSecondLevelCache.Core/EFQueryExpressionVisitor.cs#L883-L893 - https://github.com/InstaSharp/InstaSharp/blob/7ab2aad6bdef175dd63620bf39f74fcf02696898/src/InstaSharp/Extensions/StringExtensions.cs#L13-L18 - https://github.com/FirelyTeam/Fhir.Metrics/blob/dd574b76077280299fd7104c754481a2e143ca72/src/Fhir.Metrics/Utils/Parser.cs#L57 - https://github.com/baohaojun/beagrep/blob/b1d56ef14d1d663d43b6af198600caa21623d2f2/Util/StringFu.cs#L388-L394 - https://github.com/dotnet/runtime/blob/264d7391ec9f6e698051db0621c5e090d0ae4710/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Globalization.cs#L17-L25 - https://github.com/dotnet/runtime/blob/264d7391ec9f6e698051db0621c5e090d0ae4710/src/libraries/System.Linq.Expressions/src/System/Linq/Expressions/DebugViewWriter.cs#L1194-L1204 - https://github.com/dotnet/runtime/blob/264d7391ec9f6e698051db0621c5e090d0ae4710/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Trim.cs#L568-L589 - Etc. We should expose these as dedicated helpers, whether or not we're able to improve performance over a simple loop (we might be able to, for at least some kinds of input). ### API Proposal ```diff namespace System; public static class MemoryExtensions { + public static int IndexOfAnyWhiteSpace(this ReadOnlySpan span); + public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan span); } ``` - This is only proposed for `ReadOnlySpan` and not also `Span`, since the most common case by far is expected to be spans derived from strings. The existing MemoryExtensions.IsWhiteSpace is also only exposed for `ReadOnlySpan`. ### API Usage e.g. MemoryExtensions.IsWhiteSpace could be rewritten as simply: ```csharp public static bool IsWhiteSpace(this ReadOnlySpan span) => span.IndexOfAnyExceptWhiteSpace() < 0; ``` ### Alternative Designs If we want to expose these but don't want them to be so prominent, once https://github.com/dotnet/runtime/issues/68328 is implemented (assuming it sticks with the proposed design), this could instead be exposed as a static property on `IndexOfAnyValues`: ```diff public static class IndexOfAnyValues { + public static IndexOfAnyValues WhiteSpace { get; } } ``` in which case the same functionality could be achieved with: ```C# int wsIndex = span.IndexOfAny(IndexOfAnyValues.WhiteSpace); // or IndexOfAnyExcept ``` The WhiteSpace property would cache a specialized concrete implementation that does what the proposed IndexOfAnyWhiteSpace would do. ### Risks _No response_

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Memory`
Milestone:	8.0.0

adamsitnik commented 1 year ago

The proposal LGTM. I assume we are going to provide a vectorized implementation for the most frequent cases (Latin1)?

stephentoub commented 1 year ago

I assume we are going to provide a vectorized implementation for the most frequent cases (Latin1)?

At least experiment with one. I think the methods would be worth adding even if not vectorized; I listed some examples of places code is doing this same operation, but there are many more... it's surprisingly common. That said, I think it'd be worth trying to vectorize it and see what the perf looks like, in particular when there's non-ASCII input. My initial thought was we'd basically vectorize the search for all the ASCII whitespace chars (0x9 through 0xD and 0x20) as well as anything > 0x7F, and if we find anything > 0x7F, then fall back to a scalar loop just using char.IsWhiteSpace (I don't know if there'd be an efficient way to vectorize the search for the Unicode whitespace characters). My main concern with this would be that if the input wasn't ASCII, we'd likely frequently be paying the overhead of an initial vector on top of then taking the scalar path.

MihaZupan commented 1 year ago

80963 means that this could be vectorized when using `IndexOfAnyValues` (at least for non-except overloads), but if we expect matches to commonly occur at the start of the input (e.g. Trim), we may not want to use it anyway due to the added overhead.

bartonjs commented 1 year ago

Video

Looks good as proposed.

When looking for consistency, we saw that the ReadOnlySpan<char> methods on MemoryExtensions are not entirely consistent with whether there's also a Span<char> overload. In the event that the Span<char> members are desired, they're approved by pattern-matching.

Notably, MemoryExtensions.IsWhiteSpace does not have a Span<char> overload.

namespace System;

public static partial class MemoryExtensions
{
    public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
    public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
    public static int LastIndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
    public static int LastIndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);

    // Edit: APIs approved as part of #86528
    public static bool ContainsAnyWhiteSpace(this ReadOnlySpan<char> span);
    public static bool ContainsAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
}

stephentoub commented 1 year ago

As of #83992, the regex source generator will now emit its own variants of these, e.g. if it needs to search for \s it'll emit:

        /// <summary>Finds the next index of any character that matches a whitespace character.</summary>
        internal static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptWhiteSpace);
            if ((uint)i < (uint)span.Length)
            {
                if (span[i] <= 0x7f)
                {
                    return i;
                }

                do
                {
                    if (char.IsWhiteSpace(span[i]))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }

            return -1;
        }

        /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f".</summary>
        internal static readonly IndexOfAnyValues<char> s_asciiExceptWhiteSpace = IndexOfAnyValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f");

and for \S:

        /// <summary>Finds the next index of any character that matches any character other than a space character.</summary>
        internal static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiWhiteSpace);
            if ((uint)i < (uint)span.Length)
            {
                if (span[i] <= 0x7f)
                {
                    return i;
                }

                do
                {
                    if (!char.IsWhiteSpace(span[i]))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }

            return -1;
        }

        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r ".</summary>
        internal static readonly IndexOfAnyValues<char> s_asciiWhiteSpace = IndexOfAnyValues.Create("\t\n\v\f\r ");

(It's capable now of emitting many variations of these... it just chooses good names for the methods for these known sets, but will generate a similarly shaped method for any set it doesn't have a better strategy for.)

I suggest we put this issue on hold until we get some more experience with those implementations and how they fair. Once we're happy with them, we can effectively promote them up to MemoryExtensions per this issue, and the regex source generator / compiler can be changed to use those instead.

GrabYourPitchforks commented 1 year ago

A nit - the names IndexOfAnyWhiteSpace and IndexOfAnyExceptWhiteSpace read very strangely to me. I know we're establishing a pattern around Any and AnyExcept, but is it possible to tweak these names to make them a little more human? Breaking the naming pattern in this instance might help improve readability and understandability.

I'm assuming the respective definitions are "return the index of the first whitespace char you see; or -1 if you find no whitespace chars or the string is empty" and "return the index of the first non-whitespace char you see; or -1 if the string contains only whitespace chars or is empty."

Some proposals:

IndexOfWhiteSpace and IndexOfNonWhiteSpace (remove the "any", since the term whitespace already implies a set of possible matching characters rather than a single possible char match)
IndexOfAnyWhiteSpace and IndexOfAnyNonWhiteSpace (keep the "any"; change the "except" to "non")

danmoseley commented 8 months ago

The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue.

stephentoub commented 8 months ago

The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue.

I don't disagree, but we have char.IsWhiteSpace, Rune.IsWhiteSpace, string.IsNullOrWhiteSpace, MemoryExtensions.IsWhiteSpace, ArgumentException.ThrowIfNullOrWhiteSpace, FromBase64TransformMode.IgnoreWhiteSpaces, ... outside of System.Xml and one enum value in regex, we've pretty consistently used "WhiteSpace" rather than "Whitespace" and shouldn't change now.

bartonjs commented 8 months ago

It was contentious enough that Krzysztof wrote it in the table of correct spellings in Framework Design Guidelines 2nd edition (or maybe even first).

Pascal: WhiteSpace, Camel: whiteSpace, Not: Whitespace

Which is in opposition to Timestamp:

Pascal: Timestamp, Camel: timestamp, Not: TimeStamp

dotnet / runtime

[API Proposal]: MemoryExtensions.IndexOfAny{Except}WhiteSpace #77959

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

80963 means that this could be vectorized when using `IndexOfAnyValues` (at least for non-except overloads), but if we expect matches to commonly occur at the start of the input (e.g. Trim), we may not want to use it anyway due to the added overhead.

dotnet / runtime

[API Proposal]: MemoryExtensions.IndexOfAny{Except}WhiteSpace #77959

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

80963 means that this could be vectorized when using IndexOfAnyValues (at least for non-except overloads), but if we expect matches to commonly occur at the start of the input (e.g. Trim), we may not want to use it anyway due to the added overhead.

80963 means that this could be vectorized when using `IndexOfAnyValues` (at least for non-except overloads), but if we expect matches to commonly occur at the start of the input (e.g. Trim), we may not want to use it anyway due to the added overhead.