dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

[API Proposal]: Add IndexOfAny string[] overloads #98106

Closed CypherPotato closed 3 months ago

CypherPotato commented 8 months ago

Background and motivation

Today in .NET 8, there is String.IndexOfAny(char[]), which only accepts char[] to look for the first incidence of whatever appears first. It would be interesting if there was a String.IndexOfAny(string[]) to find other strings in the same string.

API Proposal

Implementation about it here: https://github.com/dotnet/runtime/commit/df92b94e5ca1e33f9ae5196109a5974f8a68a344

API Usage

string testString = "foo bar daz kar";

testString.ContainsAny("yez", "daz"); // true since "daz" is present
testString.IndexOfAny("jaz", "bar"); // returns 4

Alternative Designs

ContainsAny might be an overload of Contains, but I still think it makes sense to have a dedicated method for it.

Risks

The methods are new and do not change the existing .NET API and would only introduce something new. The risks would be if there are extensions named ContainsAny.

ghost commented 8 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation Today in .NET 8, there is String.IndexOfAny(char[]), which only accepts char[] to look for the first incidence of whatever appears first. It would be interesting if there was a String.IndexOfAny(string[]) to find other strings in the same string. ### API Proposal Implementation about it here: https://github.com/dotnet/runtime/commit/df92b94e5ca1e33f9ae5196109a5974f8a68a344 ### API Usage ```csharp string testString = "foo bar daz kar"; testString.ContainsAny("yez", "daz"); // true since "daz" is present testString.IndexOfAny("jaz", "bar"); // returns 4 ``` ### Alternative Designs ContainsAny might be an overload of Contains, but I still think it makes sense to have a dedicated method for it. ### Risks The methods are new and do not change the existing .NET API and would only introduce something new. The risks would be if there are extensions named `ContainsAny`.
Author: CypherPotato
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime`, `untriaged`
Milestone: -
vcsjones commented 8 months ago

Related to this is #85573 which introduces SearchValues for string and is done for .NET 9. For example:

using System;
using System.Buffers;

string testString = "foo bar daz kar";
SearchValues<string> search = SearchValues.Create(["yez", "daz"], StringComparison.Ordinal);
Console.WriteLine(testString.AsSpan().ContainsAny(search));
Console.WriteLine(testString.AsSpan().IndexOfAny(search));
CypherPotato commented 8 months ago

Would a shortcut to access this new functionality then make sense?

Something like this maybe:

public int IndexOfAny(string[] anyOf)
{
    SearchValues<string> search = System.Buffers.SearchValues.Create(anyOf, StringComparison.Ordinal);
    return search.AsSpan().IndexOfAny(this);
}

public int ContainsAny(string[] anyOf)
{
    return IndexOfAny(anyOf) >= 0;
}
stephentoub commented 8 months ago

That would be incredibly expensive. The point of SearchValues is to do computation once to then optimize all subsequent searches using it. The suggested implementation would incur that non-trivial overhead on every use.

The linked commit (https://github.com/dotnet/runtime/commit/df92b94e5ca1e33f9ae5196109a5974f8a68a344) would also be problematic for implementation, as it would vary in semantics from other IndexOfAny's that return the next occurrence: that proposed implementation which searches the whole input for each string one at a time might not report the next occurrence of any of them but instead a later occurrence of whichever string was searched for first.

CypherPotato commented 8 months ago

Please, don't consider the codes here as final implementations but ideas for how it could be possible to make this API a real thing.

Optimizations and improvements should be made after the idea is approved, if so. These codes were not designed for production code but only to express the idea.

As much as SearchValues ​​exists, I still see that having an implementation within System.String would make sense.

stephentoub commented 8 months ago

My point was more that the efficient thing to do here is to use SearchValues. The proposed API wouldn't be close to as efficient and would only be appropriate if the set of strings being searched for was dynamic, with the set changing effectively on every call. Can you describe your scenario that fits that case?

CypherPotato commented 8 months ago

I personally didn't even know about SearchValues. When I eventually needed to check multiple incidences of several strings in a single string, I did something similar to what I presented here. I had to check the string piece by piece until I got the final result.

I noticed that System.String.IndexOfAny is a thing now, but I don't understand why IndexOf has overloads for string and char but IndexOfAny only has overloads for char. This issue specifically talks about it.

Considering the example I mentioned above (not the commit one) it would only be used once, instead of creating an instance of SearchBuffer<T>, it would save lines with IndexOfAny(string[]).

If it's something that needs to be tested over and over again, it would save on optimization time by creating an SearchBuffer instance, but even for eventual uses, which I don't need reusing the specified buffer, I see that would make sense using it on string.IndexOfAny(string[]).

stephentoub commented 8 months ago

I personally didn't even know about SearchValues.

SearchValues<T> was added in .NET 8. Support for SearchValues<string> is new in .NET 9.

I don't understand why IndexOf has overloads for string and char but IndexOfAny only has overloads for char

Because the need to search for multiple substrings is much more rare and much more challenging to do efficiently.

Considering the example I mentioned above (not the commit one) it would only be used once, instead of creating an instance of SearchBuffer, it would save lines with IndexOfAny(string[]).

If you want to use it only once and you really don't care about efficiency, you can do:

source.IndexOfAny(SearchValues.Create(["one", "two"]));

We would not expose an API implemented like that, though, as it would be performance pit of failure.