dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Efficiently copy a string with a set of chars removed (and/or kept) #101235

Open Joe4evr opened 7 months ago

Joe4evr commented 7 months ago

Background and motivation

There are times when you get some string and want to normalize/sanitize it in a way that certain characters are removed (or conversely, the result string consists only of some set of characters).

Additionally/Alternatively, a more secure method would be one where the user specifies an allow-list of chars. This would prevent the hassle of trying to exclude everything except the (usually smaller) set you want.

The current alternative to these is hand-writing an appropriate loop, but that is more involved and potentially error-prone (especially if you want these operations vectorized).
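To make the "hand-written loop" concrete, this is roughly what it tends to look like today. It is only a minimal sketch: ManualFilter and RemoveCharsManually are hypothetical names used for illustration, and the loop is deliberately scalar (not vectorized).

```csharp
using System;
using System.Text;

public static class ManualFilter
{
    // Hypothetical helper: copies 'source' while skipping every char
    // that appears in 'toRemove'.
    public static string RemoveCharsManually(string source, string toRemove)
    {
        var sb = new StringBuilder(source.Length);
        foreach (char c in source)
        {
            if (!toRemove.Contains(c))
            {
                sb.Append(c);
            }
        }

        // If nothing was skipped, hand back the original instance
        // instead of allocating a duplicate.
        return sb.Length == source.Length ? source : sb.ToString();
    }
}
```

Even in this small form there are details that are easy to get wrong or forget, such as returning the original instance when nothing was removed.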

cc: @MihaZupan had some insight about this on Discord

API Proposal

namespace System;

public partial class String
{
    // Returns a string *without* the specified set of chars,
    // equivalent to multiple 'Replace(oneCharString, String.Empty)' calls.
    // If it had none of the specified chars to begin with,
    // can return 'this' to avoid allocating a duplicate.
    public string RemoveChars(params ReadOnlySpan<char> chars);

    // Returns a string consisting *only* of the specified set of chars,
    // equivalent to 'new string(str.Where(c => set.Contains(c)).ToArray())'.
    // If it already consisted of only the specified chars to begin with,
    // can return 'this' to avoid allocating a duplicate.
    public string KeepChars(params ReadOnlySpan<char> chars);

    // Point of discussion: Overloads taking in SearchValues<char>
    // Nice-to-have? Must-have? Outright replace the ROS<char> versions?
    public string RemoveChars(SearchValues<char> values);
    public string KeepChars(SearchValues<char> values);
}

API Usage

string someString = "123-456";
string noOdds = someString.RemoveChars("13579"); // returns "2-46"
string onlyDigits = someString.KeepChars("0123456789"); // returns "123456"
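For illustration only, here is a rough sketch of how the SearchValues<char> overloads could behave if written today as extension methods. StringFilterSketch and its private Filter helper are names made up for this sketch and are not part of the proposal; SearchValues<char>, IndexOfAny, and IndexOfAnyExcept are existing .NET 8 APIs. An implementation inside String itself could presumably do better than the scalar copy loop used here.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch, not the proposed implementation.
public static class StringFilterSketch
{
    public static string RemoveChars(this string source, SearchValues<char> values)
        => Filter(source, values, keep: false);

    public static string KeepChars(this string source, SearchValues<char> values)
        => Filter(source, values, keep: true);

    private static string Filter(string source, SearchValues<char> values, bool keep)
    {
        ReadOnlySpan<char> span = source;

        // Search for the first char that has to be dropped.
        int first = keep ? span.IndexOfAnyExcept(values) : span.IndexOfAny(values);
        if (first < 0)
        {
            return source; // nothing to drop, so no new allocation
        }

        char[] buffer = ArrayPool<char>.Shared.Rent(source.Length);
        try
        {
            // Everything before 'first' is known to survive.
            span.Slice(0, first).CopyTo(buffer);
            int written = first;

            // Scalar pass over the remainder; the char at 'first' is skipped.
            for (int i = first + 1; i < span.Length; i++)
            {
                char c = span[i];
                if (values.Contains(c) == keep)
                {
                    buffer[written++] = c;
                }
            }

            return new string(buffer, 0, written);
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```

With this sketch, "123-456".RemoveChars(SearchValues.Create("13579")) gives "2-46", and "123-456".KeepChars(SearchValues.Create("0123456789")) gives "123456", matching the usage above.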

Alternative Designs

No response

Risks

A slight increase in String's method table? They could be defined as extension methods if that's a genuine concern.

Additional notes:

I can imagine that some parts higher up the stack that also deal with strings could create their own extension "overloads" taking in their own complex type for convenience. This would be up to the discretion of the relevant area owner, but for a concrete example: System.Text.Encodings.Web could add something like

public static string RemoveChars(this string source, params ReadOnlySpan<UnicodeRange> ranges);

so that users can piggyback off of the UnicodeRange type if their project already references it anyway.

Charlieface commented 5 months ago

Func<char, bool> predicates would also make sense:

    public string RemoveChars(Func<char, bool> predicate);

    public string KeepChars(Func<char, bool> predicate);

This would allow you to do

someString = someString.KeepChars(char.IsLetterOrDigit);
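As a rough idea of what the predicate-based variants might do, here is a minimal sketch written as extension methods. PredicateFilterSketch is a hypothetical name, and this is an assumption about the intended behavior rather than a proposed implementation.

```csharp
using System;
using System.Text;

// Hypothetical sketch of the predicate-based overloads.
public static class PredicateFilterSketch
{
    // Keeps only the chars for which 'predicate' returns true.
    public static string KeepChars(this string source, Func<char, bool> predicate)
    {
        var sb = new StringBuilder(source.Length);
        foreach (char c in source)
        {
            if (predicate(c))
            {
                sb.Append(c);
            }
        }

        // Reuse the original instance when every char was kept.
        return sb.Length == source.Length ? source : sb.ToString();
    }

    // Removes the chars for which 'predicate' returns true.
    public static string RemoveChars(this string source, Func<char, bool> predicate)
        => source.KeepChars(c => !predicate(c));
}
```

A delegate call per char is unlikely to vectorize, so this shape trades some of the performance motivation of the proposal for flexibility.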

julealgon commented 5 months ago

  • Currently, the "easy" option for this is to call str.Replace a bunch of times, each with a 1-length string to be replaced with String.Empty, but this allocates n-1 intermediate strings.

Would it be possible to optimize the original Replace method, instead of introducing new APIs?

Clockwork-Muse commented 5 months ago

The original methods are optimized; it's just that because string must be immutable, the result of each call has to be a complete string. That is where the problem is: you end up with a copy for each "step". The only way around it is to create a new method/overload that's equivalent to the signature proposed here either way.
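A small illustration of the "copy per step" point, using only the existing String.Replace(string, string) API. ChainedReplaceDemo and StripSeparators are hypothetical names for this sketch.

```csharp
using System;

public static class ChainedReplaceDemo
{
    // Removes '-', '_' and ' ' with three separate Replace calls.
    public static string StripSeparators(string input)
    {
        // Each call that finds a match has to build a complete new string.
        string noDashes = input.Replace("-", string.Empty);
        string noUnderscores = noDashes.Replace("_", string.Empty);

        // Only this last string is wanted; the two above are intermediates.
        return noUnderscores.Replace(" ", string.Empty);
    }
}
```

For an input such as "a-b_c d", the two intermediate strings exist only to feed the next call, which is the overhead a single-pass API would avoid.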

julealgon commented 5 months ago

Sorry, I didn't express myself well there. I wanted "without changing the API" to mean "without introducing new method names" on this one. For example, keeping the "Replace" name, but providing overloads to perform multiple replacements at once since the issue appears to be that each call today is limited to a single replacement.