dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.35k stars 4.74k forks source link

Question on string.Split inconsistency between `count = 1` and `count > 1` #97920

Open Guiorgy opened 9 months ago

Guiorgy commented 9 months ago

When splitting a string with the StringSplitOptions.RemoveEmptyEntries option enabled, and with count parameter greater than 1, on the last split the current implementation tries to recursively removie any empty entries at the start of the string (after trimming if the option is set) until a non-empty entry is found. It then returns the rest of the string as the last split.

The comment in the source also indicates this behaviour:

// The next iteration of the loop will provide the final entry into the // results array. If needed, skip over all empty entries before that // point.

For example:

var str = "abaac";
var splits = str.Split('a', 2, StringSplitOptions.RemoveEmptyEntries);
// "aac" is reduced to "c", since splitting at the first 'a' would result in ["", "ac"], and empty entries are removed.
//         "ac" is similarly reduced to "c".
// final result (splits): ["b", "c"]

However, when count parameter is passed as 1, it always short-circuits, and only checks if the whole string is empty (after trimming if the option is set), and returns the whole string. It doesn't attempt to remove empty splits at the start of the string recursively as in the above case.

For example:

var str = "abaac";
var splits = str.Split('a', 1, StringSplitOptions.RemoveEmptyEntries);
// str is not empty so the whole string is returned. No further action is taken.
// final result (splits): ["abaac"]

The comment in the source indicates that this is per the documentation:

// Per the method's documentation, we'll short-circuit the search for separators. // But we still need to post-process the results based on the caller-provided flags.

Which according to learn.microsoft.com:

Remarks

If the string has already been split count - 1 times, but the end of the string has not been reached, then the last string in the returned array will contain this instance's remaining trailing substring, untouched.

If the last substring is the instance's remaining trailing substring, **untouched**, then why does the first case (count > 1) try to reduce it recursively? And if that is intentional, why does the second case (count = 1) not do the same?

This seems inconsistent.

I'd expect the results to either be both reduced recursively:

Or the last substring kept untouched in both cases:

My ultimate question is if this behaviour is completely expected and intended?

Edit: Another example to better illustrate:

var str = "aaaaaaaaaaaaaaaaaaaaaaaaaaaabaabaab";
var splits1 = str.Split('a', 1, StringSplitOptions.RemoveEmptyEntries);
var splits2 = str.Split('a', 2, StringSplitOptions.RemoveEmptyEntries);
var splits3 = str.Split('a', 3, StringSplitOptions.RemoveEmptyEntries);
// split1: ["aaaaaaaaaaaaaaaaaaaaaaaaaaaabaabaab"]
// split2: ["b", "baab"]
// split3: ["b", "b", "b"]

As you can see, changing count from 1 to 2 produces vastly different results.

ghost commented 9 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details
When splitting a string with the `StringSplitOptions.RemoveEmptyEntries` option enabled, and with `count` parameter greater than 1, on the last split the [current implementation](https://github.com/dotnet/runtime/blob/60cfe26dcb28a7fd26c09d082466a7ea650fc505/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs#L1807) continues to loop through splits removing any empty entries (after trimming if the option is set) until a non-empty entry is found. It then returns the rest of the string as the last split. For example: ```cs var str = "abaac"; var splits = str.Split('a', 2, StringSplitOptions.RemoveEmptyEntries); // "aac" is reduced to "c" // splits = ["b". "c"] ``` However, when `count` parameter is 1, it always short-circuits, [only checks if **the whole** string is empty](https://github.com/dotnet/runtime/blob/60cfe26dcb28a7fd26c09d082466a7ea650fc505/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs#L1710) (after trimming if the option is set), and returns the whole string. It doesn't attempt to remove empty splits at the start as in the above case. For example: ```cs var str = "abaac"; var splits = str.Split('a', 1, StringSplitOptions.RemoveEmptyEntries); // str is not empty so re whole string is returned // splits = ["abaac"] ``` In the example above I was expecting the result to be `"baac"`. My question is if this behaviour is completely expected and intended?
Author: Guiorgy
Assignees: -
Labels: `area-System.Runtime`, `untriaged`
Milestone: -