Fixed and optimized split methods

Guiorgy commented 5 months ago

I also replaced the custom InvalidCountExceedingBehaviourException with the builtin ArgumentException since that is what string.Split throws when StringSplitOptions is invalid. If that is unacceptable, the commit can be dropped.

I also renamed the CountExceedingBehaviour options to, what I think, better explains them. Also reworded the summary comments. Again, if you disagree with that, you may drop that commit.

There's 1 more gatcha in string.Split: If separators (delimiters in SplitAny in our case) is an empty collection, the function automatically assumes that we wish to split the string with white spaces:

If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters

Sigh... So there are several ways to handle this:

Ignore it, and throw an exception if delimiters is empty.
Do what .NET does, or in other words, go through the source span first, get all whitespace characters present in a remporary array (which will allocate -_-), and then go through the source span once more to split it.
Write an alternative implementation where instead of checking if a character from source span is part of delimiters, we check if the character is a whitespace character using char.IsWhiteSpace(). Though I am assuming this is slower than the option above, otherwise why would .NET do it differently.
Any other thoughts?

dragon7307 commented 5 months ago

Looks good at first glance. I'll review, slightly adapt the style to better match the project, and merge it. I originally created, that custom exception to avoid having to throw two different exceptions for before and after .Net 7, but if string.Split uses ArgumentException, then this sounds good to me. I'll look into the behaviour for null or empty delimiters as well, though I would prefer not to throw an exception. We'll see...

dragon7307 commented 5 months ago

The methods look good to me, but I would still prefer not to merge the yet, since I'd like to conduct some unit tests to compare any any potential differences. Maybe waiting until the test suite is finished??

Guiorgy commented 5 months ago

Sure. The only failing tests is with an empty delimiters span, since we haven't implemented that yet. Though, as you said, the tests are a bit lacking still.

Also, fyi SpanExtensions still has StringSplitOptions and count reversed : Split(this Span<char> source, char delimiter, StringSplitOptions options, int count.

dragon7307 commented 5 months ago

Also, I have looked a bit more to the tests you've written and I believe, we should probably merge them into a feature branch of dev with the src files from dev and the tests files from your pr. Just so we have them somewhat up to date.

Guiorgy commented 5 months ago

I have a local branch with uptodate commits from the dev and tests branch for testing. Do you want me to merge the dev branch into tests, or make a separate branch for that?

dragon7307 commented 5 months ago

I have a branch in this repo, called "feature-unit-testing", in which I have adapted your tests branch as of yesterday to better suit the project's style and have brought it up to date with dev. Just take a look at the changes i made. if you agree with them, we have two options: we either merge any eventual changes you have locally into this branch and continue working on this branch, which is based of off dev, into which it will eventually be merged, or you push any changes you have made to tests to your fork and we then merge my updated branch with yours into your fork. If you don't agree with the changes I made, then yes, do as you propose, and merge dev into a branch created from tests.

dragon7307 commented 5 months ago

Also, unless I have gotten something dramatically wrong, we could refactor the monster that is GetAllStringSplitOptions to the following:

public static StringSplitOptions[] GetAllStringSplitOptions()
{
    return (StringSplitOptions[])Enum.GetValues(typeof(StringSplitOptions)); 
}

Guiorgy commented 5 months ago

I'll look into them, and probably just merge them into the tests branch. It was going to be merged with dev eventually anyways.

As for the GetAllStringSplitOptions method, the problem here is that StringSplitOptions is an enum marked with the Falgs attribute, which means you can combine options with the or (|) operator.

Try running this:

var options = (StringSplitOptions[])Enum.GetValues(typeof(StringSplitOptions));
Console.WriteLine(string.Join(", ", options));

var options2 = GetAllStringSplitOptions().Select(o => o.ToString().Replace(",", " |"));
Console.WriteLine(string.Join(", ", options2));

The output should look like this:

None, RemoveEmptyEntries, TrimEntries
None, RemoveEmptyEntries, TrimEntries, RemoveEmptyEntries | TrimEntries

While Enum.GetValues does get all defined enum values, it does not return their combinations when the enum is marked as flags.

The one I wrote was the simplest I could find, since the alternative would be trying to generate all permutations of the enums returned by Enum.GetValues, and remove duplicates (since Enum.GetValues also returns None, and None | TrimEntries is the same as TrimEntries).

If you have an alternative, don't hesitate to suggest.

dragon7307 commented 5 months ago

How about just hard-coding the values into an array?? They are unlikely to soon be changed and the current method seems overkill for such a simple purpose for a test suite.

Guiorgy commented 5 months ago

An alternative would be to manually hardcode all possible values, but you'd have to remember to update them if a new one was ever added, and if a lot of values were added, the number of all possible combinations would explode in size. I wanted to make something that (hopefully) doesn't need to ever be modified again.

The only reason why that code wouldn't work anymore (as stated in the comment next to the Debug.Assert), is if a new option was added that for some reason skipped a bit (for example, TrimEntries = 1 /* binary 001 */, SomeNewOption = 4 /* binary 100 */), which shouldn't happen (fingers crossed).

If you prefer hardcoding those values, I'm fine with changing that.

dragon7307 commented 5 months ago

Then let's leave it that way

Guiorgy commented 4 months ago

Run into another edge case when expanding the fuzzing with parameterized ([Theory]) testing.

If the separator span is empty in Split(source, delimiter) the enumeration never stops and keeps returning empty spans forever. Yikes.

Looking at string.Split(string, string), if the separator string is empty, it assumes count = 1 and goes from there. I'll do the same, unless you have other thoughts?

if (separator!.Length == 0)
{
    count = 1;
    goto ShortCircuit;
}
else
{
    return SplitInternal(separator, count, options);
}

Edit: Wait, I tested that against the old implementation (before this PR), give me a sec xd Edit2: It happens in this PR too 😬 Edit3: Fixed

dragon7307 commented 4 months ago

The current implementation tests for the delimiterLength in every iteration of Move, This is unnecessary, as delimiter.Lengthdoesn't change. The check should probably be in the constructor instead.

Guiorgy commented 4 months ago

I know that, and that's what I do in the ones that have count a a parameter. But I don't really see a way to do it from the constructor in the ones that don't.

When delimiter is empty, we need to shorcircuit, return the whole span and exit. The if statement that does that, checks for delimiterIndex which changes on every iteration, so there's no way to influence that from the constuctor. The only way I see is to either return an alternative IEnumerator, or add an additional check in the MoveNext() method.

Edit: If IndexOf returned -1 when delimiter was empty, that would also work.

The ones that have count already do check if CurrentCount <= 1, which can be iffluenced from the constructor, and is exactly what I did.

If you can come up with something, I am all ears.

dragon7307 commented 4 months ago

One could at least do the comparision once in the constructor and the have a field like DelimiterIsEmpty, which is set in the constructor to delimiter.IsEmpty and then checked...

Guiorgy commented 4 months ago

That is already basically what I do though?

DelimiterLength = Delimiter.Length; // cached in constructor
...
if(delimiterIndex == -1 || DelimiterLength == 0)

The only difference is that instead of a boolead, I am comparing an int to 0, but considering booleans are basically an int that is compared to 0 to check state, there's no real difference.

dragon7307 commented 4 months ago

Yes, but in theory your version should trigger an invocation of the comparision operator and thereby the Equals method. And sure, this is something that the compiler will optimize away, but since we cannot predict which compiler the consumer of this library will use, it is better to stay stafe, because maybe the official compiler optimizes it, but maybe someone uses mono, and maybe mono doesn't optimize this, or IL2CPP doesn't. Also, a boolean is in C#, unlike in say C, the size of a byte not of an int.

As you can see here, in jit asm a dword (4 bytes) is used for the comparison for the int . On the other hand, here with the boolean approach, a byte is used. That is a difference in memory allocation. Now, whether this will actually result in a measurable performance difference is debattable, but as a matter of fact, they produce different jit asm and hence work differently. The asm is in favour of using the boolean.

Here are the screenshots top to bottom, on the first is the approach with (DelimiterLength) and the second DelimiterIsEmpty:

And yes that is micro optimization, but so is the mere concept of Spans in C#.
Also, the example uses an int instead of generics, as generics can't be jit compiled by sharplab in c# and dotnet.

Guiorgy commented 4 months ago

Looking at the some of the changes you made recently on dev, you renamed CountExceedingBehavior enum entries.

I had also done that in this PR, as I had mentioned previously, and was hoping for you input on that.

Personally, instead of AppendLastElements or AppendRemainingElements I prefer IncludeRemainingElements, so include instead of append.

The way string.Split words it is:

[...] the last string in the returned array will contain this instance's remaining trailing substring, [...]

So, since contain and include are synonyms, I felt like it was appropriate.

As for CutLastElements and CutRemainingElements, I went with DropRemainingElements, since it's an antonym to include.

However, this isn't really that important, so if you say so, I'll update to the names you came up with.

dragon7307 commented 4 months ago

I see; I agree it isn't that important now. I get the idea of the antonyms; however I felt and do still feel that 'append' is more descriptive as to where the remaining elements are actually included (they're appended to the last element). The downside of this approach is that the antonym is not necessarilly a good fit for the contrary behaviour. There is no need to Update the pull request right now as the differences are growing so complex that I will need to merge it into a helper branch and then resolve many conflicts.

The final decision on the name is going to be made before the release of version 1.3.0, because I do not want to introduce a breaking change later on.

Guiorgy commented 4 months ago

There's 1 more thing missing in the current implementation. All the Span Split Enumerators work and return ReadOnlySpan, even if the source is a Span<T>. This means that the following is impossible:

char[] source = "Lorem ipsum dolor sit amet...";

// Split source by space and capitalize the first letter of every word
foreach (var word in source.AsSpan().Split(' '))
{
    if (word.Length == 0) continue;

    word[0] = word[0].ToUpper(); // impossible, since word is a ReadOnlySpan<char>
}

To fix this, we'd need to duplicate every split enumerable and make it work with normal mutable spans, and update all extension methods and tests. We could do that now, or maybe try to make a source generator first to not have to do all that manually?

dragon7307 commented 4 months ago

That seems like a job for the to be source generator

Guiorgy commented 4 months ago

I also added a check for validity of StringSplitOptions, which throws an exception when invalid. This is to copy what string.Split does.

Guiorgy commented 4 months ago

We should decide what are the final names for CountExceedingBehaviour, so I can fix the 2 PRs.

dragon7307 commented 4 months ago

I am pretty set on AppendRemainingElements and CutRemainingElements.

Guiorgy commented 4 months ago

I am pretty set on AppendRemainingElements and CutRemainingElements.

Ok, on it Edit: Done

draconware-dev / SpanExtensions.Net

Fixed and optimized split methods #11