Open Guiorgy opened 5 months ago
Looks good at first glance. I'll review, slightly adapt the style to better match the project, and merge it.
I originally created, that custom exception to avoid having to throw two different exceptions for before and after .Net 7, but if string.Split
uses ArgumentException
, then this sounds good to me. I'll look into the behaviour for null or empty delimiters as well, though I would prefer not to throw an exception. We'll see...
The methods look good to me, but I would still prefer not to merge the yet, since I'd like to conduct some unit tests to compare any any potential differences. Maybe waiting until the test suite is finished??
Sure. The only failing tests is with an empty delimiters span, since we haven't implemented that yet. Though, as you said, the tests are a bit lacking still.
Also, fyi SpanExtensions
still has StringSplitOptions
and count
reversed : Split(this Span<char> source, char delimiter, StringSplitOptions options, int count
.
Also, I have looked a bit more to the tests you've written and I believe, we should probably merge them into a feature branch
of dev with the src
files from dev and the tests
files from your pr. Just so we have them somewhat up to date.
I have a local branch with uptodate commits from the dev
and tests
branch for testing. Do you want me to merge the dev
branch into tests
, or make a separate branch for that?
I have a branch in this repo, called "feature-unit-testing", in which I have adapted your tests
branch as of yesterday to better suit the project's style and have brought it up to date with dev. Just take a look at the changes i made. if you agree with them, we have two options: we either merge any eventual changes you have locally into this branch and continue working on this branch, which is based of off dev
, into which it will eventually be merged, or you push any changes you have made to tests
to your fork and we then merge my updated branch with yours into your fork.
If you don't agree with the changes I made, then yes, do as you propose, and merge dev into a branch created from tests.
Also, unless I have gotten something dramatically wrong, we could refactor the monster that is GetAllStringSplitOptions
to the following:
public static StringSplitOptions[] GetAllStringSplitOptions()
{
return (StringSplitOptions[])Enum.GetValues(typeof(StringSplitOptions));
}
I'll look into them, and probably just merge them into the tests
branch. It was going to be merged with dev
eventually anyways.
As for the GetAllStringSplitOptions
method, the problem here is that StringSplitOptions
is an enum marked with the Falgs
attribute, which means you can combine options with the or (|
) operator.
Try running this:
var options = (StringSplitOptions[])Enum.GetValues(typeof(StringSplitOptions));
Console.WriteLine(string.Join(", ", options));
var options2 = GetAllStringSplitOptions().Select(o => o.ToString().Replace(",", " |"));
Console.WriteLine(string.Join(", ", options2));
The output should look like this:
None, RemoveEmptyEntries, TrimEntries
None, RemoveEmptyEntries, TrimEntries, RemoveEmptyEntries | TrimEntries
While Enum.GetValues
does get all defined enum values, it does not return their combinations when the enum is marked as flags.
The one I wrote was the simplest I could find, since the alternative would be trying to generate all permutations of the enums returned by Enum.GetValues
, and remove duplicates (since Enum.GetValues
also returns None
, and None | TrimEntries
is the same as TrimEntries
).
If you have an alternative, don't hesitate to suggest.
How about just hard-coding the values into an array?? They are unlikely to soon be changed and the current method seems overkill for such a simple purpose for a test suite.
An alternative would be to manually hardcode all possible values, but you'd have to remember to update them if a new one was ever added, and if a lot of values were added, the number of all possible combinations would explode in size. I wanted to make something that (hopefully) doesn't need to ever be modified again.
The only reason why that code wouldn't work anymore (as stated in the comment next to the Debug.Assert
), is if a new option was added that for some reason skipped a bit (for example, TrimEntries = 1 /* binary 001 */, SomeNewOption = 4 /* binary 100 */
), which shouldn't happen (fingers crossed).
If you prefer hardcoding those values, I'm fine with changing that.
Then let's leave it that way
Run into another edge case when expanding the fuzzing with parameterized ([Theory]
) testing.
If the separator span is empty in Split(source, delimiter)
the enumeration never stops and keeps returning empty spans forever. Yikes.
Looking at string.Split(string, string)
, if the separator string is empty, it assumes count = 1
and goes from there. I'll do the same, unless you have other thoughts?
if (separator!.Length == 0)
{
count = 1;
goto ShortCircuit;
}
else
{
return SplitInternal(separator, count, options);
}
Edit: Wait, I tested that against the old implementation (before this PR), give me a sec xd Edit2: It happens in this PR too 😬 Edit3: Fixed
The current implementation tests for the delimiterLength in every iteration of Move
, This is unnecessary, as delimiter.Length
doesn't change. The check should probably be in the constructor instead.
I know that, and that's what I do in the ones that have count
a a parameter. But I don't really see a way to do it from the constructor in the ones that don't.
When delimiter is empty, we need to shorcircuit, return the whole span and exit. The if
statement that does that, checks for delimiterIndex
which changes on every iteration, so there's no way to influence that from the constuctor. The only way I see is to either return an alternative IEnumerator
, or add an additional check in the MoveNext()
method.
Edit: If IndexOf
returned -1 when delimiter
was empty, that would also work.
The ones that have count
already do check if CurrentCount <= 1
, which can be iffluenced from the constructor, and is exactly what I did.
If you can come up with something, I am all ears.
One could at least do the comparision once in the constructor and the have a field like DelimiterIsEmpty
, which is set in the constructor to delimiter.IsEmpty
and then checked...
That is already basically what I do though?
DelimiterLength = Delimiter.Length; // cached in constructor
...
if(delimiterIndex == -1 || DelimiterLength == 0)
The only difference is that instead of a boolead, I am comparing an int to 0, but considering booleans are basically an int that is compared to 0 to check state, there's no real difference.
Yes, but in theory your version should trigger an invocation of the comparision operator and thereby the Equals
method.
And sure, this is something that the compiler will optimize away, but since we cannot predict which compiler the consumer of this library will use, it is better to stay stafe, because maybe the official compiler optimizes it, but maybe someone uses mono, and maybe mono doesn't optimize this, or IL2CPP doesn't. Also, a boolean is in C#, unlike in say C, the size of a byte not of an int.
As you can see here, in jit asm a dword
(4 bytes) is used for the comparison for the int . On the other hand, here with the boolean approach, a byte
is used. That is a difference in memory allocation. Now, whether this will actually result in a measurable performance difference is debattable, but as a matter of fact, they produce different jit asm and hence work differently. The asm is in favour of using the boolean.
Here are the screenshots top to bottom, on the first is the approach with (DelimiterLength) and the second DelimiterIsEmpty:
And yes that is micro optimization, but so is the mere concept of Spans in C#.
Also, the example uses an int instead of generics, as generics can't be jit compiled by sharplab in c# and dotnet.
Looking at the some of the changes you made recently on dev, you renamed CountExceedingBehavior
enum entries.
I had also done that in this PR, as I had mentioned previously, and was hoping for you input on that.
Personally, instead of AppendLastElements
or AppendRemainingElements
I prefer IncludeRemainingElements
, so include instead of append.
The way string.Split
words it is:
[...] the last string in the returned array will contain this instance's remaining trailing substring, [...]
So, since contain and include are synonyms, I felt like it was appropriate.
As for CutLastElements
and CutRemainingElements
, I went with DropRemainingElements
, since it's an antonym to include.
However, this isn't really that important, so if you say so, I'll update to the names you came up with.
I see; I agree it isn't that important now. I get the idea of the antonyms; however I felt and do still feel that 'append' is more descriptive as to where the remaining elements are actually included (they're appended to the last element). The downside of this approach is that the antonym is not necessarilly a good fit for the contrary behaviour. There is no need to Update the pull request right now as the differences are growing so complex that I will need to merge it into a helper branch and then resolve many conflicts.
The final decision on the name is going to be made before the release of version 1.3.0, because I do not want to introduce a breaking change later on.
There's 1 more thing missing in the current implementation. All the Span Split Enumerators work and return ReadOnlySpan
, even if the source is a Span<T>
. This means that the following is impossible:
char[] source = "Lorem ipsum dolor sit amet...";
// Split source by space and capitalize the first letter of every word
foreach (var word in source.AsSpan().Split(' '))
{
if (word.Length == 0) continue;
word[0] = word[0].ToUpper(); // impossible, since word is a ReadOnlySpan<char>
}
To fix this, we'd need to duplicate every split enumerable and make it work with normal mutable spans, and update all extension methods and tests. We could do that now, or maybe try to make a source generator first to not have to do all that manually?
That seems like a job for the to be source generator
I also added a check for validity of StringSplitOptions
, which throws an exception when invalid. This is to copy what string.Split
does.
We should decide what are the final names for CountExceedingBehaviour
, so I can fix the 2 PRs.
I am pretty set on AppendRemainingElements
and CutRemainingElements
.
I am pretty set on
AppendRemainingElements
andCutRemainingElements
.
Ok, on it Edit: Done
I also replaced the custom
InvalidCountExceedingBehaviourException
with the builtinArgumentException
since that is whatstring.Split
throws whenStringSplitOptions
is invalid. If that is unacceptable, the commit can be dropped.I also renamed the
CountExceedingBehaviour
options to, what I think, better explains them. Also reworded the summary comments. Again, if you disagree with that, you may drop that commit.There's 1 more gatcha in
string.Split
: Ifseparators
(delimiters
inSplitAny
in our case) is an empty collection, the function automatically assumes that we wish to split the string with white spaces:Sigh... So there are several ways to handle this:
char.IsWhiteSpace()
. Though I am assuming this is slower than the option above, otherwise why would .NET do it differently.