Open DrewScoggins opened 3 years ago
https://github.com/dotnet/runtime/pull/38001 is the change that is causing this. Here are the changes that went in between - https://github.com/dotnet/runtime/compare/10ed7401853185a1dbbe24e5a56240ae293e1486...740121d09706a7e91097ddb286baf5ff539a6d13. Assigning to @danmoseley to further triage and assign as appropriate.
@bbartels did you run the relevant perf benchmarks in this repo on your PR? I had thought you had, but perhaps you ran only your own created benchmarks?
I'll push a up a revert and we can figure this out then get the change in again. cc @tannergooding
I thought we had also validated with some benchmarks @GrabYourPitchforks shared: https://github.com/dotnet/runtime/pull/38001#issuecomment-645670458 and https://github.com/dotnet/runtime/pull/38001#discussion_r602532928
Will need to investigate to see what's causing the regression in for these scenarios in particular.
If those other benchmarks are useful, it might be good to get them added to dotnet/performance as well.
@bbartels did you run the relevant perf benchmarks in this repo on your PR? I had thought you had, but perhaps you ran only your own created benchmarks?
I'll push a up a revert and we can figure this out then get the change in again. cc @tannergooding
I did yeah, should be in the thread. I remember having discussed this and we concluded that the performance trade-off was worth it. Note, that the regression benchmark is splitting on a separator that occurs every other character. There is some overhead when the frequency of separators is high, such as in the example above. However, with a separator every three or more characters it does not result in regressions anymore.
I agree with @bbartels. Even though this microbenchmark shows a regression, I don't believe the microbenchmark is representative of real-world inputs. Back in the original thread I gave an example showing how the PR behaves when parsing CSV - which is a very typical use case for string.Split
- and it showed measurable gains. My opinion is that we should keep the PR as-is unless we can trace some aspnet or other non-microbenchmark regression to this change.
If that is the case, I would recommend updating/adding the realistic microbenchmark - @DrewScoggins
I did yeah, should be in the thread.
My bad, it got really long 😊
My opinion is that we should keep the PR as-is
This sounds good to me, if we can replace the microbenchmark instead. @tannergooding ?
That sounds reasonable to me. I think we want a new benchmark (and maybe to remove the existing one) rather than replacing the existing one so things can be tracked properly, but that is likely a question for @DrewScoggins and @adamsitnik.
The method does do a lot of stack push/pop, including 4 vectors
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp, 0x88
vzeroupper
vmovaps [rsp+0x70], xmm6
vmovaps [rsp+0x60], xmm7
vmovaps [rsp+0x50], xmm8
vmovaps [rsp+0x40], xmm9
...
vmovaps xmm6, [rsp+0x70]
vmovaps xmm7, [rsp+0x60]
vmovaps xmm8, [rsp+0x50]
vmovaps xmm9, [rsp+0x40]
add rsp, 0x88
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
ret
Adding an extra overload for only one separator cuts that by 2 vectors
push r15
push r14
push r13
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp, 0x68
vzeroupper
vmovaps [rsp+0x50], xmm6
vmovaps [rsp+0x40], xmm7
...
vmovaps xmm6, [rsp+0x50]
vmovaps xmm7, [rsp+0x40]
add rsp, 0x68
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r13
pop r14
pop r15
ret
As well as dropping 4 SIMD instructions
cmp = Sse2.Or(Sse2.CompareEqual(charVector, v2), cmp);
cmp = Sse2.Or(Sse2.CompareEqual(charVector, v3), cmp);
vmovupd xmm0, [r15+rcx*2]
+ vpcmpeqw xmm0, xmm0, xmm7
- vpcmpeqw xmm1, xmm0, xmm7
- vpcmpeqw xmm2, xmm0, xmm8
- vpor xmm1, xmm2, xmm1
- vpcmpeqw xmm0, xmm0, xmm9
- vpor xmm1, xmm0, xmm1
- vptest xmm1, xmm1
+ vptest xmm0, xmm0
So that might be worth considering as well; even though its more code?
@benaadams I could run some benchmarks tomorrow to see if it would result in significant enough perf gains to warrant the change!
Yeah, if we have what we think are more representative Split perf tests, adding those as new tests to the perf suite should be exactly what we do.
@PranavSenthilnathan We want to replace the existing dotnet/performance benchmarks with the ones used in the PR that implemented the change per the comments above.
Once the new benchmarks are merged in and we get some baseline data from them, we can pursue the optimization suggested. The new benchmark should be able to run against .NET 8+ too, so we'll be able to see a difference between .NET 8/9 and .NET 10 if the change proves to have a measurable effect.
Run Information
Regressions in System.Tests.Perf_String
Historical Data in Reporting System
Repro