Open Grevor opened 4 years ago
Tagging subscribers to this area: @tannergooding, @pgovind See info in area-owners.md if you want to be subscribed.
I'm generally happy to take perf PRs, provided all tests pass and numbers look good (both local and those on the perf lab hardware).
@adamsitnik please take a look. It seems that the issue can be closed with #83457
It looks like #83457 has included the + and - perf improvements; how about * and /? It would be great to try them and see what perf benefits we can get. Since this seems to be a small change, I can give it a quick try and report back.
Yes, it needs to be clarified. But the Karatsuba/Toom-Cook branch already uses the Add method, which contains this improvement.
The BigInteger class is fast already, but after doing a bit-twiddling kata/review on the source code I realized there were some further improvements to be made. Due to the ongoing effort to spanify BigInteger (#35565), I will leave the proposed code changes and their rationales below, with performance measurements as deltas from the existing code. I see no reason the same shouldn't apply to the spanified version as well. The code is spanified for later convenience.
All code can be found in the BigIntegerCalculator.* files.
If the proposed changes are accepted (with some further refinements, of course) I am willing to do the implementation on top of the spanification once it is merged. Further benchmarking also needs to be done to ensure this is not a "my machine" thing.
Configuration
.NET 5.0 (from repository, commit 3ac735b4b98811f23e89a36570d4e98994a191c3)
Windows 10 19041 x64
Intel i7-8700K (Unclocked for the benchmark runs)
Add
Add should impose a check for carry == 0 in the last loop of both the half trivial case and the "full add":
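A minimal sketch of that check for the "full add", assuming the usual BigIntegerCalculator layout (little-endian uint limbs, left at least as long as right, bits one limb longer than left). The local function and the sample values are illustrative only, not the repository code:

```csharp
using System;

// Hypothetical stand-in for the "full add".
// Assumes left.Length >= right.Length and bits.Length == left.Length + 1.
static void Add(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
{
    int i = 0;
    long carry = 0L;

    // Limbs present in both operands.
    for (; i < right.Length; i++)
    {
        long digit = (left[i] + carry) + right[i];
        bits[i] = unchecked((uint)digit);
        carry = digit >> 32;
    }

    // Keep propagating only while there is a carry. The non-short-circuit &
    // evaluates both comparisons and combines them into one loop condition.
    for (; carry != 0 & i < left.Length; i++)
    {
        long digit = left[i] + carry;
        bits[i] = unchecked((uint)digit);
        carry = digit >> 32;
    }

    // Either the carry is gone or left is exhausted: the rest is a plain copy.
    left.Slice(i).CopyTo(bits.Slice(i));
    bits[left.Length] = (uint)carry;
}

uint[] big = { 0xFFFFFFFF, 0xFFFFFFFF, 5, 7 };
uint[] small = { 1 };
uint[] sum = new uint[big.Length + 1];
Add(big, small, sum);
Console.WriteLine(string.Join(", ", sum)); // prints 0, 0, 6, 7, 0
```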
Note the use of a single &. This is to keep a single branch instruction. The JIT seems to fold this nicely in the x64 disassembly (it is one more instruction). The single & could also be used in AddSelf. In general, the carry will very quickly reach zero, so this should enable big + small additions to be mainly a memmove. For some reason the change also gave better performance in cases where I did not expect it. The results are repeatable with new baselines as well.
Subtract
The same as for Add. The carry will tend toward zero.
The results are repeatable.
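The same idea sketched for subtraction, under similar assumptions (left is numerically at least right, bits is at least as long as left, names are illustrative). Here the carry is a borrow of 0 or -1, and once it is back at zero the remaining limbs of left are simply copied:

```csharp
using System;

// Hypothetical stand-in for Subtract.
// Assumes left >= right as numbers and bits.Length >= left.Length.
static void Subtract(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
{
    int i = 0;
    long carry = 0L; // 0 or -1 (borrow)

    for (; i < right.Length; i++)
    {
        long digit = (left[i] + carry) - right[i];
        bits[i] = unchecked((uint)digit);
        carry = digit >> 32; // arithmetic shift keeps the sign: 0 or -1
    }

    // Propagate the borrow only while it is still pending.
    for (; carry != 0 & i < left.Length; i++)
    {
        long digit = left[i] + carry;
        bits[i] = unchecked((uint)digit);
        carry = digit >> 32;
    }

    // No borrow left: copy the untouched high limbs.
    left.Slice(i).CopyTo(bits.Slice(i));
}

uint[] minuend = { 0, 0, 3, 9 };
uint[] subtrahend = { 1 };
uint[] difference = new uint[minuend.Length];
Subtract(minuend, subtrahend, difference);
Console.WriteLine(string.Join(", ", difference)); // prints 4294967295, 4294967295, 2, 9
```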
Divide
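The change in this section concerns how SubtractDivisor tracks the borrow. A minimal sketch of one branch-free way to do that, assuming q fits in 32 bits and left has at least as many limbs as right. It illustrates the idea and is not necessarily the exact code that was measured:

```csharp
using System;

// Hypothetical branch-free variant: subtracts q * right from the low limbs
// of left and returns the carry still owed by the next limb of left.
// Assumes left.Length >= right.Length and q < 2^32.
static uint SubtractDivisor(Span<uint> left, ReadOnlySpan<uint> right, ulong q)
{
    ulong carry = 0UL;
    for (int i = 0; i < right.Length; i++)
    {
        carry += right[i] * q;               // 64-bit product plus incoming carry
        uint digit = unchecked((uint)carry);
        carry >>= 32;

        // 64-bit subtraction wraps below zero exactly when left[i] < digit,
        // which sets every bit above bit 31.
        ulong diff = unchecked((ulong)left[i] - digit);
        left[i] = unchecked((uint)diff);
        carry += (diff >> 32) & 1;           // borrow as 0 or 1, no conditional jump
    }
    return unchecked((uint)carry);
}

uint[] rem = { 5, 10, 1 };
uint[] div = { 3, 4 };
uint owed = SubtractDivisor(rem, div, 2);
Console.WriteLine(string.Join(", ", rem)); // prints 4294967295, 1, 1
Console.WriteLine(owed);                   // prints 0
```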
This is where it gets wild. Changing SubtractDivisor accordingly removes a TON of branch mispredictions. Measurements are needed on 32-bit machines due to the 64-bit operations.
The results are a bit varied on this one, surprisingly. I do not fully understand why the first case degrades so much. It did have a much lower misprediction rate, so there could be further optimizations to be done, perhaps by checking the value of q (for example, single binary digit q values give a non-uniform distribution under truncated multiplication with random left and right, i.e. good for the current algorithm) and selecting either the current code or the proposed one. On average I would expect the proposed code to perform better, as the ratio of true to false branches is fairly close to 1 on average (with a slight overweight toward branch taken).
Multiply
The last one I have is uncertain. It has to do with loop unrolling in the trivial case of Multiply. This enables the use of dirty result buffers. The effects can only be seen in PowMod (with the zeroing of the BitBuffers removed). However, there are some potential negative effects due to increased code size (double the instruction cache misses? Hard to say; processors are good at predictively streaming into them these days).
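One way to read this, sketched under assumptions: peel the first row of the schoolbook multiply out of the loop so that it stores into bits instead of accumulating. Every limb is then written before it is read, so the buffer does not need to be zeroed first. The local function below is hypothetical and assumes right has at least one limb and bits has exactly left.Length + right.Length limbs:

```csharp
using System;

// Hypothetical schoolbook multiply that never reads stale data from bits.
// Assumes right.Length >= 1 and bits.Length == left.Length + right.Length.
static void Multiply(ReadOnlySpan<uint> left, ReadOnlySpan<uint> right, Span<uint> bits)
{
    // First row, peeled out of the loop: plain stores, so bits may hold garbage.
    ulong carry = 0UL;
    for (int j = 0; j < left.Length; j++)
    {
        ulong digits = carry + (ulong)left[j] * right[0];
        bits[j] = unchecked((uint)digits);
        carry = digits >> 32;
    }
    bits[left.Length] = (uint)carry;

    // Remaining rows accumulate onto limbs that were already written above.
    for (int i = 1; i < right.Length; i++)
    {
        carry = 0UL;
        for (int j = 0; j < left.Length; j++)
        {
            ulong digits = bits[i + j] + carry + (ulong)left[j] * right[i];
            bits[i + j] = unchecked((uint)digits);
            carry = digits >> 32;
        }
        bits[i + left.Length] = (uint)carry;
    }
}

uint[] a = { 2, 3 };
uint[] b = { 4, 5 };
uint[] product = new uint[a.Length + b.Length];
Array.Fill(product, 0xDEADBEEFu); // deliberately dirty buffer
Multiply(a, b, product);
Console.WriteLine(string.Join(", ", product)); // prints 8, 22, 15, 0
```

The price is a second copy of the inner loop, which is where the code size concern comes from.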