dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.08k stars 4.69k forks source link

BigInteger performance improvements #41495

Open Grevor opened 4 years ago

Grevor commented 4 years ago

The BigInteger class is fast already, but after doing a bit-twiddling kata/review on the source code I realized there were some further improvements to be made. Due to the ongoing effort to spanify the BigInteger #35565 I will leave the proposed code changes and their rationales below, with performance measurements as delta from existing code. I see no reason the same shouldn't apply for the spanified version as well. The code is spanified for later convenience.

All code can be found in the BigIntegerCalculator.* files.

If the proposed changes are accepted (with some further refinements, of course) I am willing to do the implementation on top of the spanification once it is merged. Further benchmarking also needs to be done to ensure this is not a "my machine" thing.

Configuration

.NET 5.0 (from repository, commit 3ac735b4b98811f23e89a36570d4e98994a191c3) Windows 10 19041 x64 Intel i7-8700K (Unclocked for the benchmark runs)

Add

Add should impose a check for carry == 0 in the last loop of both the half trivial case and the "full add":

            int i = 0;
            for ( ; carry != 0 & i < left.Length; i++)
            {
                long digit = left[i] + carry;
                bits[i] = unchecked((uint)digit);
                carry = digit >> 32;
            }
            left.Slice(i).CopyTo(bits.Slice(i));
            bits[left.Length] = (uint)carry;

Note the use of single &. This is to keep the single branch instructions. The JIT seem to fold this nicely in the x64 disassembly (It's one more instruction). The single & could also be used in AddSelf. In general, the carry will very quickly reach zero, so this should enable big + small additions to mainly be memmove. For some reason the change also gave better performance on cases I did not expect. The results are repeatable with new baselines as well.

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536,65536 bits)      |      1.14 |          2403.37 |          2102.15 |         |
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 16,16 bits)            |      1.14 |             7.30 |             6.43 |         |

Subtract

The same as for Add. The carry will tend toward zero.

The results are repeatable.

| Faster                                                                      | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 16,16 bits)       |      1.07 |             7.60 |             7.07 |         |
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits) |      1.14 |          2391.64 |          2094.65 |         |

Divide

This is where it gets wild. Changing the SubtractDivisor accordingly removes a TON of branch misspredictions. Measurements are needed on 32-bit machines due to the 64 bit operations.

           ulong carry = 0UL;

            for (int i = 0; i < right.Length; i++)
            {
                carry += right[i] * q;
                uint digit = unchecked((uint)carry);
                carry = carry >> 32;
                ulong newDigit = unchecked((ulong)left[i] - digit);
                carry += (newDigit  >> 32 ) & 0x1; // This is the same as if (leftElement < digit) ++carry
                left[i] = unchecked((uint)newDigit);
            }

            return (uint)carry;

The results are a bit varied on this one, surprisingly. I do not fully understand why the first case degrades so much. It did have much lower missprediction rate, so there could be further optimizations to be done by perhaps checking the value q (for example, single binary digit q values gives a non-uniform distribution under truncated multiplication with random left and right, IE. good for the current algorithm) and selecting the current code or the proposed one. On average I would suspect the proposed code to perform better, as the ratio of true to false branches is fairly close to 1 on average (slight overweight on branch taken).

| Slower                                                                 | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Divide(arguments: 1024,512 bits) |      1.27 |           524.26 |           665.83 |         |
| System.Numerics.Tests.Perf_BigInteger.Remainder(arguments: 1024,512 bits)        |      1.31 |           524.66 |           685.61 |         |

| Faster                                                                    | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.Divide(arguments: 65536,32768 bits) |      2.63 |       4249482.81 |       1617236.56 |         |
| System.Numerics.Tests.Perf_BigInteger.Remainder(arguments: 65536,32768 bits) |      2.61 |       4257183.59 |       1633850.63 |   

Multiply

The last one I have is uncertain. It has to do with loop-unrolling in the trivial case of Multiply. This enables the use of a dirty result buffers. The effects can only be seen in PowMod (with removed zeroing of the BitBuffers). However, there are some potential negative effects due to increased code size (double the instruction cache misses? Hard to say, processors are good at predictively streaming into them these days).

                if (right.Length <= 0)
                {
                    result.Clear();
                    return;
                }
                // Implied i = 0 and result = 0
                ulong carry = 0UL;
                for (int j = 0; j < left.Length; j++)
                {
                    ulong digits = carry
                        + (ulong)left[j] * right[0];
                    result[j] = unchecked((uint)digits);
                    carry = digits >> 32;
                }
                result[left.Length] = (uint)carry;

                for (int i = 1; i < right.Length; i++)
                {
                    carry = 0UL;
                    for (int j = 0; j < left.Length; j++)
                    {
                        ulong digits = result[i + j] + carry
                            + (ulong)left[j] * right[i];
                        result[i + j] = unchecked((uint)digits);
                        carry = digits >> 32;
                    }
                    result[i + left.Length] = (uint)carry;
                }
| Faster                                                                       | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Numerics.Tests.Perf_BigInteger.ModPow(arguments: 1024,1024,64 bits)   |      1.14 |        130853.59 |        114545.93 |         |
| System.Numerics.Tests.Perf_BigInteger.ModPow(arguments: 16384,16384,64 bits) |      1.09 |       2186651.56 |       1998425.00 |         |
ghost commented 4 years ago

Tagging subscribers to this area: @tannergooding, @pgovind See info in area-owners.md if you want to be subscribed.

tannergooding commented 4 years ago

I'm generally happy to take perf PRs, provided all tests pass and numbers look good (both local and those on the perf lab hardware).

speshuric commented 1 year ago

@adamsitnik please take a look. It seems that the issue can be closed with #83457

adamsitnik commented 1 year ago

@adamsitnik please take a look. It seems that the issue can be closed with #83457

It looks like #83457 has included the + and - perf improvements, how about * and / ? It would be great to try them and see what perf benefits we can get. Since this seems to be a small change I can give it a quick try and report back.

speshuric commented 1 year ago

Yes, it need to be clarified. But Karatsuba/Toom-Cook branch already use Add method which contains this improvement.