apd: optimize Quo, remove per-digit long division

Fixes #98.

This commit speeds up Context.Quo, replacing its per-digit long division with a call to BigInt.QuoRem. This avoids per-digit iteration. It also allows Context.Quo to benefit further from the inline fast-path of BigInt.QuoRem for small coefficient values that fit in a uint64.

This is a partial revert of 1eddda3, at least in spirit. Before that commit, Context.Quo did try to use big.Int.Quo. Unfortunately, it was inaccurate for certain values because it was not scaling the coefficients correctly before dividing. It tried to compensate for this by using a very large precision (c.Precision*2 + 8), but this was insufficient on its own because it was not taking the size of the values into account. So the commit abandoned that approach.

However, that commit, which was based on the description given on the GDA site, did demonstrate how to scale the coefficients in a way that would permit this kind of division. Specifically, it began adjusting the operands so that the coefficient of the dividend was greater than or equal to the coefficient of the divisor and was also less than ten times the coefficient of the divisor.

This commit uses the coefficient adjustment introduced in 1eddda3 to revive the call into BigInt.QuoRem. With the two operands adjusted, it now is possible to scale the coefficient of the dividend by the desired precision and then perform a direct division on the operands.

Microbenchmarks

name                 old time/op    new time/op    delta
GDA/exp-10              116ms ± 0%      21ms ± 0%  -82.28%  (p=0.000 n=9+9)
GDA/cuberoot-apd-10    1.76ms ± 0%    0.40ms ± 0%  -77.47%  (p=0.000 n=9+10)
GDA/log10-10            102ms ± 0%      25ms ± 0%  -74.99%  (p=0.000 n=10+9)
GDA/ln-10              78.3ms ± 0%    19.8ms ± 1%  -74.69%  (p=0.000 n=10+10)
GDA/power-10            212ms ± 0%      56ms ± 0%  -73.46%  (p=0.000 n=9+9)
GDA/divide-10           229µs ± 0%      65µs ± 1%  -71.72%  (p=0.000 n=10+10)
GDA/powersqrt-10        190ms ± 0%      74ms ± 0%  -61.19%  (p=0.000 n=10+10)
GDA/squareroot-10      12.2ms ± 0%     5.2ms ± 0%  -57.32%  (p=0.000 n=8+9)
GDA/rounding-10         249µs ± 0%     166µs ± 0%  -33.35%  (p=0.000 n=8+10)
GDA/randoms-10         1.70ms ± 0%    1.36ms ± 0%  -20.06%  (p=0.000 n=9+10)
GDA/reduce-10          10.3µs ± 1%    10.2µs ± 1%   -1.24%  (p=0.000 n=10+10)
GDA/remainder-10       26.5µs ± 0%    26.4µs ± 0%   -0.42%  (p=0.000 n=10+10)
GDA/base-10             112µs ± 0%     112µs ± 0%   -0.21%  (p=0.004 n=10+10)
GDA/abs-10             4.48µs ± 1%    4.48µs ± 3%     ~     (p=0.154 n=9+9)
GDA/add-10              550µs ± 1%     551µs ± 1%     ~     (p=0.400 n=10+9)
GDA/compare-10         25.6µs ± 1%    25.7µs ± 1%     ~     (p=0.101 n=10+8)
GDA/comparetotal-10    23.9µs ± 0%    23.9µs ± 1%     ~     (p=0.234 n=9+10)
GDA/divideint-10       13.3µs ± 1%    13.3µs ± 0%     ~     (p=0.363 n=10+10)
GDA/minus-10           5.98µs ± 0%    5.98µs ± 1%     ~     (p=0.196 n=10+10)
GDA/multiply-10        50.1µs ± 0%    50.1µs ± 0%     ~     (p=0.481 n=10+10)
GDA/plus-10            33.1µs ± 0%    33.1µs ± 1%     ~     (p=0.315 n=10+10)
GDA/quantize-10        68.2µs ± 0%    68.5µs ± 1%     ~     (p=0.497 n=9+10)
GDA/subtract-10        81.0µs ± 1%    80.8µs ± 1%     ~     (p=0.219 n=9+10)
GDA/tointegral-10      16.0µs ± 1%    16.0µs ± 1%     ~     (p=0.888 n=9+10)
GDA/tointegralx-10     16.5µs ± 1%    16.6µs ± 1%     ~     (p=0.447 n=9+10)

name                 old alloc/op   new alloc/op   delta
GDA/exp-10             60.6MB ± 0%     9.6MB ± 0%  -84.21%  (p=0.000 n=10+10)
GDA/squareroot-10       307kB ± 0%     145kB ± 0%  -52.90%  (p=0.000 n=9+10)
GDA/cuberoot-apd-10     129kB ± 0%     122kB ± 0%   -5.66%  (p=0.000 n=10+10)
GDA/abs-10              0.00B          0.00B          ~     (all equal)
GDA/add-10              459kB ± 0%     459kB ± 0%     ~     (p=0.504 n=10+10)
GDA/base-10            26.2kB ± 0%    26.2kB ± 0%     ~     (all equal)
GDA/compare-10          0.00B          0.00B          ~     (all equal)
GDA/comparetotal-10     0.00B          0.00B          ~     (all equal)
GDA/divideint-10        96.0B ± 0%     96.0B ± 0%     ~     (all equal)
GDA/minus-10            0.00B          0.00B          ~     (all equal)
GDA/multiply-10        15.7kB ± 0%    15.7kB ± 0%     ~     (all equal)
GDA/plus-10            41.0kB ± 0%    41.0kB ± 0%     ~     (all equal)
GDA/quantize-10        9.20kB ± 0%    9.20kB ± 0%     ~     (all equal)
GDA/randoms-10         48.5kB ± 0%    48.5kB ± 0%     ~     (all equal)
GDA/reduce-10           0.00B          0.00B          ~     (all equal)
GDA/remainder-10        96.0B ± 0%     96.0B ± 0%     ~     (all equal)
GDA/rounding-10        3.07kB ± 0%    3.07kB ± 0%     ~     (all equal)
GDA/subtract-10        35.2kB ± 0%    35.2kB ± 0%     ~     (all equal)
GDA/tointegral-10      6.58kB ± 0%    6.58kB ± 0%     ~     (all equal)
GDA/tointegralx-10     6.58kB ± 0%    6.58kB ± 0%     ~     (all equal)
GDA/powersqrt-10       6.01MB ± 0%    6.27MB ± 0%   +4.28%  (p=0.000 n=10+10)
GDA/power-10           20.7MB ± 0%    22.6MB ± 0%   +9.20%  (p=0.000 n=10+10)
GDA/divide-10          9.70kB ± 0%   10.59kB ± 0%   +9.24%  (p=0.000 n=10+10)
GDA/ln-10              7.16MB ± 0%    7.91MB ± 0%  +10.42%  (p=0.000 n=10+10)
GDA/log10-10           9.12MB ± 0%   10.27MB ± 0%  +12.58%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
GDA/exp-10               696k ± 0%      127k ± 0%  -81.71%  (p=0.000 n=10+10)
GDA/squareroot-10       5.81k ± 0%     3.86k ± 0%  -33.57%  (p=0.000 n=10+10)
GDA/cuberoot-apd-10     2.62k ± 0%     2.43k ± 0%   -7.03%  (p=0.000 n=10+10)
GDA/abs-10               0.00           0.00          ~     (all equal)
GDA/add-10              4.21k ± 0%     4.21k ± 0%     ~     (all equal)
GDA/base-10             2.52k ± 0%     2.52k ± 0%     ~     (all equal)
GDA/compare-10           0.00           0.00          ~     (all equal)
GDA/comparetotal-10      0.00           0.00          ~     (all equal)
GDA/divideint-10         2.00 ± 0%      2.00 ± 0%     ~     (all equal)
GDA/minus-10             0.00           0.00          ~     (all equal)
GDA/multiply-10           312 ± 0%       312 ± 0%     ~     (all equal)
GDA/plus-10               260 ± 0%       260 ± 0%     ~     (all equal)
GDA/quantize-10          53.0 ± 0%      53.0 ± 0%     ~     (all equal)
GDA/randoms-10            704 ± 0%       704 ± 0%     ~     (all equal)
GDA/reduce-10            0.00           0.00          ~     (all equal)
GDA/remainder-10         2.00 ± 0%      2.00 ± 0%     ~     (all equal)
GDA/rounding-10          96.0 ± 0%      96.0 ± 0%     ~     (all equal)
GDA/subtract-10           210 ± 0%       210 ± 0%     ~     (all equal)
GDA/tointegral-10        48.0 ± 0%      48.0 ± 0%     ~     (all equal)
GDA/tointegralx-10       48.0 ± 0%      48.0 ± 0%     ~     (all equal)
GDA/powersqrt-10         205k ± 0%      210k ± 0%   +2.33%  (p=0.000 n=10+10)
GDA/divide-10             202 ± 0%       213 ± 0%   +5.45%  (p=0.000 n=10+10)
GDA/power-10             423k ± 0%      450k ± 0%   +6.51%  (p=0.000 n=10+10)
GDA/ln-10                146k ± 0%      157k ± 0%   +7.43%  (p=0.000 n=10+10)
GDA/log10-10             186k ± 0%      204k ± 0%   +9.61%  (p=0.000 n=10+10)

From the benchmark referenced in #98:

The reported results were:

Benchmark/1_Ops:_apd.Decimal128.Add-16           5285344           218 ns/op
Benchmark/1_Ops:_apd.Decimal128.Sub-16           5665453           208 ns/op
Benchmark/1_Ops:_apd.Decimal128.Mul-16           9138602           138 ns/op
Benchmark/1_Ops:_apd.Decimal128.Quo-16            162537          7790 ns/op

After #103 and this PR, they are now:

Benchmark/1_Ops:_Decimal128.Add-10  11128542           109.5 ns/op
Benchmark/1_Ops:_Decimal128.Sub-10  12875046            94.40 ns/op
Benchmark/1_Ops:_Decimal128.Mul-10  16563775            75.21 ns/op
Benchmark/1_Ops:_Decimal128.Quo-10   3055980           398.4 ns/op

cockroachdb / apd

apd: optimize Quo, remove per-digit long division #114

Microbenchmarks