apd: embed small coefficient values in Decimal struct

Fourth time's the charm.

Replaces cockroachdb/cockroach#74369. Replaces https://github.com/cockroachdb/apd/pull/101. Replaces https://github.com/cockroachdb/apd/pull/102.

This is an alternative to #102. At a high-level, it implements the "compact memory representation" described in https://github.com/cockroachdb/apd/pull/102#issuecomment-1005078022. Compared to the approach in that PR, this approach is a) faster for most operations, b) more usable because values can be safely copied, c) half the memory size (32 bytes per Decimal, vs. 64).

The memory representation of the Decimal struct in this approach looks like:

type Decimal struct {
    Form     int8
    Negative bool
    Exponent int32
    Coeff    BigInt {
        _inner  *big.Int // nil when value fits in _inline
        _inline [2]uint
    }
} // sizeof = 32

With a two-word inline array, any value that would fit in a 128-bit integer (i.e. decimals with a scale-adjusted absolute value up to 2^128 - 1) fit in _inline. The indirection through _inner is only used for values larger than this.

Before this change, the memory representation of the Decimal struct looked like:

type Decimal struct {
    Form     int64
    Negative bool
    Exponent int32
    Coeff    big.Int {
        neg bool
        abs []big.Word {
            data uintptr ---------------. 
            len  int64                  v
            cap  int64         [uint, uint, ...] // sizeof = variable, but around cap = 4, so 32 bytes
        }
    }
} // sizeof = 48 flat bytes + variable-length heap allocated array

This commit introduces a performance optimization that embeds small coefficient values directly in their Decimal struct, instead of storing these values in a separate heap allocation. It does so by replacing math/big.Int with a new wrapper type called BigInt that provides an "inline" compact representation optimization.

Each BigInt maintains (through big.Int) an internal reference to a variable-length integer value, which is represented by a []big.Word. The _inline field and the inner and updateInner methods combine to allow BigInt to inline this variable-length integer array within the BigInt struct when its value is sufficiently small. In the inner method, we point a temporary big.Int's nat slice at this _inline array. big.Int will avoid re-allocating this array until it is provided with a value that exceeds the initial capacity. Later in updateInner, we detect whether the array has been re-allocated. If so, we switch to using the _inner. If not, we continue to use this array.

We set the capacity of the inline array to accommodate any value that would fit in a 128-bit integer (i.e. values with an absolute value up to 2^128 - 1).

This is an alternative to an optimization that many other arbitrary precision decimal libraries have where small coefficient values are stored as numeric fields in their data type's struct. Only when this coefficient value gets large enough do these libraries fall back to a variable-length coefficient with internal indirection. We can see the optimization in practice in the ericlagergren/decimal library, where each struct contains a compact uint64 and an unscaled big.Int. Prior concern from the authors of cockroachdb/apd regarding this form of compact representation optimization was that it traded performance for complexity. The optimization fractured control flow, leaking out across the library and leading to more complex, error-prone code.

The approach taken in this commit does not have the same issue. All arithmetic on the decimal's coefficient is still deferred to bit.Int.

Impact on benchmarks:

name                   old time/op    new time/op    delta
GDA/rounding-10           607µs ± 1%     312µs ± 1%   -48.62%  (p=0.000 n=10+10)
GDA/squareroot-10        31.3ms ± 0%    16.4ms ± 1%   -47.47%  (p=0.000 n=7+10)
GDA/powersqrt-10          445ms ± 0%     252ms ± 1%   -43.39%  (p=0.000 n=9+10)
GDA/tointegral-10        31.6µs ± 1%    18.0µs ± 1%   -43.01%  (p=0.000 n=9+10)
GDA/tointegralx-10       31.9µs ± 1%    18.4µs ± 1%   -42.31%  (p=0.000 n=8+9)
GDA/remainder-10         55.4µs ± 1%    32.8µs ± 1%   -40.79%  (p=0.000 n=10+10)
GDA/abs-10               10.1µs ± 1%     6.2µs ± 1%   -38.95%  (p=0.000 n=10+9)
GDA/subtract-10           150µs ± 1%      97µs ± 1%   -35.72%  (p=0.000 n=10+10)
GDA/divideint-10         24.2µs ± 0%    15.9µs ± 1%   -34.24%  (p=0.000 n=9+10)
GDA/randoms-10           2.92ms ± 1%    1.92ms ± 1%   -34.21%  (p=0.000 n=10+10)
GDA/minus-10             11.2µs ± 1%     7.7µs ± 1%   -31.10%  (p=0.000 n=9+10)
GDA/quantize-10           113µs ± 0%      81µs ± 1%   -28.01%  (p=0.000 n=10+10)
GDA/compare-10           40.5µs ± 1%    29.4µs ± 1%   -27.46%  (p=0.000 n=10+9)
GDA/divide-10             365µs ± 1%     272µs ± 1%   -25.47%  (p=0.000 n=9+9)
GDA/reduce-10            17.8µs ± 1%    14.0µs ± 1%   -21.18%  (p=0.000 n=10+10)
GDA/add-10                763µs ± 1%     610µs ± 1%   -19.98%  (p=0.000 n=10+9)
GDA/multiply-10          68.5µs ± 1%    57.0µs ± 1%   -16.84%  (p=0.000 n=9+10)
GDA/comparetotal-10      29.4µs ± 1%    25.3µs ± 2%   -13.82%  (p=0.000 n=9+10)
GDA/plus-10              42.0µs ± 1%    36.8µs ± 1%   -12.33%  (p=0.000 n=9+10)
GDA/cuberoot-apd-10      2.06ms ± 3%    1.99ms ± 1%    -3.75%  (p=0.000 n=10+9)
GDA/exp-10                128ms ± 0%     126ms ± 0%    -1.01%  (p=0.000 n=8+10)
GDA/ln-10                83.9ms ± 0%    86.7ms ± 1%    +3.36%  (p=0.000 n=8+10)
GDA/log10-10              106ms ± 1%     112ms ± 1%    +5.51%  (p=0.000 n=9+10)
GDA/base-10               110µs ± 0%     116µs ± 1%    +5.76%  (p=0.000 n=8+9)
GDA/power-10              218ms ± 0%     231ms ± 1%    +6.00%  (p=0.000 n=9+10)

name                   old alloc/op   new alloc/op   delta
GDA/abs-10               2.33kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
GDA/compare-10           10.1kB ± 0%     0.0kB       -100.00%  (p=0.000 n=10+10)
GDA/comparetotal-10      7.11kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
GDA/minus-10             2.43kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
GDA/reduce-10            2.22kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
GDA/remainder-10         22.4kB ± 0%     0.1kB ± 0%   -99.57%  (p=0.000 n=10+10)
GDA/rounding-10           246kB ± 0%       3kB ± 0%   -98.75%  (p=0.000 n=10+10)
GDA/divideint-10         6.51kB ± 0%    0.10kB ± 0%   -98.53%  (p=0.000 n=10+10)
GDA/squareroot-10        8.30MB ± 0%    0.31MB ± 0%   -96.28%  (p=0.000 n=10+10)
GDA/randoms-10           1.16MB ± 0%    0.05MB ± 0%   -95.82%  (p=0.000 n=10+10)
GDA/powersqrt-10         77.6MB ± 0%     6.0MB ± 0%   -92.26%  (p=0.000 n=10+10)
GDA/divide-10            74.3kB ± 0%     9.7kB ± 0%   -86.95%  (p=0.000 n=10+10)
GDA/quantize-10          42.3kB ± 0%     9.2kB ± 0%   -78.23%  (p=0.002 n=8+10)
GDA/tointegralx-10       19.2kB ± 0%     6.6kB ± 0%   -65.78%  (p=0.000 n=10+10)
GDA/tointegral-10        19.1kB ± 0%     6.6kB ± 0%   -65.61%  (p=0.000 n=10+10)
GDA/subtract-10           100kB ± 0%      35kB ± 0%   -64.64%  (p=0.000 n=10+10)
GDA/multiply-10          35.5kB ± 0%    15.7kB ± 0%   -55.83%  (p=0.000 n=10+10)
GDA/cuberoot-apd-10       261kB ± 0%     129kB ± 0%   -50.46%  (p=0.000 n=9+10)
GDA/add-10                712kB ± 0%     459kB ± 0%   -35.53%  (p=0.000 n=9+9)
GDA/ln-10                10.2MB ± 0%     7.2MB ± 0%   -29.57%  (p=0.000 n=10+10)
GDA/log10-10             12.5MB ± 0%     9.1MB ± 0%   -27.14%  (p=0.000 n=10+10)
GDA/power-10             27.0MB ± 0%    20.7MB ± 0%   -23.35%  (p=0.000 n=9+10)
GDA/plus-10              46.6kB ± 0%    41.0kB ± 0%   -11.95%  (p=0.000 n=10+10)
GDA/exp-10               61.3MB ± 0%    60.6MB ± 0%    -1.09%  (p=0.000 n=10+10)
GDA/base-10              24.4kB ± 0%    26.2kB ± 0%    +7.35%  (p=0.000 n=10+10)

name                   old allocs/op  new allocs/op  delta
GDA/abs-10                  151 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/compare-10              638 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/comparetotal-10         254 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/minus-10                164 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/reduce-10               187 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/remainder-10            923 ± 0%         2 ± 0%   -99.78%  (p=0.000 n=10+10)
GDA/divideint-10            357 ± 0%         2 ± 0%   -99.44%  (p=0.000 n=10+10)
GDA/rounding-10           9.20k ± 0%     0.10k ± 0%   -98.96%  (p=0.000 n=10+10)
GDA/randoms-10            40.4k ± 0%      0.7k ± 0%   -98.26%  (p=0.000 n=10+10)
GDA/squareroot-10          275k ± 0%        6k ± 0%   -97.88%  (p=0.000 n=10+10)
GDA/quantize-10           1.70k ± 0%     0.05k ± 0%   -96.89%  (p=0.000 n=10+10)
GDA/tointegralx-10          677 ± 0%        48 ± 0%   -92.91%  (p=0.000 n=10+10)
GDA/tointegral-10           665 ± 0%        48 ± 0%   -92.78%  (p=0.000 n=10+10)
GDA/subtract-10           2.85k ± 0%     0.21k ± 0%   -92.63%  (p=0.000 n=10+10)
GDA/powersqrt-10          2.63M ± 0%     0.20M ± 0%   -92.20%  (p=0.000 n=10+10)
GDA/divide-10             2.51k ± 0%     0.20k ± 0%   -91.94%  (p=0.000 n=10+10)
GDA/add-10                13.2k ± 0%      4.2k ± 0%   -68.01%  (p=0.000 n=10+10)
GDA/multiply-10             931 ± 0%       312 ± 0%   -66.49%  (p=0.000 n=10+10)
GDA/cuberoot-apd-10       7.67k ± 0%     2.62k ± 0%   -65.91%  (p=0.000 n=10+10)
GDA/plus-10                 587 ± 0%       260 ± 0%   -55.71%  (p=0.000 n=10+10)
GDA/ln-10                  269k ± 0%      146k ± 0%   -45.71%  (p=0.000 n=10+10)
GDA/log10-10               326k ± 0%      186k ± 0%   -42.98%  (p=0.000 n=10+10)
GDA/power-10               699k ± 0%      423k ± 0%   -39.52%  (p=0.000 n=10+10)
GDA/exp-10                 911k ± 0%      696k ± 0%   -23.62%  (p=0.000 n=10+10)
GDA/base-10               2.11k ± 0%     2.52k ± 0%   +19.14%  (p=0.000 n=10+10)

cockroachdb / apd

apd: embed small coefficient values in Decimal struct #103

Impact on benchmarks: