This PR adds a new method for applying significand truncation to floating-point numbers. It applies the following to compute a truncated value:
x_t = 2^m x - (2^m - 1) x
where x is the value to truncate, m is the number of bits to truncate (m = 52 - n where n is the number of bits remaining in the significand) and x_t is the truncated value.
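The formula above only has an effect because it is evaluated in floating-point arithmetic: the product (2^m - 1) x is rounded to 53 significant bits, and that rounding discards the low-order bits of x. The following is a minimal sketch of the idea, not the code in this PR; the function name `truncate_arithmetic` and its signature are assumptions made for illustration.

```python
import math


def truncate_arithmetic(x: float, n: int) -> float:
    """Sketch of x_t = 2^m x - (2^m - 1) x for a float64 value.

    n is the number of significand bits to keep and m = 52 - n.
    The name and signature are hypothetical, not taken from the PR.
    """
    if not 0 <= n <= 52:
        raise ValueError("n must be between 0 and 52")
    m = 52 - n
    scale = 2.0 ** m  # a power of two, so scale * x is exact (barring overflow)
    # (scale - 1.0) * x is rounded to 53 bits; the low-order bits of x are lost here.
    return scale * x - (scale - 1.0) * x


# The lowest significand bit of 1 + 2^-52 is rounded away.
print(truncate_arithmetic(1.0 + 2.0 ** -52, 10))  # → 1.0
print(truncate_arithmetic(math.pi, 10))
```

The result stays close to the input (within 2^-10 for pi with n = 10), while the low-order bits of its significand come out as zero.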
This scheme is somewhat faster than the bitwise scheme (it takes about 2/3 of the runtime in a simple benchmark), but it is susceptible to overflow when extremely large values are truncated to a small number of significand bits, because the intermediate product 2^m x can exceed the largest representable value.
The scheme is currently opt-in, both because of the (small) risk of overflow and because it rounds slightly differently from the bitwise scheme (round to nearest, ties to even).
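The two caveats can be illustrated with a small comparison. This is a sketch under stated assumptions: `truncate_bitwise` assumes the bitwise scheme simply masks off the low m bits of the float64 significand (which may differ from the actual implementation), and both function names are hypothetical.

```python
import math
import struct


def truncate_bitwise(x: float, n: int) -> float:
    """Assumed form of the bitwise scheme: zero the low m = 52 - n bits."""
    m = 52 - n
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    bits &= ~((1 << m) - 1) & 0xFFFFFFFFFFFFFFFF  # clear the low m significand bits
    (result,) = struct.unpack("<d", struct.pack("<Q", bits))
    return result


def truncate_arithmetic(x: float, n: int) -> float:
    """Arithmetic scheme: x_t = 2^m x - (2^m - 1) x, evaluated in float64."""
    scale = 2.0 ** (52 - n)
    return scale * x - (scale - 1.0) * x


# Rounding: masking rounds toward zero, the arithmetic scheme rounds to nearest.
print(truncate_bitwise(1.875, 1))     # → 1.5
print(truncate_arithmetic(1.875, 1))  # → 2.0

# Overflow: 2^42 * 1e300 exceeds the float64 range, so inf - inf yields nan.
print(truncate_arithmetic(1e300, 10))  # → nan
print(math.isfinite(truncate_bitwise(1e300, 10)))  # → True
```

Values that already fit in n bits (for example 1.5 with n = 1) pass through both functions unchanged.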