andrewthad opened this issue 5 years ago

The internal representation of `Scientific` is currently a boxed `Integer` coefficient paired with an unpacked `Int` exponent.
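For reference, a rough sketch of that layout (field names assumed from the scientific package's documentation):

```haskell
-- Current representation (sketch): the coefficient is a boxed Integer,
-- so even a small coefficient sits behind a pointer, while the exponent
-- is an unpacked machine Int stored directly in the constructor.
data Scientific = Scientific
  { coefficient    :: !Integer
  , base10Exponent :: {-# UNPACK #-} !Int
  }
```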
It's been pointed out on another issue that this is incorrect, but I suspect that, in the process of fixing it, there is an additional optimization that may be worth making. The type currently makes you chase a pointer to get to the coefficient, even when the coefficient is something small that fits inside of `S#`. But what if this was done instead?
```haskell
data Scientific = Scientific
  (# (# Int#, Int# #) | (# Int#, Integer #) | (# Integer, Int# #) | (# Integer, Integer #) #)
```

This would only work on GHC 8.6 and higher because using `UnboxedSums` in this way would corrupt codegen before then. However, in the extremely common case in which the coefficient and the exponent are both small (less than 2^32 or 2^64 depending on platform), this can keep all the data on a single cache line. It could be argued that it makes the data constructor take up more space. This representation has a 6-machine-word payload (2 for the `Int#`s, 2 for the `Integer`s, and 2 for the sum's toggle words), while the simpler `data Scientific = Scientific !Integer !Integer` takes only 2 machine words. However, the `S#` data constructors that `Integer` requires for small integers mean that we actually end up needing 4 additional machine words anyway; the difference is that the data is just spread more widely across memory. In the case where either the coefficient or the exponent is large, the representation I suggest here would consume more memory than the naive one and perform slightly worse.

Ignoring the abysmal backwards-compatibility issue for the moment, the important question is whether it's better to optimize for small integers or for large ones. My preference is small integers, since I see `Scientific` as a safe intermediary for lexers and parsers that are, in practice, applied to small numbers most of the time (the prevalent example being `aeson`).

If the project author finds this idea agreeable, I'm happy to implement this and verify that it improves the performance of the library. I'd do it in a way that shims out a non-unboxed-sum implementation on older GHCs.
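A minimal sketch of how such a shim might look, assuming a CPP guard on the compiler version; the module arrangement and the boxed fallback are illustrative, not an existing patch:

```haskell
{-# LANGUAGE CPP #-}
#if __GLASGOW_HASKELL__ >= 806
{-# LANGUAGE MagicHash, UnboxedTuples, UnboxedSums #-}
#endif

-- Sketch of shimming out the unboxed-sum representation on older GHCs:
-- on 8.6 and later the payload is the unboxed sum, everywhere else it
-- falls back to a plain pair of boxed Integers. Callers would only ever
-- construct and inspect values through smart constructors, so they
-- never see which branch was compiled.
#if __GLASGOW_HASKELL__ >= 806
import GHC.Exts (Int#)

data Scientific = Scientific
  (# (# Int#, Int# #) | (# Int#, Integer #) | (# Integer, Int# #) | (# Integer, Integer #) #)
#else
data Scientific = Scientific !Integer !Integer
#endif
```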
Other possible flavors:

```haskell
data Scientific
  = SmallSmall !Int !Int
  | BigSmall !Integer !Int
  | SmallBig !Int !Integer
  | BigBig !Integer !Integer
```
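For a flavor like this, construction would presumably go through a smart constructor that checks whether each value fits in a machine `Int`. A sketch of that idea (`mkScientific` and `fitsInInt` are hypothetical names, not part of scientific or scientific-notation):

```haskell
-- Hypothetical smart constructor for the four-constructor flavor above:
-- it checks whether the coefficient and exponent each fit in a machine
-- Int and picks the constructor that keeps small values unboxed.
mkScientific :: Integer -> Integer -> Scientific
mkScientific coeff expo =
  case (fitsInInt coeff, fitsInInt expo) of
    (True,  True ) -> SmallSmall (fromIntegral coeff) (fromIntegral expo)
    (False, True ) -> BigSmall coeff (fromIntegral expo)
    (True,  False) -> SmallBig (fromIntegral coeff) expo
    (False, False) -> BigBig coeff expo
  where
    -- Hypothetical helper: does the Integer fit in the range of Int?
    fitsInInt :: Integer -> Bool
    fitsInInt n =
      n >= fromIntegral (minBound :: Int) && n <= fromIntegral (maxBound :: Int)
```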
It would also be possible to select just a couple of these options. In the scientific-notation library, I use this:
```haskell
data Scientific = Scientific
  {-# UNPACK #-} !Int -- coefficient
  {-# UNPACK #-} !Int -- base-10 exponent; minBound means use the unlimited-precision field
  LargeScientific

data LargeScientific = LargeScientific
  !Integer -- coefficient
  !Integer -- exponent
```
Which is like having only `SmallSmall` and `BigBig`.
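To make the sentinel explicit, here is a sketch (assuming the two types above; `toLarge` is a hypothetical name) of how an eliminator could decide which payload to trust:

```haskell
-- Sketch (assuming the types above): when the unpacked exponent equals
-- minBound, the two packed Int fields are treated as invalid and the
-- LargeScientific payload carries the real coefficient and exponent.
toLarge :: Scientific -> LargeScientific
toLarge (Scientific coeff expo large)
  | expo == minBound = large
  | otherwise        = LargeScientific (fromIntegral coeff) (fromIntegral expo)
```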