Closed by matthiasgoergens 1 year ago.
Well, instead of mucking around with bits, we could also just add and subtract 4 (when storing / loading from the indices array). That would save some branching.
Alas, for some reason, I can't seem to make that work. I think some other parts of the code might rely on the exact bit representation in `dk_indices`. My initial proposal has the benefit of not changing what any of these bits mean, in particular not the bit patterns for the special entries.
If we can fix that, I think that just adding and subtracting would probably be simpler and perhaps a tiny bit faster.
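A minimal sketch of the add-and-subtract-4 idea for the one-byte case, assuming the indices are kept in a `uint8_t` array and the four special values are -1 to -4. The helper names `store_ix8_biased` and `load_ix8_biased` are hypothetical, and `ptrdiff_t` stands in for `Py_ssize_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; ptrdiff_t stands in for Py_ssize_t. */
#define IX8_BIAS 4  /* number of special (negative) values: -1 .. -4 */

/* Store ix (in -4 .. 251) as ix + 4, so the stored byte is 0 .. 255. */
static void
store_ix8_biased(uint8_t *indices, size_t i, ptrdiff_t ix)
{
    assert(ix >= -IX8_BIAS && ix <= UINT8_MAX - IX8_BIAS);
    indices[i] = (uint8_t)(ix + IX8_BIAS);
}

/* Load without any branch: just undo the bias. */
static ptrdiff_t
load_ix8_biased(const uint8_t *indices, size_t i)
{
    return (ptrdiff_t)indices[i] - IX8_BIAS;
}
```

The load is a single subtraction, but the stored bit patterns change: -1 is stored as `3` instead of `0xFF`. Any code that reads or initialises the array byte-wise (for example by filling it with `0xFF` bytes to mean -1) would have to change along with it.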
Closing this issue, as it is now tracked in https://github.com/python/cpython/issues/96472
At the moment, dicts vary which `int` datatype they use for their indices depending on their size. As a minor complication, indices are stored as signed integers, and thus they shift to the next bigger `int` size at `1<<7`, `1<<15` and `1<<31`.

However, we can double all those boundaries, and thus save one, two or four bytes per index for sizes that fall between the old and revised boundaries.
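A minimal sketch of the boundary shift, assuming a 64-bit build, a table of `1 << log2size` slots, and a status quo that picks `int8_t` below `log2size` 8, `int16_t` below 16, `int32_t` below 32 and `int64_t` otherwise. Doubling the boundaries moves each cut-off up by one power of two. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed status quo: bytes per index for a table of 1 << log2size slots. */
static size_t
index_bytes_current(int log2size)
{
    if (log2size < 8)  return 1;   /* int8_t  */
    if (log2size < 16) return 2;   /* int16_t */
    if (log2size < 32) return 4;   /* int32_t */
    return 8;                      /* int64_t */
}

/* Proposal: every boundary doubles, i.e. moves up by one power of two. */
static size_t
index_bytes_proposed(int log2size)
{
    if (log2size <= 8)  return 1;
    if (log2size <= 16) return 2;
    if (log2size <= 32) return 4;
    return 8;
}

/* Total size of dk_indices in bytes for a given width policy. */
static size_t
indices_bytes(size_t (*width)(int), int log2size)
{
    /* Sizes up to 1 << 32 slots cover all three boundaries (64-bit size_t). */
    assert(log2size >= 0 && log2size <= 32);
    return ((size_t)1 << log2size) * width(log2size);
}
```

Under these assumed thresholds, a 256-slot table (`log2size == 8`) needs 256 bytes of indices instead of 512. Because table sizes are powers of two, only tables of exactly `1 << 8`, `1 << 16` and `1 << 32` slots change width.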
Background
Indices in dicts are of type `Py_ssize_t`. Non-negative indices point to a slot in `DK_ENTRIES` or `DK_UNICODE_ENTRIES`. Negative indices indicate one of four special conditions, encoded as `-1` through `-4`. `dk_indices` stores these indices in the smallest `int` variant that fits all the values.

Optimization
The size of `dk_indices` is always a power of 2. But because we want to avoid hash collisions, the size of `DK_ENTRIES` is always a fair bit smaller than that. In the current implementation, it's two-thirds of the size of `dk_indices`. But the details don't matter and might be subject to change. What matters is that at least four bit patterns are never used by a valid entry index, which leaves room for the four special values.

To proceed with an example for a single byte:
In a signed `int8_t`, every bit pattern lexicographically above `0b01111111` (that is, from `0b10000000` upwards) counts as a negative number, and everything below is non-negative. That's what C does when converting from `int8_t` to `Py_ssize_t`.

However, the lowest negative number we need is only `-4`, which corresponds to `0b11111100`. And the biggest positive index we need is 169 (with 256 slots and a two-thirds load limit, at most 170 entries are usable), which corresponds to `0b10101001`. There's a big gap between the two bit patterns, so we can programmatically detect which case we are in. Instead of adding more prose, let's look at the C code:

For comparison, the status quo looks like this:
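A minimal sketch of both variants for the one-byte case, assuming the four special values are -1 to -4 and `ptrdiff_t` stands in for `Py_ssize_t`. The names `load_ix8_signed`, `load_ix8_widened` and `store_ix8_widened` are hypothetical. The first mirrors the status quo (sign extension from `int8_t`); the other two treat the byte as unsigned and only map the top four patterns, `0b11111100` to `0b11111111`, back to -4 .. -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SPECIAL 4  /* special values -1 .. -4 */

/* Status quo: the byte is an int8_t; conversion to ptrdiff_t sign-extends,
   so 0b10000000 .. 0b11111111 all decode as negative numbers. */
static ptrdiff_t
load_ix8_signed(const int8_t *indices, size_t i)
{
    return (ptrdiff_t)indices[i];
}

/* Proposal: read the byte as unsigned. Only the four highest patterns
   (252 .. 255, i.e. 0b11111100 .. 0b11111111) are special; everything
   below is a plain non-negative index. */
static ptrdiff_t
load_ix8_widened(const uint8_t *indices, size_t i)
{
    ptrdiff_t u = indices[i];
    return u >= 256 - N_SPECIAL ? u - 256 : u;
}

/* Store counterpart: conversion to uint8_t wraps -4 .. -1 to 252 .. 255,
   so the special entries keep the same bit patterns as today. */
static void
store_ix8_widened(uint8_t *indices, size_t i, ptrdiff_t ix)
{
    assert(ix >= -N_SPECIAL && ix < 256 - N_SPECIAL);
    indices[i] = (uint8_t)ix;
}
```

With the signed load, the byte `0b10101001` would decode as -87 rather than 169, which is why a 256-slot table cannot use `int8_t` indices today.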
I have a prototype and also ran some benchmarks. Here are two examples:
In the first plot above, you can see that we sometimes save a bit of memory.
The second plot shows that the difference in runtime is pretty much a wash.
So overall: a modest memory saving under certain circumstances, at no cost in speed.
I'm happy to run specific benchmarks, if anyone has some ideas.