faster-cpython / ideas

1.69k stars 49 forks source link

Save a few bits in dicts #455

Closed matthiasgoergens closed 1 year ago

matthiasgoergens commented 2 years ago

At the moment, dicts vary what int datatype they use for their indices depending on their size.

As a minor complication, indices are stored as signed integers, and thus they shift to the next bigger int size at 1<<7, 1<<15 and 1<<31.

However, we can double all those boundaries, and thus save one, two or four bytes per index for sizes that fall between the old and revised boundaries.

Background

Indices in dicts are of type Py_ssize_t. Non-negative indices are interpreted to point to a place into DK_ENTRIES or DK_UNICODE_ENTRIES. Negative indices indicate one of four special conditions:

#define DKIX_EMPTY (-1)
#define DKIX_DUMMY (-2)  /* Used internally */
#define DKIX_ERROR (-3)
#define DKIX_KEY_CHANGED (-4) /* Used internally */

dk_indices stores these indices in the smallest int variant that fits all the values.

Optimization

The size of dk_indices is always a power of 2. But because we want to avoid hash collisions, the size of DK_ENTRIES is always a far bit smaller than that. In the current implementation, it's two-thirds of the size of dk_indices. But the details don't matter and might be subject to change. What matters is that we always have at least four unused bit patterns.

To proceed with an example for a single byte:

In a signed int8_t normally every bit pattern lexicographically at or above 0b01111111 counts as a negative number. Everything below is positive. That's what C does when casting from int8_t to Py_ssize_t.

However, the lowest negative number we need is only -4, which corresponds to 0b11111100. And the biggest positive index we need is 169 which corresponds 0b10101001. There's a big gap between both bit patterns, and we can programmatically detect which case we are in. Instead of adding more prose, let's look at the C code:

#define DKIX_LOWEST_RESERVED (-4) /* Used internally */
static inline Py_ssize_t
dictkeys_get_index(const PyDictKeysObject *keys, Py_ssize_t i)
{
    int log2size = DK_LOG_SIZE(keys);
    Py_ssize_t ix;

    if (log2size <= 8) {
        const uint8_t *indices = (const uint8_t*)(keys->dk_indices);
        const uint8_t uix = indices[i];
        if(uix >= (uint8_t)DKIX_LOWEST_RESERVED) {
            ix = (int8_t) uix;
        } else {
            ix = uix;
        }
    }

For comparison, the status quo looks like this:

static inline Py_ssize_t
dictkeys_get_index(const PyDictKeysObject *keys, Py_ssize_t i)
{
    int log2size = DK_LOG_SIZE(keys);
    Py_ssize_t ix;

    if (log2size < 8) {
        const int8_t *indices = (const int8_t*)(keys->dk_indices);
        ix = indices[i];
    }

I have a prototype and also ran some benchmarks. Here are two examples:

dict_bit_mem

In this first plot above, you can see that we sometimes save a bit of memory.

dict_bit

This plot shows that the difference in runtime is pretty much a wash.

So overall, a modest memory saving under certain circumstances, but at no cost in speed.

I'm happy to run specific benchmarks, if anyone has some ideas.

matthiasgoergens commented 2 years ago

Well, instead of mucking around with bits, we could also just add and subtract 4 (when storing / loading from the indices array). That would save some branching.

Alas, for some reason, I can't seem to make that work. I think there might be some other parts of the code that is relying on the exact bit-representation in dk_indices. My initial proposal has the benefit of not changing what any of these bits mean. Especially not the bit pattern for the special entries.

If we can fix that one, I think that just adding and subtracting would probably be easier and perhaps a tiny bit faster.

markshannon commented 1 year ago

Closing this issue, as it is now tracked in https://github.com/python/cpython/issues/96472