faster-cpython / ideas


More efficient implementation of integers #548

Open markshannon opened 1 year ago

markshannon commented 1 year ago

Moving this from the old discussion to here, as we aren't using discussions any more.

markshannon commented 1 year ago

As mentioned in https://github.com/python/cpython/issues/101265, we need to prevent deallocation of smallints.

To do that we need to mark small ints. We could also use this mark to avoid DECREFs in code using integers, as a slightly faster check than on the refcount as the value will already be in a register.

markshannon commented 1 year ago

Here's a possible layout with mark bit:

typedef struct {
    OBJECT_HEADER;
    intptr_t tagged_bits;
}  PyIntObject;

typedef struct {
    PyIntObject header;
    digit digits[1];
} PyLongObject;

Tagged bits

|      | Bits 63/31 -- 2      | Bit 1        | Bit 0 |
|------|----------------------|--------------|-------|
| Int  | Signed value         | Immortal bit | 0     |
| Long | Unsigned digit count | Sign         | 1     |

This allows us to perform the common operations efficiently.

| Operation | Expression |
|-----------|------------|
| C value of Int | `tagged_bits >> 2` (signed shift) |
| Sign of Long | `(tagged_bits & 2) - 1` |
| Digit count of Long | `tagged_bits >> 2` |
| Is small Int? | `(tagged_bits & 1) == 0` |
| Are both Ints (for binary ops)? | `((a->tagged_bits \| b->tagged_bits) & 1) == 0` |
| Is immortal? | `(tagged_bits & 3) == 2` |
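
A minimal sketch of these accessors in C (not CPython code; names are hypothetical), assuming the tagged_bits layout above:

```c
#include <stdint.h>

/* Hypothetical helpers for the tagged_bits layout above. */
static inline int is_int(intptr_t bits)         { return (bits & 1) == 0; }
static inline intptr_t int_value(intptr_t bits) { return bits >> 2; }  /* signed shift */
static inline intptr_t long_digit_count(intptr_t bits) {
    return (intptr_t)((uintptr_t)bits >> 2);    /* unsigned digit count */
}
static inline intptr_t long_sign(intptr_t bits) { return (bits & 2) - 1; }
static inline int both_ints(intptr_t a, intptr_t b) { return ((a | b) & 1) == 0; }
static inline int is_immortal(intptr_t bits)    { return (bits & 3) == 2; }  /* only ints are immortal in this layout */
```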

markshannon commented 1 year ago

One other operation that is a bit less common, but needs consideration is normalization of "longs" to "ints". This differs depending on the number of bits in the word.

For efficiency, it is probably OK to allow some values to be supported in both formats, and use a simple approximation in number of digits. However not all single digit ints fit in the tagged bits.

| Digits | 32-bit machine | 64-bit machine |
|--------|----------------|----------------|
| 0      | Zero           | Zero           |
| 1      | -2**29 to 2**29-1 fit in tagged bits | All fit |
| 2      | Do not fit     | All fit        |
| 3      | Do not fit     | A few fit, but it is more efficient to reject these during normalization |
| 4+     | Do not fit     | Do not fit     |

In other words, it is probably best not to have a normal form for values near the threshold, to allow efficient implementation.
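
A sketch of that approximation (hypothetical helper, not CPython code), assuming 30-bit digits and the tagged layout above:

```c
#include <stdint.h>

/* Decide, by digit count alone, whether a long could instead be stored
   in the compact tagged form.  Deliberately approximate, as described above. */
static int fits_compact_form(intptr_t ndigits)
{
#if INTPTR_MAX == INT64_MAX
    /* 64-bit: values with 0, 1 or 2 digits always fit; 3-digit values are
       rejected even though a few of them would fit. */
    return ndigits <= 2;
#else
    /* 32-bit: only zero is guaranteed to fit by digit count alone; single
       digit values need an explicit range check (-2**29 to 2**29-1). */
    return ndigits == 0;
#endif
}
```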

markshannon commented 1 year ago

Given that we don't need ridiculously large ints, we could drop a further bit to allow immortal "longs".

Tagged bits

|      | Bits 63/31 -- 3      | Bit 2                  | Bit 1        | Bit 0 |
|------|----------------------|------------------------|--------------|-------|
| Int  | Signed value         | Signed value (low bit) | Immortal bit | 0     |
| Long | Unsigned digit count | Immortal bit           | Sign         | 1     |
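
A sketch of how the immortality check might change under this extended layout (hypothetical helper, not CPython code):

```c
#include <stdint.h>

/* With the extra bit, both ints and longs can be immortal. */
static inline int is_immortal(intptr_t tagged_bits)
{
    if (tagged_bits & 1) {          /* long: immortal bit is bit 2 */
        return (tagged_bits & 4) != 0;
    }
    return (tagged_bits & 2) != 0;  /* int: immortal bit is bit 1 */
}
```
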
jneb commented 1 year ago

Just a thought. Would it help to tweak the bits so that you can find the sign and/or immortality without checking whether it is short or long? If you put the sign in the high bit for longs, you would get the sign check for free. A comparison could check the sign bits first and, if they differ, finish quickly. I think the best way to make this decision is to see what operations on int are performed in real life. Most will be addition of 1 to a small int, I'm pretty sure :-)

lpereira commented 1 year ago

Maybe it's also time to reconsider dropping 15 and 30 bit digits and go with 63-bit digits only instead?

jneb commented 1 year ago

(This remark was the result of me not realizing that it was about "digits" and not "numbers". Feel free to ignore.) You're joking, right? I am certainly not the only one who uses long integers more often than not. (Edited) I get it now. The size of the digits becomes 63 bits, not the numbers. Good idea, assuming that multiplication is doable on all platforms.

lpereira commented 1 year ago

I'm not joking! I spent quite a bit of time last year looking over longobject.c for ways to make it more efficient and/or use less memory.

For instance, I don't think that 15-bit digits are that useful anymore, with 32-bit platforms becoming less and less relevant every passing CPU generation; even with only a 30-bit digit implementation, the performance loss on those platforms would be mostly negligible compared to the cleanup of using a single digit width.

With time, moving from a 30-bit digit implementation to a 63-bit digit implementation should even improve performance a tad on 64-bit platforms -- by performing fewer iterations on most long operations and whatnot. It shouldn't increase memory usage significantly either; in most cases it should remain the same (e.g., a 960-bit long would use 32 30-bit digits or 16 63-bit digits, which is 128 bytes either way).

If we moved to 63-bit digits, we would not have something akin to twodigits or stwodigits, though, as there's no standard 128-bit integer type in C11. This would change how carry and borrow are calculated, but for the better: functions like __builtin_add_overflow() (and similar) that leverage the CPU's overflow/carry flags could be used instead, reducing some masking/shifting in the fast path for most operations.

Even if we decide that it's not yet time to get rid of 16- and 32-bit math in long integers, it might be an interesting exercise to at least remove the 16-bit math and keep only the 32-bit math. This would allow, for instance, using intrinsics such as _addcarry_u32() and _subborrow_u32(), on x86 platforms, and avoid shifting/masking. The throughput/latency of a single ADC/SBB is better than that of an AND and SHR combined, especially since there wouldn't be a dependency to stall the pipeline. Since we don't use the full width of a digit to more easily (and portably) detect overflows, this could also mean that it might be possible to have 32-bit digits, potentially slightly reducing memory usage as well.

Of course, this goes a bit into non-portable territory, but implementations of these functions can be written in either assembly for some major CPU arches (e.g. aarch64), or portable C as a fallback.
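
For illustration, a minimal sketch (GCC/Clang builtins, hypothetical helper, not longobject.c code) of carry propagation over full-width 64-bit limbs using __builtin_add_overflow instead of shift-and-mask:

```c
#include <stddef.h>
#include <stdint.h>

/* Add two little-endian limb arrays of equal length n into res,
   returning the carry out of the most significant limb. */
static unsigned add_limbs(const uint64_t *a, const uint64_t *b,
                          uint64_t *res, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum;
        unsigned c1 = __builtin_add_overflow(a[i], b[i], &sum);
        unsigned c2 = __builtin_add_overflow(sum, (uint64_t)carry, &sum);
        res[i] = sum;
        carry = c1 | c2;  /* at most one of c1/c2 can be set */
    }
    return carry;
}
```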

gvanrossum commented 1 year ago

(Maybe @jneb thought you were proposing to limit int size to 63 bits? If someone proposed this I'd also assume they were joking. :-)

lpereira commented 1 year ago

> (Maybe @jneb thought you were proposing to limit int size to 63 bits? If someone proposed this I'd also assume they were joking. :-)

That makes a whole lot more sense, and I agree :D

markshannon commented 1 year ago

Speeding up arithmetic.

Python int arithmetic is very slow. We need to do a lot of work, checking the size and shape of ints before adding them, and then yet more work boxing them.

We want to minimize the amount of work we do. Many arithmetic expressions consume temporary values which we can reuse, if we discard "small ints" and use information that the compiler can gather to determine whether an operation is one of the forms:

* `x = temporary_value + ...` (the left operand is a temporary)
* `x = x + ...` or `x += ...` (the left operand is the variable being assigned to)

In the first case, the refcount of `temporary_value` is likely to be 1, and we can reuse the object. In the second case, if the refcount of `x` is 2 (one reference from the variable, one from the evaluation stack), which is likely, we can reuse the object.

The above is impossible with "small ints" because if the result is a small int, we need to use the immortal small object and free the temporary.

We can still have small ints. They are useful in lots of places, but we need to get rid of the requirement that we must use them, i.e. that `(10 + 10) is 20`.

What the specialized form would look like

We want the specialization of `+` for ints, BINARY_OP_ADD_INT_REUSE_LEFT, to be fully inlined and to do no allocation when the lhs has the expected refcount. Something like:

        inst(BINARY_OP_ADD_INT_REUSE_LEFT, (unused/1, left, right -- sum)) {
            assert(cframe.use_tracing == 0);
            DEOPT_IF(!PyLong_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            DEOPT_IF(Py_REFCNT(left) != 1, BINARY_OP); /* Or two for the x += ... case */
            DEOPT_IF((left->tagged_bits | right->tagged_bits) & 3, BINARY_OP); /* Both are mortal ints, not longs */
            DEOPT_IF(add_overflows(left->tagged_bits, right->tagged_bits), BINARY_OP);
            STAT_INC(BINARY_OP, hit);
            /* Both low tag bits are zero, so the sum is already a correctly tagged mortal int */
            left->tagged_bits = left->tagged_bits + right->tagged_bits;
            sum = left; /* Reuse the left operand as the result */
            _Py_DECREF_SPECIALIZED(right, (destructor)PyObject_Free);
        }

We attempted this before, but the code gets messy and very branchy when handling all the special cases due to small ints and refcounts.

We will want a bit of help from the compiler to mark which binary ops are of the form `x += ...` or `x = x + ...`; that way we will know whether to check for a refcount of 1 or 2 at specialization time and at runtime.

Prototyping on floats

We can implement the reference-counting specializations first for floats, which don't have to worry about overflow or immortal values. This should tell us whether we can reuse objects often enough for this sort of refcounting optimization to be beneficial.
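
A sketch of what the float prototype could look like, in the same DSL as the int example above (instruction name and details are hypothetical):

```c
inst(BINARY_OP_ADD_FLOAT_REUSE_LEFT, (unused/1, left, right -- sum)) {
    assert(cframe.use_tracing == 0);
    DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    DEOPT_IF(Py_REFCNT(left) != 1, BINARY_OP); /* Or two for the x += ... case */
    STAT_INC(BINARY_OP, hit);
    /* No overflow or immortality to worry about: mutate the left operand in place */
    ((PyFloatObject *)left)->ob_fval += PyFloat_AS_DOUBLE(right);
    sum = left;
    Py_DECREF(right);
}
```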

markshannon commented 1 year ago

CPython issue: https://github.com/python/cpython/issues/101291

cdimauro commented 1 year ago

For a slightly better representation of longs you might consider this: https://ep2011.europython.eu/conference/talks/hacking-pylongobject-on-python-32.html The idea was/is to have the common "small" ints represented as usual, in two's complement (with further optimizations possible in the future, given this new representation). The suggestion was to use 30 bits (which was done recently in CPython) as the standard digit size and 60 bits on 64-bit machines.

The above work can be combined with the introduction of "direct" ints, which I made in WPython 1.1: https://code.google.com/archive/p/wpython2/downloads WPython 1.1 already addressed the removal of some DECREFs.

igorgatis commented 6 months ago

For the sake of newcomers (like me), what is the context of immortal ints? Are they known in advance (e.g. at compile time)?

Incidentally, I was wondering whether immortality (for any object) could be stored outside the integer bits, keeping them as close to the C implementation as possible.

I see two ways to store immortality status outside of the object:

  1. Using the "unused" lowest bits of the PyObject* pointer. Given the size of PyObject, perhaps we can guarantee that it will always be at least 2-byte aligned, sparing the lowest bit, where 0 means mortal and 1 means immortal. The major downside is that we would need to mask the pointer before dereferencing its fields.
  2. Using a dedicated memory region for immortal objects. This approach is safer than the one above and the check is as simple as MIN_IMMORTAL_ADDR <= addr < MAX_IMMORTAL_ADDR. But it really depends on the nature of those immortal objects.

Perhaps the pointer parity trick could be used for long integers sign as well.

Does it make sense?
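
A minimal sketch of option 1 (pointer tagging), for illustration only; the names are hypothetical and not from CPython:

```c
#include <stdint.h>

#define IMMORTAL_TAG ((uintptr_t)1)   /* lowest bit of an aligned pointer */

/* Check the tag without touching the object. */
static inline int ptr_is_immortal(void *p) {
    return ((uintptr_t)p & IMMORTAL_TAG) != 0;
}

/* Strip the tag before dereferencing any fields. */
static inline void *ptr_untag(void *p) {
    return (void *)((uintptr_t)p & ~IMMORTAL_TAG);
}
```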