capi-workgroup / decisions

Discussion and voting on specific issues

PEP 757: Add import-export API for Python int objects #35

Closed: vstinner closed this 1 month ago

vstinner commented 4 months ago

See PEP 757 – C API to import-export Python integers.

skirpichev commented 4 months ago

To play with:

Later I'll show some simple benchmarks, but first please note that the current gmpy2 code special-cases "small" integers to make performance acceptable. I think it's a common pattern and should be supported in the second case as well (the current code in CPython main has a special case for 1-digit ints).

That's why I would prefer a simpler, less generic API, like the one proposed in #31. E.g. for reading:

if (PyLong_IsCompact(obj)) { /* should be a quick check */
    mpz_set_si(z, PyLong_CompactValue(obj));  /* compact value fits in some integer type */
}
else {
    /* fallback to "array of digits" view for abs(obj) */
    const Py_digit *digits = PyLong_GetDigits(obj);  /* can't fail for non-compact ints */
    mpz_import(z, ..., digits);
    if (PyLong_IsNegative(obj)) { /* looks better than PyLong_GetSign() ;) */
        mpz_neg(z, z);
    } 
}

(In this example, for simplicity, I assume that compact value fits in long.)
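The two-path read above can be modelled in pure Python. `to_digits` and `from_digits` are illustrative helpers, not part of any proposed API; the digit size is taken from sys.int_info:

```python
import sys

BITS = sys.int_info.bits_per_digit  # 15 or 30, depending on the build

def to_digits(n):
    # "Array of digits" view of abs(n), least significant digit first,
    # as the non-compact fallback path would see it.
    n = abs(n)
    digits = []
    while n:
        digits.append(n & ((1 << BITS) - 1))
        n >>= BITS
    return digits or [0]

def from_digits(digits, negative=False):
    # Recompose the value, as mpz_import() followed by mpz_neg() would.
    n = sum(d << (BITS * i) for i, d in enumerate(digits))
    return -n if negative else n

x = 1 << 300
assert from_digits(to_digits(x)) == x
assert from_digits(to_digits(-x), negative=True) == -x
```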

skirpichev commented 4 months ago

Benchmarks (one, two, ~ten, ~hundred digits) for gmpy2.

For export

On the master:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 233 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 295 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 491 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.33 us +- 0.01 us

With new API:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 37 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 315 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 508 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.34 us +- 0.01 us

Without special case for 1-digit:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 291 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 316 ns +- 1 ns

Using PyUnstable_Long_IsCompact (as in the code proposed above):

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 215 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 317 ns +- 2 ns

The new API's overhead seems to be noticeable for the 1-2 digit cases (up to 10-15%).

For import

On the master:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<7)' 'int(x)'
Mean +- std dev: 111 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<38)' 'int(x)'
Mean +- std dev: 155 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<300)' 'int(x)'
Mean +- std dev: 473 ns +- 6 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<3000)' 'int(x)'
Mean +- std dev: 2.20 us +- 0.00 us

With new API:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<7)' 'int(x)'
Mean +- std dev: 111 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<38)' 'int(x)'
Mean +- std dev: 155 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<300)' 'int(x)'
Mean +- std dev: 512 ns +- 5 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<3000)' 'int(x)'
Mean +- std dev: 2.25 us +- 0.02 us

zooba commented 4 months ago

If PyLong_GetDigits returned the layout information with its call, then it could return ndigits=sizeof(ssize_t); bits_per_digit=8; digits=&compact_value for a compact value, and you shouldn't need the additional checks. The layout could be a pointer to a static struct easily enough, so it's not costly to return, though I prefer the stability of returning a pointer that requires a "free" call of some sort.

Otherwise, if we were to document this function as pairing with PyUnstable_Long_IsCompact then it probably should also be unstable. But I think it's better to leave that optimisation unspecified, if it's not needed for correctness.

PyAPI_DATA(const PyLongLayout) PyLong_LAYOUT;

Please, no more exported data. Make it a function that either returns a pointer to the layout, a function that copies the layout values into a struct, or integrate it with GetDigits as I suggested so that we can support multiple layouts.

// Array endian:
// * 1 for most significant byte first (big endian)
// * -1 for least significant first (little endian)
int8_t array_endian;

Does this mean "most significant digit first"? Or "most significant byte of each digit first"? Or both?

As we'd now use Py_digit in the ABI, is there a need to also provide its size via a function? Or can we rely on callers understanding how their C ABI works? (Or can we drop Py_digit from the ABI and just work with void * instead? In which case we must provide the digit size some other way.)

skirpichev commented 4 months ago

Does this mean "most significant digit first"? Or "most significant byte of each digit first"? Or both?

No, it's about endianness in the digit. We have word_endian to represent the order of digits.

Probably the naming could be better, e.g. endian instead of array_endian and/or digits_order instead of word_endian. It seems CPython consistently uses the term "digit" instead of "word" (as in libmpdec) or "limb" (as in GMP).

Or can we drop Py_digit from the ABI and just work with void * instead?

That seems to be a good idea. PyLongLayout has digit_size field.
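The two notions can be sketched in pure Python. The values below are purely illustrative; the two-digit example assumes CPython's 30-bit digits:

```python
import sys

# A single 30-bit digit value, stored in a 4-byte C "digit" type.
d = (1 << 29) | 1

# "endian" (array_endian in the draft) is the byte order *inside* each
# digit: for CPython this is simply the machine's native order.
raw = d.to_bytes(4, sys.byteorder)
assert int.from_bytes(raw, sys.byteorder) == d

# "digits_order" (word_endian) is the order of digits in the array:
# CPython stores the least significant digit first.
value = (3 << 30) | 5                # two 30-bit digits
digits = [value & ((1 << 30) - 1), value >> 30]
assert digits == [5, 3]              # least significant digit comes first
```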

encukou commented 4 months ago

I agree with @zooba, and I still stand by my earlier comment. I don't want to lock CPython -- and any future C API implementors -- into using a single layout for all ints.

vstinner commented 4 months ago

Do you mean adding a layout member to PyLong_DigitArray (export) and a layout parameter to PyLongWriter_Create() (import)?

typedef struct PyLong_DigitArray {
    ...
    const PyLongLayout *layout;  // <== NEW
} PyLong_DigitArray;

PyAPI_FUNC(int) PyLong_AsDigitArray(
    PyObject *obj,
    PyLong_DigitArray *array);

PyAPI_FUNC(PyLongWriter*) PyLongWriter_Create(
    ...
    const PyLongLayout *layout);  // <== NEW

skirpichev commented 4 months ago

I don't want to lock CPython -- and any future C API implementors -- into using a single layout for all ints.

JFR, I don't think this locks CPython in any way: this single layout is an asymptotic one (for "big" ints). The API has enough freedom to emulate this view for small integers.

@zooba's suggestion might work with an API like this (reading case):

typedef struct PyLongLayout {
    uint8_t bits_per_digit;
    uint8_t digit_sizeof;
    int8_t digits_order;
    int8_t digit_endianness;
    Py_ssize_t ndigits;
} PyLongLayout;

/* Returns a pointer to the digit array (or NULL, if obj is not a PyLongObject)
   and sets layout, which might differ wrt bits_per_digit and digit_sizeof for
   "small" and "big" ints. */
PyAPI_FUNC(const void *) PyLong_AsDigits(PyObject *obj, PyLongLayout *layout);

vstinner commented 4 months ago

The API has enough freedom to emulate this view for small integers.

Currently, exporting a small integer as an array of digits is an O(1) operation. There is no need to allocate memory or anything; all integers are implemented internally as an array of digits.

encukou commented 4 months ago

OK. Let's design API for what we have. If we need additional layouts, that kind of API can be added in the future, and if/when we add it, the API added here will start copying data. And if a non-CPython implementation uses a more elaborate scheme, it'll need to copy data too. I could live with that. Let's design it that way then.

PyLongLayout should be named PyLong_DigitArrayLayout, since the actual internal layout might change in the future.

Same with the writer: PyLong_DigitArrayWriter_Create & PyLong_DigitArrayWriter_Finish.

We also need a PyLong_DigitArrayWriter_Discard. Finish is trivial now, but if it can do a memory copy in the future, users should call Discard in error paths.

We don't need endianness in the struct. I think something like a new medium-int representation (which would need a new API), is much more likely than switching to non-native endianness. (Endian flags make sense if you want to serialize data, and can ask for a particular format, but the in-memory format can just always be native.)


As for avoiding Py_DATA, I filed https://github.com/capi-workgroup/problems/issues/80. I'm OK with a PyLong_GetDigitArrayLayout function rather than a PyLong_LAYOUT constant (or maybe in addition to a PyLong_DIGIT_ARRAY_LAYOUT constant?) but it's not clear to me why it's needed.


@skirpichev said:

I would prefer more simple and less generic API, like proposed in https://github.com/capi-workgroup/decisions/issues/31. E.g. for reading:

  if (PyLong_IsCompact(obj)) { /* should be a quick check */
      mpz_set_si(z, PyLong_CompactValue(obj));  /* compact value fits in some integer type */
  }
  else {
      /* fallback to "array of digits" view for abs(obj) */
      const Py_digit *digits = PyLong_GetDigits(obj);  /* can't fail for non-compact ints */
      mpz_import(z, ..., digits);
      if (PyLong_IsNegative(obj)) { /* looks better than PyLong_GetSign() ;) */
          mpz_neg(z, z);
      } 
  }

Would it work if you replace PyLong_IsCompact and PyLong_CompactValue with PyLong_AsNativeBytes and a digit_size buffer? That will do extra work for long ints, but the amount of extra work might be small enough (and offset by only calling one function). If it's not fast enough, maybe we should add a Py_ASNATIVEBYTES_COMPACT_ONLY flag, which would make PyLong_AsNativeBytes leave the buffer untouched and return 0 when you should use PyLong_AsDigitArray.
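A rough pure-Python model of that contract: `as_native_bytes` is a hypothetical helper standing in for PyLong_AsNativeBytes, which reports how many bytes the value needs (more than the buffer size means it did not fit and the digit-array export should be used). The byte count here is an approximation, not the real function's exact result:

```python
import sys

def as_native_bytes(obj, n):
    # Rough model: bytes needed for a signed native representation
    # (+1 sign bit, rounded up to whole bytes).
    needed = (obj.bit_length() + 8) // 8
    if needed <= n:
        # Fast path: the value fits in the caller's n-byte buffer.
        return needed, obj.to_bytes(n, sys.byteorder, signed=True)
    # Slow path: caller should fall back to the digit-array export.
    return needed, None

needed, buf = as_native_bytes(1 << 38, 8)
assert needed <= 8 and int.from_bytes(buf, sys.byteorder, signed=True) == 1 << 38
needed, buf = as_native_bytes(1 << 300, 8)
assert needed > 8 and buf is None
```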


Looking at the above example: most callers will need the sign bit, so PyLong_AsDigitArray should return 1 for negative numbers. (Or it could get an int *sign output argument like with PyLong_GetSign, but distinguishing 0 from positives doesn't sound too useful here.)

skirpichev commented 4 months ago

We don't need endianness in the struct.

You meant word_endian too?

Would it work if you replace PyLong_IsCompact and PyLong_CompactValue with PyLong_AsNativeBytes and a digit_size buffer?

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'  # huh
Mean +- std dev: 350 ns +- 2 ns
patch wrt the gmpy2 test branch:

```diff
diff --git a/src/gmpy2_convert_gmp.c b/src/gmpy2_convert_gmp.c
index 34f2546..e732080 100644
--- a/src/gmpy2_convert_gmp.c
+++ b/src/gmpy2_convert_gmp.c
@@ -44,24 +44,25 @@
 static void
 mpz_set_PyLong(mpz_t z, PyObject *obj)
 {
-    static PyLong_DigitArray long_export;
-
-    PyLong_AsDigitArray(obj, &long_export);
-    if (long_export.ndigits == 1) {
-        mpz_set_si(z, long_export.digits[0]);
+    long value;
+    Py_ssize_t bytes = PyLong_AsNativeBytes(obj, &value, sizeof(long), -1);
+    if (0 <= bytes && bytes <= (Py_ssize_t)sizeof(long)) {
+        mpz_set_si(z, value);
     }
     else {
+        static PyLong_DigitArray long_export;
+        PyLong_AsDigitArray(obj, &long_export);
         mpz_import(z, long_export.ndigits,
                    PyLong_LAYOUT.array_endian,
                    PyLong_LAYOUT.digit_size,
                    PyLong_LAYOUT.word_endian,
                    PyLong_LAYOUT.digit_size*8 - PyLong_LAYOUT.bits_per_digit,
                    long_export.digits);
+        if (long_export.negative) {
+            mpz_neg(z, z);
+        }
+        PyLong_FreeDigitArray(&long_export);
     }
-    if (long_export.negative) {
-        mpz_neg(z, z);
-    }
-    PyLong_FreeDigitArray(&long_export);
 }
```

Looking at the above example: most callers will need the sign bit

Not necessarily: this view rather represents the absolute value.

encukou commented 4 months ago

You meant word_endian too?

Yes. I see no reason to single that out as a detail that might change.

Mean +- std dev: 263 ns +- 2 ns
Mean +- std dev: 350 ns +- 2 ns

Hm, I guess that's not good enough? Does the Py_ASNATIVEBYTES_COMPACT_ONLY idea sound reasonable?

most callers will need the sign bit

I meant that instead of

        int result = PyLong_AsDigitArray(obj, &long_export);
        if (result < 0) goto error;
        mpz_import(...);
        if (long_export.negative) {
            mpz_neg(z, z);
        }
        PyLong_FreeDigitArray(&long_export);

you'd do

        int sign = PyLong_AsDigitArray(obj, &long_export);
        if (sign < 0) goto error;
        mpz_import(...);
        if (sign) {
            mpz_neg(z, z);
        }
        PyLong_FreeDigitArray(&long_export);

skirpichev commented 2 months ago

Hm, I guess that's not good enough?

Yes, especially the second case. You can see other benchmarks above.

Does the Py_ASNATIVEBYTES_COMPACT_ONLY idea sound reasonable?

Yes, if PyLong_AsNativeBytes can do a quick exit - that should be an alternative to PyLong_IsCompact/CompactValue.

int sign = PyLong_AsDigitArray(obj, &long_export);

This looks like a minor convenience to me. I think that to (optionally) query the sign, people should use dedicated functions (like PyLong_IsNegative()).

encukou commented 4 months ago

Oh, I forgot we already have PyUnstable_Long_IsCompact exposed. That's the check a Py_ASNATIVEBYTES_COMPACT_ONLY would do in the current implementation.

If you're chasing nanoseconds, I don't think you want to compile for the stable ABI, and I think you will want to change your code if the internals change. So, PyUnstable looks OK.

(That said, consider having the optimizations in #ifndef Py_LIMITED_API, and releasing stable-ABI wheels alongside the version-specific fast ones. Testers of CPython alpha releases would thank you, and hopefully, eventually, so would PyPy users.)

vstinner commented 3 months ago

@skirpichev:

Probably naming could be better. E.g. endian instead of array_endian and/or digits_order instead of word_endian. It seems, CPython consistently uses term "digit" instead of "word" (as libmpdec) or "limb" (as GMP).

I renamed word_endian to digits_order and array_endian to endian.

@encukou:

I agree with @zooba, and I still stand by my https://github.com/capi-workgroup/decisions/issues/31#issuecomment-2188269464. I don't want to lock CPython -- and any future C API implementors -- into using a single layout for all ints.

Ok, I modified the API to add layout parameters to import and export APIs.

vstinner commented 3 months ago

@skirpichev:

This looks as a minor convenience for me. I think to query (optionally) sign people should use dedicated functions (like PyLong_IsNegative()).

The function PyLong_IsNegative() doesn't exist. Or maybe you're thinking of a different function?

skirpichev commented 3 months ago

The function PyLong_IsNegative() doesn't exist.

Yes, but see https://github.com/capi-workgroup/decisions/issues/29

zooba commented 3 months ago

Mean +- std dev: 263 ns +- 2 ns
Mean +- std dev: 350 ns +- 2 ns

Hm, I guess that's not good enough?

If I'm reading the earlier benchmarks correctly, then we'd have to beat 111ns and 155ns for these, which is clearly only going to be possible by leaking access to Long internals. So it's fine for when using unstable APIs, and the slower path is probably fine for stable callers (it's still faster than failing ;) ). The difference is largely unavoidable function call overhead, a type check, and the (potentially unaligned) memcpy. I daresay we could add a faster path through AsNativeBytes for an aligned destination of sizeof(Py_ssize_t), but it's still unlikely to approach the speed of the macros.

But that's okay, blazing speed is what the unstable API is for, and those who need it can at least opt into using it at their own cost.

I do think that anything returning somewhat raw digits from a LongObject should also return the sign. One call to get all the info to accurately recreate the value, and you only need additional checks if you're creating your own fast path.

vstinner commented 2 months ago

(Or can we drop Py_digit from the ABI and just work with void * instead? In which case we must provide the digit size some other way.)

Ok, done.

We also need a PyLong_DigitArrayWriter_Discard. Finish is trivial now, but if it can do a memory copy in the future, users should call Discard in error paths.

I would prefer to avoid adding such a function for now; we can add it later if needed. What do you think?

vstinner commented 2 months ago

@skirpichev @zooba @encukou: I updated the API. What do you think? Are we good yet?

zooba commented 2 months ago

Is there any impact going to come out of the discussions of setting a hard (if theoretical) limit on the length?

vstinner commented 2 months ago

Is there any impact going to come out of the discussions of setting a hard (if theoretical) limit on the length?

@serhiy-storchaka may know about that :-) This API is limited by Py_ssize_t ndigits. If the number of digits fits into a signed size_t, we are good.

The PR https://github.com/python/cpython/pull/123498 defines # define MAX_LONG_DIGITS ((UINT64_MAX-1) / (uint64_t)PyLong_SHIFT), which is unsigned. If we want to be compatible with the gh-123498 maximum, the API should be modified to use an unsigned number of digits. But well, INT64_MAX/PyLong_SHIFT (signed) digits is already a very big number :-)

encukou commented 2 months ago

I would prefer to avoid adding such function for now, we can add it later if needed. What do you think?

So, to destroy the writer (e.g. in an error handling path), the user should call PyLongWriter_Finish and XDECREF the result?

That works, but I'd much prefer writers to be consistent and have both Finish and Discard.

skirpichev commented 2 months ago

If I'm reading the earlier benchmarks correctly, then we'd have to beat 111ns and 155ns for these

Sorry for the late answer @zooba, but no. Those timings are for the writing API (mpz->int). The benchmarks above are for reading (int->mpz). I'll repeat them all in one place:

For export

On the master:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 233 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 295 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 491 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.33 us +- 0.01 us

With new API (see https://github.com/aleaxit/gmpy/pull/495):

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 37 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 315 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 508 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.34 us +- 0.01 us

Without special case for 1-digit:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 291 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 316 ns +- 1 ns

Using PyUnstable_Long_IsCompact (as in the code proposed above):

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 215 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 317 ns +- 2 ns

"if you replace PyLong_IsCompact and PyLong_CompactValue with PyLong_AsNativeBytes and a digit_size buffer":

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'  # huh
Mean +- std dev: 350 ns +- 2 ns

So, using a quick check (currently PyUnstable_Long_IsCompact) to fall back on specialized functions for small integers, and PyLong_GetDigits() otherwise (only one new function for reading digits), has better performance than the current API.

Maybe you are right and we should introduce a bunch of simple PyUnstable_* functions instead of this more complicated API, which has performance regressions.

clearly only going to be possible by leaking access to Long internals

I doubt that using a single layout for "big enough" integers leaks something important. All MP libraries share that detail; CPython will hardly break this in the future.

I updated the API. What do you think? Are we good yet?

I like the changes.

vstinner commented 2 months ago

The PR https://github.com/python/cpython/pull/123498 defines # define MAX_LONG_DIGITS ((UINT64_MAX-1) / (uint64_t)PyLong_SHIFT) which is unsigned. If we want to be compatible with gh-123498 maximum, the API should be modified to use unsigned digits number.

I'm wrong. MAX_LONG_DIGITS fits into a signed size_t (Py_ssize_t), since it's a number of digits, not a number of bits: the / (uint64_t)PyLong_SHIFT division makes MAX_LONG_DIGITS smaller than PY_SSIZE_T_MAX.
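That bound can be checked with a line of arithmetic; the sketch below covers both PyLong_SHIFT values (15 for 16-bit-digit builds, 30 for 32-bit-digit builds):

```python
# MAX_LONG_DIGITS = (UINT64_MAX - 1) / PyLong_SHIFT must fit in Py_ssize_t.
UINT64_MAX = 2**64 - 1
PY_SSIZE_T_MAX = 2**63 - 1

for shift in (15, 30):          # PyLong_SHIFT for the two digit-size builds
    max_long_digits = (UINT64_MAX - 1) // shift
    # Dividing by the shift pulls the value well below PY_SSIZE_T_MAX.
    assert max_long_digits < PY_SSIZE_T_MAX
```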

vstinner commented 2 months ago

@encukou:

So, to destroy the writer (e.g. in an error handling path), the user should call PyLongWriter_Finish and XDECREF the result?

You can prepare your data in advance before creating a writer, so later, your code cannot fail.

That works, but I'd much prefer writers to consistently and have both Finish and Discard.

Ok, I added a PyLongWriter_Discard() function.

vstinner commented 2 months ago

I open a vote on this API:

skirpichev commented 2 months ago

I would appreciate @casevh's opinion, but so far this API has a severe performance degradation for reading small integers (int->mpz conversion), up to 25%. That's hardly acceptable. So projects like gmpy2, sage or flint-python will probably use some other code path for such integers.

I'm wrong. MAX_LONG_DIGITS fits into signed size_t (Py_ssize_t), since it's a number of digits, not a number of bits

Yet maybe we should use size_t here, like GMP? A signed integer type looks odd in this context.

zooba commented 2 months ago

I doubt that using single layout for "big enough" integers leaks something important. All MP libraries share that detail, hardly CPython will break this in a future.

We want to preserve code that has already been compiled, which means if we change the size of the digits but we already leaked that into other compiled code (i.e. hardcoded from a macro), then we've broken it. The whole point of the Unstable APIs is to give us more opportunities to make these kinds of changes - otherwise we'd be looking at 3-5 years/releases if we decide that a bigger/smaller digit size is better for everyone.

The stable digit size is 8 bits, and personally I'm quite okay with the performance hit for that stability. But if we are also to guarantee a second digit size with the same stability then we either risk adding a performance hit later (if we now have to start converting digits) or get stuck never gaining anything because we aren't able to change the internal representation. So the best you're going to get here is "fast+unstable (recompile each release)" or "fast+dynamic digit size".

I open a vote on this API:

Only thing I want to change here is the const PyLongLayout *layout argument to PyLongWriter_Create. I'm pretty sure this is a YAGNI situation, and we should just say "call PyLong_GetNativeLayout to get the layout info you need for PyLongWriter_Create" and not allow passing the info in (I certainly don't want to maintain the conversion code - that can go live in gmp and friends with the more complex export functions). I'm happy enough with the rest, though will defer the "is it satisfactory" question to @skirpichev.

casevh commented 2 months ago

My apologies for not following this thread as much as I would have liked. I'll also defer to @skirpichev for the "is it satisfactory" question.

I started optimizing gmpy around version 1.10. The single biggest win was using mpz_import/mpz_export directly into the PyLong internals. The only change to the code was support for 30-bit digits. The same code worked from Python 2.5 through Python 3.11. But this did require version specific builds.

The second win was using the PyLong_AsLongAndOverflow (and its variants) functions for conversion of Python ints into temporary C ints for direct use in many of the GMP function calls. Optimizing the performance of (mpz * 3) + 1 was more important than optimizing mpz(3) and mpz(1) because most of my code did mixed arithmetic. Note: the performance win was primarily from avoiding the creation of an exception.

As long as access to an unstable interface remains, I accept that the stable interface may be slightly slower. If the difference is significant, we will need to release stable and unstable versions.

Thanks for all the effort.

skirpichev commented 2 months ago

if we change the size of the digits but we already leaked that into other compiled code (i.e. hardcoded from a macro), then we've broken it.

Ah, the PyLong_GetDigits() declaration has Py_digit*... Well, this can be void*. To properly interpret this output you need something like PyLong_GetNativeLayout(). I think this preserves the freedom to change the digit size, just as in the API proposed by Victor.

Only thing I want to change here is the const PyLongLayout *layout argument to PyLongWriter_Create.

+1

I'm also not sure if we should keep the layout field in the PyLong_DigitArray struct. What else could it be, if not the native layout?

BTW, all the required information about the native layout is already available via PyLong_GetInfo(). PyLongLayout additionally has endian and digits_order fields, but I doubt those will change in the future. So PyLong_GetInfo() could essentially replace PyLong_GetNativeLayout().
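What PyLong_GetInfo() already exposes can be seen from its Python-level counterpart, sys.int_info (a sketch; the field checks reflect current CPython builds):

```python
import sys

info = sys.int_info

# Digit layout data already exposed today: digit size in bits and bytes.
assert info.bits_per_digit in (15, 30)
assert info.sizeof_digit in (2, 4)

# Each digit leaves some bits of its C type unused (16-15=1 or 32-30=2),
# which is what mpz_import()'s "nails" argument accounts for.
assert info.sizeof_digit * 8 - info.bits_per_digit in (1, 2)

# What it lacks, and what PyLongLayout would add explicitly:
assert not hasattr(info, "endian")
assert not hasattr(info, "digits_order")
```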

I accept that the stable interface may be slightly slower. If the difference is significant, we will need to release stable and unstable versions.

@casevh, here are some benchmarks. Does this look acceptable for you?

casevh commented 2 months ago

@casevh, https://github.com/capi-workgroup/decisions/issues/35#issuecomment-2328216925 are some benchmarks. Does this look acceptable for you?

Have you tried PyLong_AsLongAndOverflow instead of PyUnstable_Long_IsCompact? It might not be as fast but it is stable (I think).

skirpichev commented 2 months ago

Have you tried PyLong_AsLongAndOverflow instead of PyUnstable_Long_IsCompact?

Yes, I did, just didn't include this case in the benchmarks above. It's good for 1-2 digits and for many digits, but for ~10 digits performance is negatively impacted.

benchmarks code:

```python
# r.py; reading
import pyperf
from gmpy2 import mpz

runner = pyperf.Runner()
runner.bench_func('1<<7', mpz, 1 << 7)
runner.bench_func('1<<38', mpz, 1 << 38)
runner.bench_func('1<<300', mpz, 1 << 300)
runner.bench_func('1<<3000', mpz, 1 << 3000)
```

```python
# w.py; writing
import pyperf
from gmpy2 import mpz

runner = pyperf.Runner()
runner.bench_func('1<<7', int, mpz(1 << 7))
runner.bench_func('1<<38', int, mpz(1 << 38))
runner.bench_func('1<<300', int, mpz(1 << 300))
runner.bench_func('1<<3000', int, mpz(1 << 3000))
```
Benchmark results for reading (now my PR uses PyLong_AsLongAndOverflow):

| Benchmark | master | new-api |
|-----------|--------|---------|
| 1<<7 | 263 ns | 268 ns: 1.02x slower |
| 1<<38 | 317 ns | 276 ns: 1.15x faster |
| 1<<300 | 514 ns | 588 ns: 1.14x slower |
| 1<<3000 | 2.35 us | 2.42 us: 1.03x slower |
| Geometric mean | (ref) | 1.01x slower |

Hmm, maybe it's a good tradeoff.

JFR, benchmark results for writing:

| Benchmark | master | new-api |
|-----------|--------|---------|
| 1<<38 | 213 ns | 210 ns: 1.02x faster |
| 1<<300 | 535 ns | 575 ns: 1.07x slower |
| 1<<3000 | 2.27 us | 2.32 us: 1.02x slower |
| Geometric mean | (ref) | 1.02x slower |

Benchmark hidden because not significant (1): 1<<7

encukou commented 2 months ago

Only thing I want to change here is the const PyLongLayout *layout argument to PyLongWriter_Create. I'm pretty sure this is a YAGNI situation

Yes, later we can add PyLongWriter_CreateWithLayout with an extra argument.

vstinner commented 2 months ago

Ok, I updated the API:

vstinner commented 2 months ago

@skirpichev: IMO it's fine to rely on PyUnstable_Long_IsCompact() for best performance. Even if it has "unstable" in its name, I expect the API not to change in the next releases, even if the implementation might change in the future. Well, or you can use PyLong_AsLongAndOverflow(), as you prefer.

encukou commented 2 months ago

Use an unsigned type (size_t) for ndigits.

AFAIK, Python uses Py_ssize_t for counting bytes in nearly all existing APIs. I think we should continue doing that, even if GMP uses size_t.

vstinner commented 2 months ago

@encukou:

AFAIK, Python uses Py_ssize_t for counting bytes in nearly all existing APIs. I think we should continue doing that, even if GMP uses size_t.

@skirpichev:

Yet maybe we should use size_t here, like GMP? Signed integer type looks odd in this context.

@skirpichev: are you ok to use signed Py_ssize_t?

skirpichev commented 2 months ago

are you ok to use signed Py_ssize_t?

This still looks odd to me, but I agree with Petr - Py_ssize_t seems more consistent with the CPython API. size_t is used in the *alloc() functions.

vstinner commented 2 months ago

Ok, I updated the API again to move back to signed Py_ssize_t.

vstinner commented 2 months ago

@skirpichev: Are you ok with the latest API?

skirpichev commented 2 months ago

@vstinner, yes. Some minor nitpicks were mentioned above, feel free to ignore:

(PyLong_GetInfo() is slow, but it's not a problem, as it might be called just once at module initialization.)

vstinner commented 2 months ago

@vstinner:

@skirpichev: Are you ok with the latest API?

@skirpichev:

@vstinner, yes

Ah, good :-)

@skirpichev:

BTW, all required information about native layout is already available via PyLong_GetInfo(). PyLongLayout has additionally endian and digits_order fields. But I doubt those will be changed in future. So, PyLong_GetInfo() could essentially replace PyLong_GetNativeLayout().

The purpose of this API is to stop relying on Python internals and instead offer an API which exposes all the information needed by these features (import/export). Maybe CPython internals will not change, but maybe PyPy uses a different layout and can decide to change its layout between two PyPy versions. MicroPython, GraalPython, RustPython and others can also use a different layout. IMO it's worth exposing "digits_order" and "endian". Also, a structure is more convenient in C than a namedtuple.

Well, an alternative would be to extend sys.int_info with digits_order and endian, but I don't know if it makes sense to expose this info in Python.

For reference, sys.int_info doc: https://docs.python.org/dev/library/sys.html#sys.int_info

skirpichev commented 2 months ago

maybe PyPy uses a different layout

Certainly no. And I would be surprised if some bigint library (including Python implementations) used something different here. Can anyone point to examples?

Well, an alternative would be to complete sys.int_info to add digits_order and endian,

Or extend only PyLong_GetInfo() output.

zooba commented 2 months ago

I would be surprised if some bigint library (including Python implementations) using something different here. Can anyone point on examples?

You'd be surprised if there's a library out there that doesn't use 15-bit (or 30-bit) digits? I'd honestly be surprised if there was a bigint library that used those sizes, at least by default. Do they really all do it that way?

skirpichev commented 2 months ago

You'd be surprised if there's a library out there that doesn't use 15-bit (or 30-bit) digits?

No, it's fine: GMP is an example (that's why digit_size & bits_per_digit are really useful information). But GMP has the same behaviour wrt the digits_order and endian parameters. Is there a bigint library where the least significant "digit" ("limb", whatever) is stored last?

zooba commented 2 months ago

Ah, you just meant the digit order.

On one hand, they may not exist and we may not imagine a reason why they would, but on the other hand, that's why we didn't have this function in the first place. Better to include all the options in there, just in case, since we've already failed to predict the future once 😉

vstinner commented 2 months ago

It became difficult for me to follow this discussion and navigate between messages since [the PR] has 119 comments and this issue has 47 comments, and there are a few more related issues. So I decided to write down PEP 757 to summarize the discussion: https://github.com/python/peps/pull/3961. It might be easier to make a decision on a PEP.

vstinner commented 2 months ago

I wrote PEP 757 – C API to import-export Python integers, it's now online! I announced the PEP on discuss.python.org as well.

There is an open question (by @skirpichev):

I like PyLong_GetNativeLayout() and I prefer to keep it. I don't think it's redundant, since it uses C types which are more convenient to use in C. And I'm not sure about adding digits_order and endian members to sys.int_info: in Python, this is not useful information.

skirpichev commented 2 months ago

@vstinner, the description is now outdated wrt the current proposal. Maybe you can sync?

vstinner commented 2 months ago

@vstinner, the description is now outdated wrt the current proposal. Maybe you can sync?

PEP 757 now contains all details. So I just replaced the first message with a reference to PEP 757.