capi-workgroup / decisions

Discussion and voting on specific issues

PEP 757: Add import-export API for Python int objects #35

Closed: vstinner closed this 1 month ago

vstinner commented 4 months ago

See PEP 757 – C API to import-export Python integers.

skirpichev commented 4 months ago

To play with:

Later I'll show some simple benchmarks, but first please note that the current gmpy2 code special-cases "small" integers to make performance acceptable. I think it's a common pattern and should be supported in the second case as well (the current code in CPython main has a special case for 1-digit ints).

That's why I would prefer a simpler, less generic API, like the one proposed in #31. E.g. for reading:

if (PyLong_IsCompact(obj)) { /* should be a quick check */
    mpz_set_si(z, PyLong_CompactValue(obj));  /* compact value fits in some integer type */
}
else {
    /* fallback to "array of digits" view for abs(obj) */
    const Py_digit *digits = PyLong_GetDigits(obj);  /* can't fail for non-compact ints */
    mpz_import(z, ..., digits);
    if (PyLong_IsNegative(obj)) { /* looks better than PyLong_GetSign() ;) */
        mpz_neg(z, z);
    } 
}

(In this example, for simplicity, I assume that compact value fits in long.)
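The two-path read above can be modelled in pure Python. `to_digits` and `from_digits` are illustrative helpers, not part of any proposed API; the digit size is taken from sys.int_info:

```python
import sys

BITS = sys.int_info.bits_per_digit  # 15 or 30, depending on the build

def to_digits(n):
    # "Array of digits" view of abs(n), least significant digit first,
    # as the non-compact fallback path would see it.
    n = abs(n)
    digits = []
    while n:
        digits.append(n & ((1 << BITS) - 1))
        n >>= BITS
    return digits or [0]

def from_digits(digits, negative=False):
    # Recompose the value, as mpz_import() followed by mpz_neg() would.
    n = sum(d << (BITS * i) for i, d in enumerate(digits))
    return -n if negative else n

x = 1 << 300
assert from_digits(to_digits(x)) == x
assert from_digits(to_digits(-x), negative=True) == -x
```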

skirpichev commented 4 months ago

Benchmarks (one, two, ~ten, ~hundred digits) for gmpy2.

For export

On the master:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 233 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 295 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 491 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.33 us +- 0.01 us

With new API:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 37 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 315 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 508 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.34 us +- 0.01 us

Without special case for 1-digit:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 291 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 316 ns +- 1 ns

Using PyUnstable_Long_IsCompact (as in the code proposed above):

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 215 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 317 ns +- 2 ns

The new API's overhead seems to be noticeable for the 1-2 digit cases (up to 10-15%).

For import

On the master:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<7)' 'int(x)'
Mean +- std dev: 111 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<38)' 'int(x)'
Mean +- std dev: 155 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<300)' 'int(x)'
Mean +- std dev: 473 ns +- 6 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<3000)' 'int(x)'
Mean +- std dev: 2.20 us +- 0.00 us

With new API:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<7)' 'int(x)'
Mean +- std dev: 111 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<38)' 'int(x)'
Mean +- std dev: 155 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<300)' 'int(x)'
Mean +- std dev: 512 ns +- 5 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=mpz(1<<3000)' 'int(x)'
Mean +- std dev: 2.25 us +- 0.02 us

zooba commented 4 months ago

If PyLong_GetDigits returned the layout information with its call, then it could return ndigits=sizeof(ssize_t); bits_per_digit=8; digits=&compact_value for a compact value, and you shouldn't need the additional checks. The layout could be a pointer to a static struct easily enough, so it's not costly to return, though I prefer the stability of returning a pointer that requires a "free" call of some sort.

Otherwise, if we were to document this function as pairing with PyUnstable_Long_IsCompact then it probably should also be unstable. But I think it's better to leave that optimisation unspecified, if it's not needed for correctness.

PyAPI_DATA(const PyLongLayout) PyLong_LAYOUT;

Please, no more exported data. Make it a function that either returns a pointer to the layout, a function that copies the layout values into a struct, or integrate it with GetDigits as I suggested so that we can support multiple layouts.

// Array endian:
// * 1 for most significant byte first (big endian)
// * -1 for least significant first (little endian)
int8_t array_endian;

Does this mean "most significant digit first"? Or "most significant byte of each digit first"? Or both?

As we'd now use Py_digit in the ABI, is there a need to also provide its size via a function? Or can we rely on callers understanding how their C ABI works? (Or can we drop Py_digit from the ABI and just work with void * instead? In which case we must provide the digit size some other way.)

skirpichev commented 4 months ago

Does this mean "most significant digit first"? Or "most significant byte of each digit first"? Or both?

No, it's about endianness in the digit. We have word_endian to represent the order of digits.

Probably the naming could be better, e.g. endian instead of array_endian and/or digits_order instead of word_endian. It seems CPython consistently uses the term "digit" instead of "word" (as in libmpdec) or "limb" (as in GMP).

Or can we drop Py_digit from the ABI and just work with void * instead?

That seems to be a good idea. PyLongLayout has digit_size field.
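The two notions can be sketched in pure Python. The values below are purely illustrative; the two-digit example assumes CPython's 30-bit digits:

```python
import sys

# A single 30-bit digit value, stored in a 4-byte C "digit" type.
d = (1 << 29) | 1

# "endian" (array_endian in the draft) is the byte order *inside* each
# digit: for CPython this is simply the machine's native order.
raw = d.to_bytes(4, sys.byteorder)
assert int.from_bytes(raw, sys.byteorder) == d

# "digits_order" (word_endian) is the order of digits in the array:
# CPython stores the least significant digit first.
value = (3 << 30) | 5                # two 30-bit digits
digits = [value & ((1 << 30) - 1), value >> 30]
assert digits == [5, 3]              # least significant digit comes first
```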

encukou commented 4 months ago

I agree with @zooba, and I still stand by my earlier comment. I don't want to lock CPython -- and any future C API implementors -- into using a single layout for all ints.

vstinner commented 4 months ago

Do you mean adding a layout member to PyLong_DigitArray (export) and a layout parameter to PyLongWriter_Create() (import)?

typedef struct PyLong_DigitArray {
    ...
    const PyLongLayout *layout;  // <== NEW
} PyLong_DigitArray;

PyAPI_FUNC(int) PyLong_AsDigitArray(
    PyObject *obj,
    PyLong_DigitArray *array);

PyAPI_FUNC(PyLongWriter*) PyLongWriter_Create(
    ...
    const PyLongLayout *layout);  // <== NEW

skirpichev commented 4 months ago

I don't want to lock CPython -- and any future C API implementors -- into using a single layout for all ints.

JFR, I don't think this locks CPython in any way: this single layout is an asymptotic one (for "big" ints). The API has enough freedom to emulate this view for small integers.

@zooba's suggestion might work with an API like this (reading case):

typedef struct PyLongLayout {
    uint8_t bits_per_digit;
    uint8_t digit_sizeof;
    int8_t digits_order;
    int8_t digit_endianness;
    Py_ssize_t ndigits;
} PyLongLayout;

/* Returns a pointer to the digit array (or NULL, if obj is not a PyLongObject)
   and sets layout, which might differ wrt bits_per_digit and digit_sizeof for
   "small" and "big" ints. */
PyAPI_FUNC(const void *) PyLong_AsDigits(PyObject *obj, PyLongLayout *layout);

vstinner commented 4 months ago

The API has enough freedom to emulate this view for small integers.

Currently, exporting a small integer as an array of digits is an O(1) operation. There is no need to allocate memory or anything; all integers are implemented internally as an array of digits.

encukou commented 4 months ago

OK. Let's design API for what we have. If we need additional layouts, that kind of API can be added in the future, and if/when we add it, the API added here will start copying data. And if a non-CPython implementation uses a more elaborate scheme, it'll need to copy data too. I could live with that. Let's design it that way then.

PyLongLayout should be named PyLong_DigitArrayLayout, since the actual internal layout might change in the future.

Same with the writer: PyLong_DigitArrayWriter_Create & PyLong_DigitArrayWriter_Finish.

We also need a PyLong_DigitArrayWriter_Discard. Finish is trivial now, but if it can do a memory copy in the future, users should call Discard in error paths.

We don't need endianness in the struct. I think something like a new medium-int representation (which would need a new API), is much more likely than switching to non-native endianness. (Endian flags make sense if you want to serialize data, and can ask for a particular format, but the in-memory format can just always be native.)


As for avoiding Py_DATA, I filed https://github.com/capi-workgroup/problems/issues/80. I'm OK with a PyLong_GetDigitArrayLayout function rather than a PyLong_LAYOUT constant (or maybe in addition to a PyLong_DIGIT_ARRAY_LAYOUT constant?) but it's not clear to me why it's needed.


@skirpichev said:

I would prefer more simple and less generic API, like proposed in https://github.com/capi-workgroup/decisions/issues/31. E.g. for reading:

  if (PyLong_IsCompact(obj)) { /* should be a quick check */
      mpz_set_si(z, PyLong_CompactValue(obj));  /* compact value fits in some integer type */
  }
  else {
      /* fallback to "array of digits" view for abs(obj) */
      const Py_digit *digits = PyLong_GetDigits(obj);  /* can't fail for non-compact ints */
      mpz_import(z, ..., digits);
      if (PyLong_IsNegative(obj)) { /* looks better than PyLong_GetSign() ;) */
          mpz_neg(z, z);
      } 
  }

Would it work if you replace PyLong_IsCompact and PyLong_CompactValue with PyLong_AsNativeBytes and a digit_size buffer? That will do extra work for long ints, but the amount of extra work might be small enough (and offset by only calling one function). If it's not fast enough, maybe we should add a Py_ASNATIVEBYTES_COMPACT_ONLY flag, which would make PyLong_AsNativeBytes leave the buffer untouched and return 0 when you should use PyLong_AsDigitArray.
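A rough pure-Python model of that contract: `as_native_bytes` is a hypothetical helper standing in for PyLong_AsNativeBytes, which reports how many bytes the value needs (more than the buffer size means it did not fit and the digit-array export should be used). The byte count here is an approximation, not the real function's exact result:

```python
import sys

def as_native_bytes(obj, n):
    # Rough model: bytes needed for a signed native representation
    # (+1 sign bit, rounded up to whole bytes).
    needed = (obj.bit_length() + 8) // 8
    if needed <= n:
        # Fast path: the value fits in the caller's n-byte buffer.
        return needed, obj.to_bytes(n, sys.byteorder, signed=True)
    # Slow path: caller should fall back to the digit-array export.
    return needed, None

needed, buf = as_native_bytes(1 << 38, 8)
assert needed <= 8 and int.from_bytes(buf, sys.byteorder, signed=True) == 1 << 38
needed, buf = as_native_bytes(1 << 300, 8)
assert needed > 8 and buf is None
```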


Looking at the above example: most callers will need the sign bit, so PyLong_AsDigitArray should return 1 for negative numbers. (Or it could get an int *sign output argument like with PyLong_GetSign, but distinguishing 0 from positives doesn't sound too useful here.)

skirpichev commented 4 months ago

We don't need endianness in the struct.

You meant word_endian too?

Would it work if you replace PyLong_IsCompact and PyLong_CompactValue with PyLong_AsNativeBytes and a digit_size buffer?

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'  # huh
Mean +- std dev: 350 ns +- 2 ns
patch wrt the gmpy2 test branch:

```diff
diff --git a/src/gmpy2_convert_gmp.c b/src/gmpy2_convert_gmp.c
index 34f2546..e732080 100644
--- a/src/gmpy2_convert_gmp.c
+++ b/src/gmpy2_convert_gmp.c
@@ -44,24 +44,25 @@
 static void
 mpz_set_PyLong(mpz_t z, PyObject *obj)
 {
-    static PyLong_DigitArray long_export;
-
-    PyLong_AsDigitArray(obj, &long_export);
-    if (long_export.ndigits == 1) {
-        mpz_set_si(z, long_export.digits[0]);
+    long value;
+    Py_ssize_t bytes = PyLong_AsNativeBytes(obj, &value, sizeof(long), -1);
+    if (0 <= bytes && bytes <= (Py_ssize_t)sizeof(long)) {
+        mpz_set_si(z, value);
     }
     else {
+        static PyLong_DigitArray long_export;
+        PyLong_AsDigitArray(obj, &long_export);
         mpz_import(z, long_export.ndigits,
                    PyLong_LAYOUT.array_endian,
                    PyLong_LAYOUT.digit_size,
                    PyLong_LAYOUT.word_endian,
                    PyLong_LAYOUT.digit_size*8 - PyLong_LAYOUT.bits_per_digit,
                    long_export.digits);
+        if (long_export.negative) {
+            mpz_neg(z, z);
+        }
+        PyLong_FreeDigitArray(&long_export);
     }
-    if (long_export.negative) {
-        mpz_neg(z, z);
-    }
-    PyLong_FreeDigitArray(&long_export);
 }
```

Looking at the above example: most callers will need the sign bit

Not necessarily: this view rather represents the absolute value.

encukou commented 4 months ago

You meant word_endian too?

Yes. I see no reason to single that out as a detail that might change.

Mean +- std dev: 263 ns +- 2 ns
Mean +- std dev: 350 ns +- 2 ns

Hm, I guess that's not good enough? Does the Py_ASNATIVEBYTES_COMPACT_ONLY idea sound reasonable?

most callers will need the sign bit

I meant that instead of

        int result = PyLong_AsDigitArray(obj, &long_export);
        if (result < 0) goto error;
        mpz_import(...);
        if (long_export.negative) {
            mpz_neg(z, z);
        }
        PyLong_FreeDigitArray(&long_export);

you'd do

        int sign = PyLong_AsDigitArray(obj, &long_export);
        if (sign < 0) goto error;
        mpz_import(...);
        if (sign) {
            mpz_neg(z, z);
        }
        PyLong_FreeDigitArray(&long_export);

skirpichev commented 2 months ago

Hm, I guess that's not good enough?

Yes, especially the second case. You can see other benchmarks above.

Does the Py_ASNATIVEBYTES_COMPACT_ONLY idea sound reasonable?

Yes, if PyLong_AsNativeBytes can do a quick exit - that should be an alternative to PyLong_IsCompact/CompactValue.

int sign = PyLong_AsDigitArray(obj, &long_export);

This looks like a minor convenience to me. I think that to (optionally) query the sign, people should use dedicated functions (like PyLong_IsNegative()).

encukou commented 4 months ago

Oh, I forgot we already have PyUnstable_Long_IsCompact exposed. That's the check a Py_ASNATIVEBYTES_COMPACT_ONLY would do in the current implementation.

If you're chasing nanoseconds, I don't think you want to compile for the stable ABI, and I think you will want to change your code if the internals change. So, PyUnstable looks OK.

(That said, consider having the optimizations in #ifndef Py_LIMITED_API, and releasing stable-ABI wheels alongside the version-specific fast ones. Testers of CPython alpha releases would thank you, and hopefully, eventually, so would PyPy users.)

vstinner commented 3 months ago

@skirpichev:

Probably naming could be better. E.g. endian instead of array_endian and/or digits_order instead of word_endian. It seems, CPython consistently uses term "digit" instead of "word" (as libmpdec) or "limb" (as GMP).

I renamed word_endian to digits_order and array_endian to endian.

@encukou:

I agree with @zooba, and I still stand by my https://github.com/capi-workgroup/decisions/issues/31#issuecomment-2188269464. I don't want to lock CPython -- and any future C API implementors -- into using a single layout for all ints.

Ok, I modified the API to add layout parameters to import and export APIs.

vstinner commented 3 months ago

@skirpichev:

This looks as a minor convenience for me. I think to query (optionally) sign people should use dedicated functions (like PyLong_IsNegative()).

The function PyLong_IsNegative() doesn't exist. Or maybe you're thinking of a different function?

skirpichev commented 3 months ago

The function PyLong_IsNegative() doesn't exist.

Yes, but see https://github.com/capi-workgroup/decisions/issues/29

zooba commented 3 months ago

Mean +- std dev: 263 ns +- 2 ns
Mean +- std dev: 350 ns +- 2 ns

Hm, I guess that's not good enough?

If I'm reading the earlier benchmarks correctly, then we'd have to beat 111ns and 155ns for these, which is clearly only going to be possible by leaking access to Long internals. So it's fine for when using unstable APIs, and the slower path is probably fine for stable callers (it's still faster than failing ;) ). The difference is largely unavoidable function call overhead, a type check, and the (potentially unaligned) memcpy. I daresay we could add a faster path through AsNativeBytes for an aligned destination of sizeof(Py_ssize_t), but it's still unlikely to approach the speed of the macros.

But that's okay, blazing speed is what the unstable API is for, and those who need it can at least opt into using it at their own cost.

I do think that anything returning somewhat raw digits from a LongObject should also return the sign. One call to get all the info to accurately recreate the value, and you only need additional checks if you're creating your own fast path.

vstinner commented 2 months ago

(Or can we drop Py_digit from the ABI and just work with void * instead? In which case we must provide the digit size some other way.)

Ok, done.

We also need a PyLong_DigitArrayWriter_Discard. Finish is trivial now, but if it can do a memory copy in the future, users should call Discard in error paths.

I would prefer to avoid adding such a function for now; we can add it later if needed. What do you think?

vstinner commented 2 months ago

@skirpichev @zooba @encukou: I updated the API. What do you think? Are we good yet?

zooba commented 2 months ago

Is there any impact going to come out of the discussions of setting a hard (if theoretical) limit on the length?

vstinner commented 2 months ago

Is there any impact going to come out of the discussions of setting a hard (if theoretical) limit on the length?

@serhiy-storchaka may know about that :-) This API is limited by Py_ssize_t ndigits. If the number of digits fits into a signed size_t, we are good.

The PR https://github.com/python/cpython/pull/123498 defines # define MAX_LONG_DIGITS ((UINT64_MAX-1) / (uint64_t)PyLong_SHIFT), which is unsigned. If we want to be compatible with the gh-123498 maximum, the API should be modified to use an unsigned number of digits. But well, INT64_MAX/PyLong_SHIFT (signed) digits is already a very big number :-)

encukou commented 2 months ago

I would prefer to avoid adding such function for now, we can add it later if needed. What do you think?

So, to destroy the writer (e.g. in an error handling path), the user should call PyLongWriter_Finish and XDECREF the result?

That works, but I'd much prefer writers to be consistent and have both Finish and Discard.

skirpichev commented 2 months ago

If I'm reading the earlier benchmarks correctly, then we'd have to beat 111ns and 155ns for these

Sorry for the late answer @zooba, but no. Those timings are for the writing API (mpz->int). The benchmarks above are for reading (int->mpz). I'll repeat them all in one place:

For export

On the master:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 233 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 295 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 491 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.33 us +- 0.01 us

With new API (see https://github.com/aleaxit/gmpy/pull/495):

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 37 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 315 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<300' 'mpz(x)'
Mean +- std dev: 508 ns +- 3 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<3000' 'mpz(x)'
Mean +- std dev: 2.34 us +- 0.01 us

Without special case for 1-digit:

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 291 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 316 ns +- 1 ns

Using PyUnstable_Long_IsCompact (as in the code proposed above):

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 215 ns +- 1 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'
Mean +- std dev: 317 ns +- 2 ns

"if you replace PyLong_IsCompact and PyLong_CompactValue with PyLong_AsNativeBytes and a digit_size buffer":

$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<7' 'mpz(x)'
Mean +- std dev: 263 ns +- 2 ns
$ python -m pyperf timeit -q -s 'from gmpy2 import mpz;x=1<<38' 'mpz(x)'  # huh
Mean +- std dev: 350 ns +- 2 ns

So, using a quick check (currently PyUnstable_Long_IsCompact) to fall back on specialized functions for small integers, and PyLong_GetDigits() otherwise (only one new function for reading digits), has better performance than the current API.

Maybe you are right and we should introduce a bunch of simple PyUnstable_* functions instead of this more complicated API, which has performance regressions.

clearly only going to be possible by leaking access to Long internals

I doubt that using a single layout for "big enough" integers leaks something important. All MP libraries share that detail; CPython will hardly break this in the future.

I updated the API. What do you think? Are we good yet?

I like the changes.

vstinner commented 2 months ago

The PR https://github.com/python/cpython/pull/123498 defines # define MAX_LONG_DIGITS ((UINT64_MAX-1) / (uint64_t)PyLong_SHIFT) which is unsigned. If we want to be compatible with gh-123498 maximum, the API should be modified to use unsigned digits number.

I'm wrong. MAX_LONG_DIGITS fits into a signed size_t (Py_ssize_t), since it's a number of digits, not a number of bits: the / (uint64_t)PyLong_SHIFT division makes MAX_LONG_DIGITS smaller than PY_SSIZE_T_MAX.
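That bound can be checked with a line of arithmetic; the sketch below covers both PyLong_SHIFT values (15 for 16-bit-digit builds, 30 for 32-bit-digit builds):

```python
# MAX_LONG_DIGITS = (UINT64_MAX - 1) / PyLong_SHIFT must fit in Py_ssize_t.
UINT64_MAX = 2**64 - 1
PY_SSIZE_T_MAX = 2**63 - 1

for shift in (15, 30):          # PyLong_SHIFT for the two digit-size builds
    max_long_digits = (UINT64_MAX - 1) // shift
    # Dividing by the shift pulls the value well below PY_SSIZE_T_MAX.
    assert max_long_digits < PY_SSIZE_T_MAX
```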

vstinner commented 2 months ago

@encukou:

So, to destroy the writer (e.g. in an error handling path), the user should call PyLongWriter_Finish and XDECREF the result?

You can prepare your data in advance before creating a writer, so later, your code cannot fail.

That works, but I'd much prefer writers to consistently and have both Finish and Discard.

Ok, I added a PyLongWriter_Discard() function.

vstinner commented 2 months ago

I open a vote on this API:

skirpichev commented 2 months ago

I would appreciate @casevh's opinion, but so far this API has a severe performance degradation for reading small integers (int->mpz conversion), up to 25%. That's hardly acceptable. So projects like gmpy2, sage or flint-python will probably use some other code path for such integers.

I'm wrong. MAX_LONG_DIGITS fits into signed size_t (Py_ssize_t), since it's a number of digits, not a number of bits

Yet maybe we should use size_t here, like GMP? A signed integer type looks odd in this context.

zooba commented 2 months ago

I doubt that using single layout for "big enough" integers leaks something important. All MP libraries share that detail, hardly CPython will break this in a future.

We want to preserve code that has already been compiled, which means if we change the size of the digits but we already leaked that into other compiled code (i.e. hardcoded from a macro), then we've broken it. The whole point of the Unstable APIs is to give us more opportunities to make these kinds of changes - otherwise we'd be looking at 3-5 years/releases if we decide that a bigger/smaller digit size is better for everyone.

The stable digit size is 8 bits, and personally I'm quite okay with the performance hit for that stability. But if we are also to guarantee a second digit size with the same stability then we either risk adding a performance hit later (if we now have to start converting digits) or get stuck never gaining anything because we aren't able to change the internal representation. So the best you're going to get here is "fast+unstable (recompile each release)" or "fast+dynamic digit size".

I open a vote on this API:

Only thing I want to change here is the const PyLongLayout *layout argument to PyLongWriter_Create. I'm pretty sure this is a YAGNI situation, and we should just say "call PyLong_GetNativeLayout to get the layout info you need for PyLongWriter_Create" and not allow passing the info in (I certainly don't want to maintain the conversion code - that can go live in gmp and friends with the more complex export functions). I'm happy enough with the rest, though will defer the "is it satisfactory" question to @skirpichev.

casevh commented 2 months ago

My apologies for not following this thread as much as I would have liked. I'll also defer to @skirpichev for the "is it satisfactory" question.

I started optimizing gmpy around version 1.10. The single biggest win was using mpz_import/mpz_export directly into the PyLong internals. The only change to the code was support for 30-bit digits. The same code worked from Python 2.5 through Python 3.11. But this did require version specific builds.

The second win was using the PyLong_AsLongAndOverflow (and its variants) functions for conversion of Python ints into temporary C ints for direct use in many of the GMP function calls. Optimizing the performance of (mpz * 3) + 1 was more important than optimizing mpz(3) and mpz(1) because most of my code did mixed arithmetic. Note: the performance win was primarily from avoiding the creation of an exception.

As long as access to an unstable interface remains, I accept that the stable interface may be slightly slower. If the difference is significant, we will need to release stable and unstable versions.

Thanks for all the effort.

skirpichev commented 2 months ago

if we change the size of the digits but we already leaked that into other compiled code (i.e. hardcoded from a macro), then we've broken it.

Ah, the PyLong_GetDigits() declaration has Py_digit*... Well, this can be void*. To properly interpret this output you need something like PyLong_GetNativeLayout(). I think this preserves the freedom to change the digit size, just as in the API proposed by Victor.

Only thing I want to change here is the const PyLongLayout *layout argument to PyLongWriter_Create.

+1

I'm also not sure if we should keep the layout field in the PyLong_DigitArray struct. What else could it be, if not the native layout?

BTW, all the required information about the native layout is already available via PyLong_GetInfo(). PyLongLayout additionally has endian and digits_order fields, but I doubt those will change in the future. So PyLong_GetInfo() could essentially replace PyLong_GetNativeLayout().
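What PyLong_GetInfo() already exposes can be seen from its Python-level counterpart, sys.int_info (a sketch; the field checks reflect current CPython builds):

```python
import sys

info = sys.int_info

# Digit layout data already exposed today: digit size in bits and bytes.
assert info.bits_per_digit in (15, 30)
assert info.sizeof_digit in (2, 4)

# Each digit leaves some bits of its C type unused (16-15=1 or 32-30=2),
# which is what mpz_import()'s "nails" argument accounts for.
assert info.sizeof_digit * 8 - info.bits_per_digit in (1, 2)

# What it lacks, and what PyLongLayout would add explicitly:
assert not hasattr(info, "endian")
assert not hasattr(info, "digits_order")
```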

I accept that the stable interface may be slightly slower. If the difference is significant, we will need to release stable and unstable versions.

@casevh, here are some benchmarks. Does this look acceptable for you?

casevh commented 2 months ago

@casevh, https://github.com/capi-workgroup/decisions/issues/35#issuecomment-2328216925 are some benchmarks. Does this look acceptable for you?

Have you tried PyLong_AsLongAndOverflow instead of PyUnstable_Long_IsCompact? It might not be as fast but it is stable (I think).

skirpichev commented 2 months ago

Have you tried PyLong_AsLongAndOverflow instead of PyUnstable_Long_IsCompact?

Yes, I did, just didn't include this case in the benchmarks above. It's good for 1-2 digits and for many digits, but for ~10 digits performance is negatively impacted.

benchmarks code:

```python
# r.py; reading
import pyperf
from gmpy2 import mpz

runner = pyperf.Runner()
runner.bench_func('1<<7', mpz, 1 << 7)
runner.bench_func('1<<38', mpz, 1 << 38)
runner.bench_func('1<<300', mpz, 1 << 300)
runner.bench_func('1<<3000', mpz, 1 << 3000)
```

```python
# w.py; writing
import pyperf
from gmpy2 import mpz

runner = pyperf.Runner()
runner.bench_func('1<<7', int, mpz(1 << 7))
runner.bench_func('1<<38', int, mpz(1 << 38))
runner.bench_func('1<<300', int, mpz(1 << 300))
runner.bench_func('1<<3000', int, mpz(1 << 3000))
```
Benchmark results for reading (now my PR uses PyLong_AsLongAndOverflow):

| Benchmark | master | new-api |
|-----------|--------|---------|
| 1<<7 | 263 ns | 268 ns: 1.02x slower |
| 1<<38 | 317 ns | 276 ns: 1.15x faster |
| 1<<300 | 514 ns | 588 ns: 1.14x slower |
| 1<<3000 | 2.35 us | 2.42 us: 1.03x slower |
| Geometric mean | (ref) | 1.01x slower |

Hmm, maybe it's a good tradeoff.

JFR, benchmark results for writing:

| Benchmark | master | new-api |
|-----------|--------|---------|
| 1<<38 | 213 ns | 210 ns: 1.02x faster |
| 1<<300 | 535 ns | 575 ns: 1.07x slower |
| 1<<3000 | 2.27 us | 2.32 us: 1.02x slower |
| Geometric mean | (ref) | 1.02x slower |

Benchmark hidden because not significant (1): 1<<7

encukou commented 2 months ago

Only thing I want to change here is the const PyLongLayout *layout argument to PyLongWriter_Create. I'm pretty sure this is a YAGNI situation

Yes, later we can add PyLongWriter_CreateWithLayout with an extra argument.

vstinner commented 2 months ago

Ok, I updated the API:

vstinner commented 2 months ago

@skirpichev: IMO it's fine to rely on PyUnstable_Long_IsCompact() for best performance. Even if it has "unstable" in its name, I expect the API not to change in the next releases, even if the implementation might change in the future. Well, or you can use PyLong_AsLongAndOverflow(), as you prefer.

encukou commented 2 months ago

Use an unsigned type (size_t) for ndigits.

AFAIK, Python uses Py_ssize_t for counting bytes in nearly all existing APIs. I think we should continue doing that, even if GMP uses size_t.

vstinner commented 2 months ago

@encukou:

AFAIK, Python uses Py_ssize_t for counting bytes in nearly all existing APIs. I think we should continue doing that, even if GMP uses size_t.

@skirpichev:

Yet maybe we should use size_t here, like GMP? Signed integer type looks odd in this context.

@skirpichev: are you ok to use signed Py_ssize_t?

skirpichev commented 2 months ago

are you ok to use signed Py_ssize_t?

This still looks odd to me, but I agree with Petr - Py_ssize_t seems more consistent with the CPython API. size_t is used in the *alloc() functions.

vstinner commented 2 months ago

Ok, I updated the API again to move back to signed Py_ssize_t.

vstinner commented 2 months ago

@skirpichev: Are you ok with the latest API?

skirpichev commented 2 months ago

@vstinner, yes. Some minor nitpicks were mentioned above, feel free to ignore:

(PyLong_GetInfo() is slow, but it's not a problem, as it might be called just once at module initialization.)

vstinner commented 2 months ago

@vstinner:

@skirpichev: Are you ok with the latest API?

@skirpichev:

@vstinner, yes

Ah, good :-)

@skirpichev:

BTW, all required information about native layout is already available via PyLong_GetInfo(). PyLongLayout has additionally endian and digits_order fields. But I doubt those will be changed in future. So, PyLong_GetInfo() could essentially replace PyLong_GetNativeLayout().

The purpose of this API is to stop relying on Python internals and instead offer an API which exposes all the information needed by these features (import/export). Maybe CPython internals will not change, but maybe PyPy uses a different layout and can decide to change its layout between two PyPy versions. MicroPython, GraalPython, RustPython and others can also use a different layout. IMO it's worth exposing "digits_order" and "endian". Also, a structure is more convenient in C than a namedtuple.

Well, an alternative would be to extend sys.int_info with digits_order and endian, but I don't know if it makes sense to expose this info in Python.

For reference, sys.int_info doc: https://docs.python.org/dev/library/sys.html#sys.int_info

skirpichev commented 2 months ago

maybe PyPy uses a different layout

Certainly no. And I would be surprised if some bigint library (including Python implementations) used something different here. Can anyone point to examples?

Well, an alternative would be to complete sys.int_info to add digits_order and endian,

Or extend only PyLong_GetInfo() output.

zooba commented 2 months ago

I would be surprised if some bigint library (including Python implementations) using something different here. Can anyone point on examples?

You'd be surprised if there's a library out there that doesn't use 15-bit (or 30-bit) digits? I'd honestly be surprised if there was a bigint library that used those sizes, at least by default. Do they really all do it that way?

skirpichev commented 2 months ago

You'd be surprised if there's a library out there that doesn't use 15-bit (or 30-bit) digits?

No, it's fine: GMP is an example (that's why digit_size & bits_per_digit are really useful information). But GMP has the same behaviour wrt the digits_order and endian parameters. Is there a bigint library where the least significant "digit" ("limb", whatever) is stored last?

zooba commented 2 months ago

Ah, you just meant the digit order.

On one hand, they may not exist and we may not imagine a reason why they would, but on the other hand, that's why we didn't have this function in the first place. Better to include all the options in there, just in case, since we've already failed to predict the future once 😉

vstinner commented 2 months ago

It became difficult for me to follow this discussion and navigate between messages since [the PR] has 119 comments and this issue has 47 comments, and there are a few more related issues. So I decided to write down PEP 757 to summarize the discussion: https://github.com/python/peps/pull/3961. It might be easier to make a decision on a PEP.

vstinner commented 2 months ago

I wrote PEP 757 – C API to import-export Python integers, it's now online! I announced the PEP on discuss.python.org as well.

There is an open question (by @skirpichev):

I like PyLong_GetNativeLayout() and I prefer to keep it. I don't think it's redundant, since it uses C types which are more convenient to use in C. And I'm not sure about adding digits_order and endian members to sys.int_info: in Python, this is not useful information.

skirpichev commented 2 months ago

@vstinner, the description is now outdated wrt the current proposal. Maybe you can sync?

vstinner commented 2 months ago

@vstinner, the description is now outdated wrt the current proposal. Maybe you can sync?

PEP 757 now contains all details. So I just replaced the first message with a reference to PEP 757.