capi-workgroup / decisions

Discussion and voting on specific issues
8 stars 1 forks source link

Vote on PEP 757 – C API to import-export Python integers #45

Closed vstinner closed 1 day ago

vstinner commented 1 month ago

Vote on PEP 757 – C API to import-export Python integers.

Since I co-authored PEP 757, I abstain from voting on my own PEP :-)

I closed the previous issue since it had a long history and the API changed multiple times.

encukou commented 1 month ago

I'd prefer being able to deprecate PyLong_Export -- there's no version directly in struct PyLongExport, so if things change too much we'll need to add new API. And that's OK, but we will want deprecation/removal in that case. Otherwise, +1 from me!

skirpichev commented 1 month ago

@encukou,

  1. How long API should live? E.g. mpz_import/export API was in the GMP since 2002.
  2. PyLongLayout has no version field too. Are you ok with that?
  3. Which changes do you want to address? I think we should keep something in mind, even highly hypothetical, if going to add a version field.

I address d.p.o comment also here:

This function always succeeds if obj is a Python int object or a subclass. [...] This means we can never deprecate this function. We might want to do that if, for example, we introduce an internal format that doesn’t fit in the struct PyLongExport parameters. I suggest removing the line.

Probably, this concern also could be resolved with a new version field, right? We did our best to preserve "always successful" constraint and I would like to keep it. (And it requires much less code adaptation on the gmpy2 side.)

Edit: lets look closer on the structure

typedef struct PyLongExport {
    int64_t value;
    uint8_t negative;
    Py_ssize_t ndigits;
    const void *digits;
    // Member used internally, must not be used for other purpose.
    Py_uintptr_t _reserved;
} PyLongExport;

Most public fields (negative/ndigits/digits) are required, else the whole API doesn't make sense. So, we left with one field: value - that part might be versioned. How about this (which is closer to the original proposal by @pitrou):

typedef struct PyLongExport2 {
    uint8_t is_compact;
    union {
        struct {
            uint8_t version;
            int64_t value;
        } compact_value;
        struct {
            uint8_t negative;
            Py_ssize_t ndigits;
            const void *digits;
            // _reserved could be actually here.
        } digit_array;
    };
    // Member used internally, must not be used for other purpose.
    Py_uintptr_t _reserved;
} PyLongExport2;

Or we could use just reserve enough room for compact_value:

struct {
    int64_t value[2];
} compact_value;
encukou commented 1 month ago

How long API should live?

As long as we can support it, but it shouldn't be forced to live forever :)

mpz_import/export API was in the GMP since 2002.

Sure, but AFAIK there's no expectation of that being a zero-copy mechanism, and it is in a library that's expected to provide conversions between arbitrary formats. It seems that adding a mpz_export2 with an extra argument would be too much trouble for GMP.

But, not being able to fail with MemoryError, PyLong_Export practically can't switch away from being a zero-copy mechanism. The API prevents future CPython (and any other implementations that use the C API) from using alternative int representations internally.

PyLongLayout has no version field too. Are you ok with that?

Oh, right -- a version field would be better on PyLongLayout; then PyLongExport wouldn't need one.

No version anywhere means that if/when PyLongLayout/PyLongExport no longer fit, we'll need to deprecate PyLong_Export itself. That's an OK plan, if we allow ourselves to deprecate it (which means raising exceptions under -Werror).

Which changes do you want to address?

Examples are 2's complement, or a 64-bit compact value with a separate sign bit.

But the ability to deprecate functions is mainly useful for unknown unknowns. I can't see decades ahead.

skirpichev commented 1 month ago

But, not being able to fail with MemoryError, PyLong_Export practically can't switch away from being a zero-copy mechanism. The API prevents future CPython (and any other implementations that use the C API) from using alternative int representations internally.

Not all. Only if this representation can't fit in 64-bit value. But if this field is too small - we can make it versioned or just bigger from start.

Oh, right -- a version field would be better on PyLongLayout; then PyLongExport wouldn't need one.

Then I'm lost :( The PyLongLayout has nothing to say about representation for "compact" values, that fit in some machine-sized integer.

It's used to describe "asymptotic" representation of CPython integers as "digit array". (Maybe PEP isn't clear enough here?) Which else field we might want to add to this structure?

Examples are 2's complement, or a 64-bit compact value with a separate sign bit.

I'm assuming you meant 2's complement for "compact" values, not for representation of big integers. Then both variants could be solved as described above:

  1. versioned structure compact_value in the union
  2. or just more big structure, say 128-bit

If this complication is required - I'm slightly biased to (1).

vstinner commented 1 month ago

versioned structure compact_value in the union

I dislike anonymous union, it caused us some troubles in the past.

or just more big structure, say 128-bit

I dislike having two numbers, you have to define the order. I don't think that it's needed. I suggest to revisit the API once Python will need 128-bit. For now, it only needs 60-bit.

vstinner commented 1 month ago

Examples are 2's complement, or a 64-bit compact value with a separate sign bit.

We can make a small change to PEP 757:

It doubles the capacity of PyLongExport.value.

pitrou commented 1 month ago

We can make a small change to PEP 757:

What would be the point though? How likely is it that CPython switches to an uncommon "uint64 + sign bit" representation? @markshannon what do you think?

vstinner commented 1 month ago

What would be the point though?

It may simplify the code using this API: https://peps.python.org/pep-0757/#export-pylong-export-with-gmpy2

Currently, this code is needed:

            mpz_import(z, 1, -1, sizeof(int64_t), 0, 0, &value);
            if (value < 0) {
                mpz_t tmp;
                mpz_init(tmp);
                mpz_ui_pow_ui(tmp, 2, 64);
                mpz_sub(z, z, tmp);
                mpz_clear(tmp);
            }

I expect that it can be simplified to the following code which should be more efficient:

            mpz_import(z, 1, -1, sizeof(uint64_t), 0, 0, &value);
            if (long_export.negative) {
                mpz_neg(z, z);
            }

Note: currently, only Windows uses this code path, since Windows only has 32-bit C long for mpz_set_si() fast-path.

encukou commented 1 month ago

I should never have mentioned any examples.

A PyLong_Export that cannot fail (for ints), cannot be deprecated, and cannot switch to copying data, limits the internal int representation (in CPython and other implementations of the C API) to only what's exportable through PyLongExport. IMO, that is unacceptable. We need a way to evolve in the future, not guess now what the future might be.

vstinner commented 1 month ago

I'd prefer being able to deprecate PyLong_Export -- there's no version directly in struct PyLongExport, so if things change too much we'll need to add new API. And that's OK, but we will want deprecation/removal in that case.

I wrote https://github.com/python/peps/pull/4025 to allow PyLong_Export() to fail.

pitrou commented 1 month ago

It may simplify the code using this API: https://peps.python.org/pep-0757/#export-pylong-export-with-gmpy2

The goal of this API is not to simplify consumer code, it's to make the export as efficient as possible.

To make the export as efficient as possible, it must be as close as possible to CPython's internal representation.

CPython is much more likely to adopt a int64 representation internally, than a "uint64 + sign bit" representation. The latter simply doesn't make sense, because it can't fit in a machine register.

Thus the export API should favor int64 over "uint64 + sign bit".

skirpichev commented 1 month ago

I dislike anonymous union, it caused us some troubles in the past.

Was it before or after adding the C11 compiler requirement? With union, the structure will be also smaller (both in the current variant & with version field).

I dislike having two numbers, you have to define the order.

Native. Something like that does make sense for me with hypothetical 128-bit machine int's. Using e.g. uint64_t + sign doesn't look as a possible future for CPython integers.

I don't think that it's needed.

What about version field?

vstinner commented 1 month ago

Was it before or after adding the C11 compiler requirement?

In short, the Python C API only requires C99, not C11: see the long discussion.

vstinner commented 1 month ago

@encukou:

I'd prefer being able to deprecate PyLong_Export -- there's no version directly in struct PyLongExport, so if things change too much we'll need to add new API. And that's OK

I don't think that we should add a version member. I don't see how users would be supposed to use it. If things change, I would prefer a new API, as @encukou wrote.

encukou commented 1 month ago

The union doesn't need to be anonymous, let's just name it.

How about:

typedef struct PyLongExport2 {
    // type also serves as a version field: on unknown type,
    // users must fall back to e.g. PyLong_AsNativeBytes.
    // (And it's big because the next field is
    // pointer-aligned, there's a lot of padding anyway.)
    uint32_t type;  
    union {
        int64_t compact_value;
        struct {
            Py_ssize_t ndigits;
            const void *digits;
            uint8_t negative;

            // Member used internally, must not be used for other purpose.
            Py_uintptr_t _reserved;
        } digit_array;
    } data;
} PyLongExport2;

(edit: the union should be last so the struct can grow)

Maybe this discussion should be on Discourse, not here...

vstinner commented 1 month ago

The union doesn't need to be anonymous, let's just name it.

Which problem are you trying to address with this different structure?

encukou commented 1 month ago

Which problem are you trying to address with this different structure?

Compared to @skirpichev's earlier proposal:

skirpichev commented 1 month ago

The union doesn't need to be anonymous, let's just name it.

I'm not sure if anonymous unions are banned from public API. (The long discussion seems to be unfinished.)

But the rest clearly does make sense. This even more closer to the original proposal by Antoine.

vstinner commented 1 month ago

@erlend-aasland @mdboom @serhiy-storchaka: What's your call on the PEP 757?

skirpichev commented 1 month ago

Note that Import-API (which requires also PyLongLayout description for int's) from previous discussions seems much less controversial. I think it helps if C-API will have at least that part of PEP.

Maybe it worth to make separate pools? The Export API + Layout API and the Import API + Layout API.

vstinner commented 1 month ago

@encukou:

I'd prefer being able to deprecate PyLong_Export -- there's no version directly in struct PyLongExport, so if things change too much we'll need to add new API. And that's OK, but we will want deprecation/removal in that case. Otherwise, +1 from me!

We modified PEP 757 to remove the following sentence from PyLong_Export() function:

This function always succeeds if obj is a Python int object or a subclass.

So you can now deprecate the function by emitting DeprecationWarning, which can set an exception if the warning is treated as an error.

vstinner commented 1 month ago

Nobody voted so far. Does it mean that you're against PEP 757, or just that you are waiting until the discussion settles down?

encukou commented 1 month ago

I'm +0 for the current PEP. It does the job, but I think the union-based API from the discussion could do it better.

skirpichev commented 1 month ago

@encukou, here you suggest that "other workarounds" for your concerns (including union-based API, where PyLong_Export() returns a type of export) "IMO worse". Did you change your mind or I miss something?

What do you think on the Import API (layout + PyLongWriter)? To me it looks PEP going to be rejected. If so, I don't see good reasons to loose that part (which got much less criticism) too.

encukou commented 1 month ago

Yes, I changed my mind. But not much, I'm still +0 here.

A disadvantage of the checkbox-based voting system is that it doesn't distinguish between “no”, “+0”, and “didn't vote yet”.

The import API looks is fine to me.

vstinner commented 3 weeks ago

@mdboom @serhiy-storchaka @zooba: Would you mind to review PEP 757 (and vote)?

zooba commented 2 weeks ago

One minor nit (which can be clarified in docs, as far as I'm concerned, rather than having to update the PEP) is the lifetime of the digits provided to PyLongWriter_Create - can I change/free it immediately after Create or do I have to wait until Finish/Discard?

There's no mention in the PEP of limited API - is this going in? I think we should also add PyLong_AsNativeBytes and FromNativeBytes to the limited API at the same time, and I'm okay with that being in 3.14.

skirpichev commented 2 weeks ago

rather than having to update the PEP

(I think it's fine, if updates are minor. Then we can copy-paste docs to the implementation pr.)

can I change/free it immediately after Create or do I have to wait until Finish/Discard?

But there is no other options to free digits. It's lifetime is same as for the PyLongWriter object; it's not a caller job to allocate memory for this array - that does PyLongWriter_Create. And to free it, here is the rule: "The caller can either initialize the array of digits digits and then call PyLongWriter_Finish() to get a Python int, or call PyLongWriter_Discard() to destroy the writer instance."

There's no mention in the PEP of limited API - is this going in?

I think it should be good enough for this. @vstinner?

But note that now e.g. https://github.com/capi-workgroup/decisions/issues/29 is not going to be in the limited API. That's a better candidate.

I think we should also add PyLong_AsNativeBytes and FromNativeBytes to the limited API

And this; +1.

zooba commented 2 weeks ago

But there is no other options to free digits.

You're right, I misread. It's an out parameter (which makes far more sense). So yes, no need to change anything.

vstinner commented 2 weeks ago

There's no mention in the PEP of limited API - is this going in?

I would prefer to not add these APIs to the limited C API in Python 3.14. There is no need for that right now, it can be done later.

vstinner commented 2 weeks ago

@zooba:

One minor nit (which can be clarified in docs, as far as I'm concerned, rather than having to update the PEP) is the lifetime of the digits provided to PyLongWriter_Create - can I change/free it immediately after Create or do I have to wait until Finish/Discard?

You must call Finish or Discard to free digits. You cannot/must not free it explicitly.

Technically, digits is a pointer to an integer digits: the memory is a PyLongObject object. But it's an implementation detail ;-)

vstinner commented 2 weeks ago

@mdboom @serhiy-storchaka: Gentle reminder: Would you mind to review PEP 757 (and vote)? Don't hesitate if you have doubts or questions.

skirpichev commented 1 week ago

Based on @serhiy-storchaka feedback in https://github.com/python/cpython/issues/111140#issuecomment-1773710503, I would guess that he is seconded to the Mark's opinion in the discussion thread. I.e. it's better to have mpz_import/export-like API on the CPython side.

We tried to address this concern here: https://peps.python.org/pep-0757/#provide-mpz-import-export-like-api-instead (this slightly updated by https://github.com/python/peps/pull/4126).

vstinner commented 4 days ago

This issue is open for almost 2 months. PEP 757 got 4 positive votes, but @serhiy-storchaka didn't vote yet. Does the vote need 100% majority?

serhiy-storchaka commented 1 day ago

I am very sorry for being a hindrance. I would prefer more mpz_import/mpz_export like API, but it has serious drawbacks: more complex code, not zero-copy, and most importantly, it only works with absolute values -- this makes it implementation dependent. The PEP addresses this option. Now, I don't want to delay the vote even further by bikeshedding. I am +1.

vstinner commented 1 day ago

@serhiy-storchaka:

I am very sorry for being a hindrance. I would prefer more mpz_import/mpz_export like API, but it has serious drawbacks: more complex code, not zero-copy, and most importantly, it only works with absolute values -- this makes it implementation dependent. The PEP addresses this option. Now, I don't want to delay the vote even further by bikeshedding. I am +1.

Thanks Serhiy. I mark the PEP as Accepted and close the issue.

The PR https://github.com/python/cpython/pull/121339 is now ready for your final review :-)

skirpichev commented 1 day ago

I am very sorry for being a hindrance.

(You are not. 3.14.b1 scheduled on 2025-05-06. It's a long way to...)

not zero-copy

That depends on the lib. Probably, most will offer GMP-like access to internals.

it only works with absolute values

I would like to see some bigints library, with a different internals (say 2s complement).

Now, I don't want to delay the vote even further by bikeshedding.

(I would appreciate your bikeshedding on https://github.com/python/peps/pull/4111.)

zooba commented 1 day ago

That depends on the lib. Probably, most will offer GMP-like access to internals.

The problem with the GMP-like interface would've been that the caller can specify the shape of the result. We didn't want to implement that, so it would likely have been a case where the caller has to provide exactly the settings that match our internal representation or they get an error. That's silly, so we use a different interface where we tell them how it's shaped. That's all that is different here.

serhiy-storchaka commented 1 day ago

That depends on the lib. Probably, most will offer GMP-like access to internals.

Does it work with LibTomMath (the library used in Tcl)? At least Python did not provide such access to inteomrnals before this PEP.

I would like to see some bigints library, with a different internals (say 2s complement).

It seems that cpp_int in Boost implements this. I myself implemented a minimal arbitrary fixed size integer library on templates over 20 years ago, and 2s complement was a natural choice.

skirpichev commented 13 hours ago

Does it work with LibTomMath (the library used in Tcl)?

I think so: the mp_int seems to be public. You can do similar in the flint too.

It seems that cpp_int in Boost implements this.

(Isn't that just a wrapper to the GMP/TomMath?)

arbitrary fixed size integer library

I would say this rather fits to extensions of machine integer arithmetic, i.e. of fixed precision (albeit arbitrary). Probably without need to something beyond school algorithms for arithmetic.