Closed zooba closed 9 months ago
Could you expound on why simply exposing _PyLong_AsByteArray etc. without leading underscore is undesirable?
Right now, _PyLong_AsByteArray is our fallback function for all the other conversions. So it doesn't handle compact ints at all, and is used in a variety of places scattered around our codebase (e.g. the struct module uses it).
Its signature doesn't lend itself to readability or understanding - it takes multiple 1-or-0 arguments, and requesting "unsigned" (one of those arguments) just results in an exception raised on a negative value. It has clearly grown over time, which is something we don't want the public API to do.
It also doesn't handle undersized buffers that well. It does handle filling in the full buffer before failing, which is great, but it doesn't calculate the required size. By adding a new function on top that does that, we could later combine the two parts (copying, and calculating the required size) for greater efficiency, though I suspect simply being able to handle compact ints efficiently will be more than enough perf win to be worthwhile.
Also, leaving it as it is means that anyone currently relying on it won't be unexpectedly impacted (I did add another option to skip setting exceptions, so they'll be slightly impacted, but it'll show up at compile time and can be fixed by adding a 1, rather than compiling but having subtly different semantics). This includes us - I'm very confident the struct module is unaffected, for example.
The reason for not just exposing the FromByteArray is the __index__ handling and also consistency with the other new one. If we'd rather change my names back to PyLong_FromByteArray etc. then I'd reconsider just making it a public alias, but it's already a very thin wrapper around the private one. CopyBits is the complicated one.
I guess you don't want to use just Bytes because that would imply a bytes object to many people? And yet Bits is very weird since all the values count bytes (as they should). Maybe there's a better name somewhere?
Yeah, we already have a Python "bytes", and "buffer", and "bytearray". "Bits" was the first that came to me that means (to me, at least) a plain old unadorned integer value, and doesn't already have a formal Python use.
I considered "CopyNative"/"FromNative" or "CopyNativeInt"/"FromNativeInt". The latter I don't mind, though just "Native" feels a bit... abstract? I also considered "AsTwosComplement"/"FromTwosComplement" which is more accurate in some ways and less meaningful in others.
"FromNativeBytes"/"AsNativeBytes" might be the best option? I only just came up with that. It doesn't feel as "cool" as CopyBits, but maybe people are more likely to guess what it's doing at first glance?
I'd say plain "Native" is fine, but I see nothing wrong with "NativeBytes" either.
Separately, do we really need the endianness/byte-order argument? Given that we already have existing APIs for known fixed sizes 32/64/128 (I think?), it seems unlikely that people want the order to be anything other than little-endian?
> do we really need the endianness/byte-order argument
Yes: there is no good default for this argument. For converting to int, you want native; for serialization to disk/network, you want a fixed one (and you should know which one you need).
> we already have existing APIs for known fixed sizes 32/64/128
No, we don't. That's the immediate problem this solves.
It also solves simple serialization, another pain point, but one that could be solved separately: that way, these functions would lose the endianness argument and need a different name -- Native would work. Trouble is, we'd soon need new functions that do take endianness.
> Maybe there's a better name somewhere?
There's Export, but that's probably better left for a much more complex API with digit size, bits in digit, and two kinds of endianness (https://github.com/python/cpython/issues/102471#issuecomment-1620352561).
Brainstorming ideas:
- IntoByteArray (to avoid suggesting it allocates the buffer)?
- IntoBuffer?
Would it be possible to first have an API proposed/blessed by consumers of the current API? Or an API based on existing APIs in other libraries such as GMP? I have too little knowledge to propose such an API.
PyObject *PyLong_FromBits(const void *buffer, size_t n_bytes, int endianness)
For example, this API doesn't allow passing a 32-bit word and saying that only the first 30 bits should be used. It looks incomplete.
I would prefer to wait until the community agrees on an API before the C API Working Group has to vote to approve or reject a specific API.
GMP's use case (non-byte digits) is explicitly out of scope here, see above. Yes, this API is incomplete in that respect.
It's been open for discussion on a public issue for weeks now, and the only request from potential consumers is feature creep.
I think it's fine to discuss designs with you all as well without having to drag you into the main thread (I'm happy to act as the focal point for all inputs), but apart from the name, I'm confident in the design.
> this API doesn't allow to pass a word of 32-bit and say that only the first 30 bits should be used. It looks incomplete.
No other API does either, so this is a feature request, not just an API request. But it's also why I was concerned about using "bits" in the name - if someone as experienced as Victor misunderstands the API, then that name is definitely not helpful!
I quite like PyLong_IntoBuffer or PyLong_AsNativeBytes (and their equivalents for "from"). Any other opinions/preference on these?
You can use "RawBytes" instead of "Bits" if you want to avoid the confusion with bytes objects. "Native" can be confused for "native endianness".
> The FromBits functions are basically just consistent wrappers around the existing private function. They'll do an __index__ call if necessary, but otherwise there was no gain in changing the implementation there.
Surely you mean it's PyLong_CopyBits that does an __index__ call? The FromBits functions don't take a PyObject argument to begin with.
> The return value of CopyBits is also unusual, but I believe justified.
Agreed.
I trust that Steve has explored the design space and that this is the API we need. I withdraw any bikeshedding apart from the name.
> I quite like PyLong_IntoBuffer or PyLong_AsNativeBytes (and their equivalents for "from"). Any other opinions/preference on these?
Let's go with Buffer.
"Buffer" can be confused with the buffer protocol as much as "Bytes" can be confused with bytes objects ;-)
Whatever. :-)
> Surely you mean it's PyLong_CopyBits that does an __index__ call? The FromBits functions don't take a PyObject argument to begin with.
Yes, my bad. The wrapper converts a -1 endianness to PY_LITTLE_ENDIAN and passes it to the private function, which only accepts an explicit little or big.
> I quite like PyLong_IntoBuffer or PyLong_AsNativeBytes (and their equivalents for "from"). Any other opinions/preference on these?
I actually just looked at my own quote and misread "Into" as "Int o'Buffer" 😆 Why has naming things got to be so hard!
@iritkatriel Do you have any thoughts?
> Endianness is -1 for native, or 0/1 for big/little.
There are different values for endianness in https://docs.python.org/3/library/struct.html#byte-order-size-and-alignment
Is there a convention?
> Is there a convention?
In our generated header file, PY_LITTLE_ENDIAN will be 1 or 0 depending on where it was configured, so it supports passing that directly (which I would expect to happen sooner or later). Since zero is taken, -1 is the next best "don't care" value (e.g. PyUnicode_FromWideChar accepts -1 to calculate the length based on a null char, and plenty of precedent outside our API) - making it 2 would make every reader question what it meant, and would also force adding new constants.
But also, struct bypasses this function. It operates at a lower level, and while it could be adapted to use it, the API isn't meant to be a substitute. I'd take the endianness argument away before making it more like struct. (Also, I have no idea why "network endian" gets its own value... guess I've totally missed out on that scenario.)
After sleeping on the name and handling a few more comments on the issue from likely users, I'm choosing PyLong_AsNativeBytes and PyLong_FromNative[Unsigned]Bytes.
The signature and behaviour basically match int.to_bytes and from_bytes in Python code (the main difference is that to_bytes can't handle negative numbers), and the "native" ought to suggest that it's not a Python bytes object. It seems some people want arbitrary sized digits (as opposed to 8-bit bytes), so hopefully using "bytes" will make it obvious that it isn't that API.
Thanks all for the discussion so far. If you have any more comments/questions, please raise them still. I'm not planning to put this into the stable ABI yet, so I don't want to hold for that discussion - I'd rather get the API into alpha and find out what works/doesn't first. But if people say to not even do that much, then I'll hold.
Apparently, these functions were added to Python 3.13 alpha4, and so the issue can be closed, no?
Over on https://github.com/python/cpython/issues/111140 and PR https://github.com/python/cpython/pull/114886 I've proposed an implementation for converting between PyLongObject and arbitrary-sized native integers. This is exposed (inefficiently) as the private _PyLong_AsByteArray with a more complex signature, but I believe this design is more appropriate for long-term stability, retains the ability for us to optimise, leads users into correct usage, supports existing needs (in particular, PyO3-style language interop), and is readable in source code. Full docs on the PR, but here's the prototype and summary:

Endianness is -1 for native, or 0/1 for big/little. The expectation is that most callers will just pass -1, particularly when calling from another language, but if someone passes PY_LITTLE_ENDIAN from config.h then it will do the correct thing.

PyLong_CopyBits returns -1 with an exception set, or the required size of the buffer - if the result is less/equal to n_bytes, no overflow occurred. No exception is raised for overflow, and buffer is always filled up to n_bytes even if higher bytes get dropped (like a C-style downcast).

The FromBits functions are basically just consistent wrappers around the existing private function. They'll do an __index__ call if necessary, but otherwise there was no gain in changing the implementation there.

About the only thing I don't love is using "bits" when the argument is the number of bytes. Taking bytes is nice for what I consider the "normal" case of PyLong_CopyBits(v, &int32_value, sizeof(int32_value), -1), and CopyBytes didn't seem to imply as low-level an operation as copying bits. There's some discussion of As/FromByteArray on the issue.

The return value of CopyBits is also unusual, but I believe justified. The callers are very unlikely to simply return an OverflowError back into Python, and will more likely clear it and do something different (perhaps allocating dynamic memory to copy the bits to), or just don't care and are happy to truncate (because they've already range checked).

I believe these are good candidates for the limited API, but I'm happy to ship them first and gather feedback before pushing for that. The current limited API alternative is to use PyObject_CallMethod(v, "to_bytes" ...) and copy out of the bytes object.

Any thoughts/suggestions/concerns about these APIs?