capi-workgroup / problems

Discussions about problems with the current C Api

Concrete strings #16

Open encukou opened 1 year ago

encukou commented 1 year ago

From https://github.com/jpype-project/jpype/discussions/1071#discussioncomment-2835357 :

In theory Java and Python have compatible definitions of strings. Both are immutable and thus one should just be able to wrap a Java string as a Python string and be done. However, strings are not a protocol but rather concrete objects, thus I can't just implement an interface for the C API of Python string and make them compatible. This failing forces either immediate conversion (which itself is problematic if a string is very large and the user does not intend to work with it in Python extensively), or use a conversion. This problem does not just affect Java wrappers, but Qt wrappers and many other language bindings where immutable strings are available.

gvanrossum commented 1 year ago

It feels like this is going to be a tough sell if it has a noticeable performance effect on how strings are typically used by CPython. OTOH the unicode object already has many representations under the hood. Maybe it would be possible to add another? It would have to wrap and own the Java string object. (Something would have to wrap and own it, we can't just cast a Java object pointer to PyObject *.)

Thrameos commented 1 year ago

The only requirement here is that when using the string, something needs to be called to prepare the string before usage, and the memory for the string may be stored elsewhere. What will happen for Java or C# is that the string-ready call will check whether the object has already been transferred, in which case it returns immediately; otherwise it will call some routine to make the memory available using one of the existing Python protocols. Java uses a funny encoding that is neither UTF-8 nor UTF-16 but something in between. When the string is destroyed it would then need to release the memory.

For non-abstract strings it would just check the slot and find there was no abstract string slot so it proceeds. So the cost would be one slot check per usage.

Thus the proposal would be to have a PyStringReady() slot and a slot for destroying the string. The memory space for the string could hold a pointer to the external memory. Bindings would use lazy transfer to move their string into Python when ready is called, and check whether the string was ever readied, in which case they would release the external memory.

Ideally this all happens behind the scenes such that there are never any changes on the user's side. Calls that access the string data (i.e. PyUnicode_1BYTE_DATA, PyUnicode_2BYTE_DATA, PyUnicode_4BYTE_DATA) and those that report on it (PyUnicode_KIND) call the ready slot, which causes all the fields in the string to be filled out.
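
To make the shape of this idea concrete, here is a minimal standalone sketch in plain C. It does not use or mirror CPython's real data structures: `lazy_str`, `ready_slot`, `release_slot`, `lazy_str_data`, and the `demo_*` backend are all hypothetical names invented for illustration, and the "foreign" storage is just a C string standing in for a Java string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model only: none of these names exist in CPython. */
typedef struct lazy_str lazy_str;
typedef int  (*ready_slot)(lazy_str *s);    /* fills data/length; 0 on success */
typedef void (*release_slot)(lazy_str *s);  /* lets go of the foreign storage */

struct lazy_str {
    char        *data;        /* NULL until the string has been readied */
    size_t       length;
    void        *external;    /* pointer to foreign (e.g. Java) storage */
    ready_slot   ready;       /* NULL for ordinary, non-abstract strings */
    release_slot release;     /* NULL for ordinary, non-abstract strings */
    int          ready_calls; /* bookkeeping for the demonstration only */
};

/* The per-use cost: one slot check, plus a transfer on first access. */
int lazy_str_ensure_ready(lazy_str *s)
{
    assert(s != NULL);
    if (s->ready == NULL || s->data != NULL)
        return 0;             /* ordinary string, or already transferred */
    return s->ready(s);
}

/* Accessors call the ready check before touching the data. */
const char *lazy_str_data(lazy_str *s)
{
    return lazy_str_ensure_ready(s) == 0 ? s->data : NULL;
}

/* Destruction releases the external memory and any local copy. */
void lazy_str_destroy(lazy_str *s)
{
    if (s->release != NULL)
        s->release(s);
    free(s->data);
    s->data = NULL;
    s->length = 0;
}

/* Demo backend: copies a NUL-terminated C string on first use. */
int demo_ready(lazy_str *s)
{
    const char *src = s->external;
    size_t n = strlen(src);
    char *copy = malloc(n + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, src, n + 1);
    s->data = copy;
    s->length = n;
    s->ready_calls++;
    return 0;
}

void demo_release(lazy_str *s)
{
    s->external = NULL;       /* a real binding would drop its foreign reference */
}
```

In this model, a string created with `ready = demo_ready` and `external` pointing at foreign text is transferred only on the first `lazy_str_data` call; later calls return the cached copy, and strings with a NULL `ready` slot pay only the pointer comparison.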

gvanrossum commented 1 year ago

I'm not sure I follow all that, but fortunately, this issue tracker is not for solutions but for problems, and the problem seems clear enough.

encukou commented 1 year ago

I think this is very solvable, thanks to Inada-san's work in improving the PyUnicode API. (And I'd enjoy solving it, but can't fit it in my priorities.)

FWIW, this would enable adding performant (but somewhat tricky to use) API for zero-copy strings e.g. from mapped files or from/to languages like Rust.

ronaldoussoren commented 11 months ago

FWIW: I also ran into this with PyObjC, which contains a subtype of PyUnicode_Type just to be able to use Objective-C strings transparently with extension functions that expect a string argument.

That subtype is inherently fragile because its implementation uses implementation details of PyUnicode_Type. Luckily that implementation hasn't seen a lot of changes so far, other than the migration to the current representation earlier in Python 3's development.

A problem with integration could be the representation of foreign strings, e.g. Java and Objective-C strings logically are UCS2 while Python's string is UCS4. That can probably be solved by using UTF-8 in a hypothetical string protocol.
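
As a rough illustration of what "using UTF-8" would involve on the binding side, the sketch below transcodes 16-bit code units to UTF-8 in plain C. It assumes the foreign string exposes UTF-16 code units (including surrogate pairs); `utf16_to_utf8` is a hypothetical helper written for this example, not part of Python's C API or of any binding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode UTF-16 code units as UTF-8.
 * Returns the number of bytes written, or (size_t)-1 if the input
 * contains an unpaired surrogate or the output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t in_len,
                     unsigned char *out, size_t out_cap)
{
    size_t o = 0;
    assert(in != NULL || in_len == 0);
    for (size_t i = 0; i < in_len; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= in_len || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;  /* low surrogate without a preceding high one */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out_cap - o < need)
            return (size_t)-1;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

The point of the sketch is that the conversion is a linear pass that allocates a second buffer, so a UTF-8-based protocol would still copy the data once rather than share it.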