capnproto / pycapnp

Cap'n Proto serialization/RPC system - Python bindings

Proposal: Reduce dependency on file descriptors #311

Open · LasseBlaauwbroek opened 1 year ago

LasseBlaauwbroek commented 1 year ago

Pycapnp mostly works by passing file descriptors into the capnproto C++ library. This causes some problems:

To remedy this, I propose that instead of passing file descriptors to capnproto, we create wrappers around Python's IO facilities.

With these facilities, all classes like TwoPartyClient and TwoPartyServer will be modified to receive one of these wrapper classes. That should resolve all of the aforementioned problems.
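For illustration, here is a minimal sketch of what such a wrapper could look like (the class and method names are hypothetical, not an actual pycapnp API):

```python
import asyncio

# Hypothetical wrapper; names are illustrative only. The idea is that the
# C++ layer would call these coroutines instead of reading/writing a raw
# file descriptor itself.
class AsyncIoWrapper:
    def __init__(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
        self._reader = reader
        self._writer = writer

    async def read(self, max_bytes: int) -> bytes:
        # May return fewer bytes than requested; b"" signals EOF.
        return await self._reader.read(max_bytes)

    async def write(self, data: bytes) -> None:
        self._writer.write(data)
        await self._writer.drain()

    async def close(self) -> None:
        self._writer.close()
        await self._writer.wait_closed()
```

A TwoPartyClient could then accept such a wrapper built from asyncio.open_connection(), an SSL-wrapped stream, or anything else that exposes the same coroutines.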

A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If this is a problem, I guess we can also keep the file descriptor API (with its known deficiencies). The downside of that is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).

kentonv commented 1 year ago

This is @haata's call but FWIW this plan makes a lot of sense to me.

haata commented 1 year ago

Yeah, I think this makes sense (it definitely would make things a lot more straightforward).

> A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If this is a problem, I guess we can also keep the file descriptor API (with its known deficiencies). The downside of that is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).

I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly. If done the right way, you shouldn't have to pay the Python interpreter tax (or we may need to add a Python dependency with a C version of that specific API).

Before you spend too much effort, it might be worth measuring the performance of read/write operations with some small POC code.
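Something like the following could serve as that POC (a rough sketch; it assumes a pre-existing test file, and the numbers will of course vary by platform):

```python
import os
import time

PATH = "testdata.bin"  # assumed to exist; any large file will do
CHUNK = 64 * 1024

def bench_file_object() -> float:
    # Reads through an unbuffered Python file object.
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:
        while f.read(CHUNK):
            pass
    return time.perf_counter() - start

def bench_raw_fd() -> float:
    # Reads directly on the raw descriptor, bypassing the io layer.
    start = time.perf_counter()
    fd = os.open(PATH, os.O_RDONLY)
    try:
        while os.read(fd, CHUNK):
            pass
    finally:
        os.close(fd)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"file object: {bench_file_object():.4f}s")
    print(f"raw fd:      {bench_raw_fd():.4f}s")
```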

kentonv commented 1 year ago

Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?
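A conservative version of that detection might look something like this (a sketch; the helper is hypothetical and deliberately refuses anything it cannot prove is fd-backed):

```python
import io
import socket

def raw_fd_or_none(stream):
    """Return the stream's fd only when it is demonstrably a plain OS file
    or socket; wrapper objects may expose a misleading fileno()."""
    if isinstance(stream, (socket.socket, io.FileIO)):
        return stream.fileno()
    if (isinstance(stream, (io.BufferedReader, io.BufferedWriter))
            and isinstance(stream.raw, io.FileIO)):
        return stream.raw.fileno()
    return None  # fall back to the Python-level stream path
```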

LasseBlaauwbroek commented 1 year ago

> I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly.

I'm not aware of any C-level or Cython APIs for either file objects or sockets. Everything goes through io and socket for sync communication, and transports and protocols or streams for async communication. None of those seem to expose a C-level API. And there is certainly no low-level API for the gzip and SSL wrappers.

> Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?

It is certainly possible to just pass a raw fd (int) into the library and use it if you need maximum performance. Whether or not we can auto-detect that a stream is backed by a raw FD is tricky. For example, gzip.open returns a file object with a fileno() function, but that returns the file descriptor of the underlying compressed file, which is unusable. So any such auto-detection would have to use a pretty conservative heuristic. Even then, it would still leave the problems in synchronous mode, where KJ file reads cannot be interrupted, as well as the cross-platform issues.
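To make the gzip pitfall concrete (the file name here is made up):

```python
import gzip
import os

# GzipFile.fileno() returns the descriptor of the *compressed* file
# underneath it, so reading that fd raw yields gzip bytes, not payload.
with open("demo.gz", "wb") as raw:
    raw.write(gzip.compress(b"hello capnp"))

with gzip.open("demo.gz", "rb") as f:
    print(f.read())        # b'hello capnp' -- the decompressed payload
    fd = f.fileno()        # fd of the compressed file on disk
    os.lseek(fd, 0, os.SEEK_SET)
    print(os.read(fd, 2))  # b'\x1f\x8b' -- gzip magic bytes, not payload
```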

Personally, I would be inclined to forgo such heuristics unless the resulting performance is absolutely abysmal. Python is a notoriously slow language, and most Python developers prefer convenience over speed (to my dismay). People who want maximum speed can still just pass in a raw fd int.

kentonv commented 1 year ago

> For example, gzip.open returns a file object with a fileno() function, but that returns the file descriptor of the underlying compressed file, which is unusable.

Oh man. Yeah that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package but I suppose there's nothing anyone can do about that now.

LasseBlaauwbroek commented 1 year ago

Now that #312 exists, I'd like to take this proposal one step further: I propose entirely removing the ability to run capnproto RPC without Python's asyncio loop. In my opinion, the existing functionality does not fit capnproto's philosophy, which is fundamentally based on a well-integrated event loop, and it leads to many bugs.

Take for example thread_server.py and thread_client.py:

When we remove this functionality, lots of other code can also be removed. So, I propose the following:

This would obviously be a large breaking change. But #310 is already breaking and I think we might as well double down. I don't personally have the cycles to ensure a lengthy deprecation phase. And I suspect nobody does...
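For a sense of what an asyncio-only surface might look like, here is a speculative sketch (names like AsyncIoStream and capnp.run are assumptions about the eventual API, not a settled design):

```python
import asyncio
import capnp

async def main():
    # Hypothetical: connect via a stream living on Python's own asyncio
    # loop instead of handing a raw fd to the KJ event loop.
    stream = await capnp.AsyncIoStream.create_connection(host="localhost", port=60000)
    client = capnp.TwoPartyClient(stream)
    # ... bootstrap a capability and make calls here ...

asyncio.run(capnp.run(main()))
```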

Thoughts?

kentonv commented 1 year ago

I don't know anything about Python asyncio but abstractly speaking this proposal sounds like the right thing.

haata commented 1 year ago

Yeah, this is a breaking change. But I agree with it. Asyncio was the conclusion I reached after lots of fighting with libcapnproto file descriptors.

If we're going to be incrementing to v2.0.0 anyway, this is the time to do it. Making pycapnp more Pythonic helps everyone in the long run.

The simplification is also a huge pro IMO.

LasseBlaauwbroek commented 1 year ago

In that case, I'd suggest that a v2.0.0 feature branch be made and that, after proper review, my open PRs are merged there.

LasseBlaauwbroek commented 1 year ago

With #315, I believe the only thing remaining in this issue is to clean up the _MessageReader and MessageWriter class hierarchy, which is involved in reading and writing messages from disk. Those classes currently still rely on naked file descriptors and need to be rewritten to use Python's own file objects.
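In the meantime, one way to route message IO through arbitrary Python file objects is to go through bytes (a sketch; it assumes an addressbook.capnp schema containing an AddressBook struct):

```python
import gzip
import capnp

addressbook = capnp.load("addressbook.capnp")  # assumed schema file

book = addressbook.AddressBook.new_message()

# Write through any Python file object (here gzip); no fileno() involved.
with gzip.open("book.capnp.gz", "wb") as f:
    f.write(book.to_bytes())

# Read back the same way.
with gzip.open("book.capnp.gz", "rb") as f:
    book2 = addressbook.AddressBook.from_bytes(f.read())
```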

fungs commented 5 months ago

> For example, gzip.open returns a file object with a fileno() function, but that returns the file descriptor of the underlying compressed file, which is unusable.

> Oh man. Yeah that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package but I suppose there's nothing anyone can do about that now.

There is a bug for this in the Python issue tracker: https://github.com/python/cpython/issues/100066. I hope it will go away, but a similar bug has existed since 2015.

fungs commented 5 months ago

I'm also looking forward to having a simple way to use (sync) file object wrappers to read and write messages. This is important for integration with other Python functionality for compression, encryption, databases, etc. My current, ugly workaround is to spawn a separate process that reads the source data stream and feeds it into a pipe or socket, which is then consumed in the parent process. UNIX pipes and sockets are low-level OS tools that provide a working fileno() method in Python. This approach seems to have issues when run in a thread instead of a process (#354).
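A stripped-down version of that workaround might look like this (a hypothetical helper; it assumes a Unix fork start method so the pipe's write end is inherited by the child):

```python
import gzip
import multiprocessing
import os

def _feed(path, write_fd):
    # Child process: decompress the source and stream it into the pipe.
    with gzip.open(path, "rb") as src, os.fdopen(write_fd, "wb") as sink:
        while chunk := src.read(64 * 1024):
            sink.write(chunk)

def open_via_pipe(path):
    read_fd, write_fd = os.pipe()
    multiprocessing.Process(target=_feed, args=(path, write_fd), daemon=True).start()
    os.close(write_fd)               # parent keeps only the read end
    return os.fdopen(read_fd, "rb")  # its fileno() is a genuine pipe fd
```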