capnproto / pycapnp

Cap'n Proto serialization/RPC system - Python bindings

Proposal: Reduce dependency on file descriptors #311

Open · LasseBlaauwbroek opened 1 year ago

LasseBlaauwbroek commented 1 year ago

Pycapnp mostly works by passing file descriptors into the capnproto C++ library. This causes some problems:

To remedy this, I propose that instead of passing file descriptors to capnproto, we create wrappers around Python's IO facilities.

With these facilities, all classes like TwoPartyClient and TwoPartyServer will be modified to receive one of these wrapper classes. That should resolve all of the aforementioned problems.
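For illustration, here is a minimal sketch of what such a wrapper could look like (the class and method names are hypothetical, not an actual pycapnp API):

```python
import asyncio

# Hypothetical wrapper; names are illustrative only. The idea is that the
# C++ layer would call these coroutines instead of reading/writing a raw
# file descriptor itself.
class AsyncIoWrapper:
    def __init__(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
        self._reader = reader
        self._writer = writer

    async def read(self, max_bytes: int) -> bytes:
        # May return fewer bytes than requested; b"" signals EOF.
        return await self._reader.read(max_bytes)

    async def write(self, data: bytes) -> None:
        self._writer.write(data)
        await self._writer.drain()

    async def close(self) -> None:
        self._writer.close()
        await self._writer.wait_closed()
```

A TwoPartyClient could then accept such a wrapper built from asyncio.open_connection(), an SSL-wrapped stream, or anything else that exposes the same coroutines.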

A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If this is a problem, I guess we can also keep the file descriptor API (with its known deficiencies). The downside of that is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).

kentonv commented 1 year ago

This is @haata's call but FWIW this plan makes a lot of sense to me.

haata commented 1 year ago

Yeah, I think this makes sense (it definitely would make things a lot more straightforward).

> A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If this is a problem, I guess we can also keep the file descriptor API (with its known deficiencies). The downside of that is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).

I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly. If done the right way, you shouldn't have to pay the Python interpreter tax (or we may need to add a Python dependency with a C version of that specific API).

Before you spend too much effort, it might be worth measuring the performance of read/write operations with some small POC code.
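Something like the following could serve as that POC (a rough sketch; it assumes a pre-existing test file, and the numbers will of course vary by platform):

```python
import os
import time

PATH = "testdata.bin"  # assumed to exist; any large file will do
CHUNK = 64 * 1024

def bench_file_object() -> float:
    # Reads through an unbuffered Python file object.
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:
        while f.read(CHUNK):
            pass
    return time.perf_counter() - start

def bench_raw_fd() -> float:
    # Reads directly on the raw descriptor, bypassing the io layer.
    start = time.perf_counter()
    fd = os.open(PATH, os.O_RDONLY)
    try:
        while os.read(fd, CHUNK):
            pass
    finally:
        os.close(fd)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"file object: {bench_file_object():.4f}s")
    print(f"raw fd:      {bench_raw_fd():.4f}s")
```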

kentonv commented 1 year ago

Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?
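A conservative version of that detection might look something like this (a sketch; the helper is hypothetical and deliberately refuses anything it cannot prove is fd-backed):

```python
import io
import socket

def raw_fd_or_none(stream):
    """Return the stream's fd only when it is demonstrably a plain OS file
    or socket; wrapper objects may expose a misleading fileno()."""
    if isinstance(stream, (socket.socket, io.FileIO)):
        return stream.fileno()
    if (isinstance(stream, (io.BufferedReader, io.BufferedWriter))
            and isinstance(stream.raw, io.FileIO)):
        return stream.raw.fileno()
    return None  # fall back to the Python-level stream path
```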

LasseBlaauwbroek commented 1 year ago

> I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly.

I'm not aware of any C-level or Cython APIs for either file objects or sockets. Everything goes through io and socket for sync communication, and transports and protocols or streams for async communication. None of those seem to expose a C-level API. And there is certainly no low-level API for the gzip and SSL wrappers.

> Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?

It is certainly possible to just pass a raw fd (int) into the library and use it if you need maximum performance. Whether or not we can auto-detect that a stream is backed by a raw FD is tricky. For example, gzip.open returns a file object with a fileno() function, but that returns the file descriptor of the underlying compressed file, which is unusable. So any such auto-detection would have to use a pretty conservative heuristic. Even then, it would still leave the problems in synchronous mode, where KJ file reads cannot be interrupted, as well as the cross-platform issues.
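To make the gzip pitfall concrete (the file name here is made up):

```python
import gzip
import os

# GzipFile.fileno() returns the descriptor of the *compressed* file
# underneath it, so reading that fd raw yields gzip bytes, not payload.
with open("demo.gz", "wb") as raw:
    raw.write(gzip.compress(b"hello capnp"))

with gzip.open("demo.gz", "rb") as f:
    print(f.read())        # b'hello capnp' -- the decompressed payload
    fd = f.fileno()        # fd of the compressed file on disk
    os.lseek(fd, 0, os.SEEK_SET)
    print(os.read(fd, 2))  # b'\x1f\x8b' -- gzip magic bytes, not payload
```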

Personally, I would be inclined to forgo such heuristics unless the resulting performance is absolutely abysmal. Python is a notoriously slow language, and most Python developers prefer convenience over speed (to my dismay). People who want maximum speed can still just pass in a raw fd int.

kentonv commented 1 year ago

> For example, gzip.open returns a file object with a fileno() function, but that returns the file descriptor of the underlying compressed file, which is unusable.

Oh man. Yeah that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package but I suppose there's nothing anyone can do about that now.

LasseBlaauwbroek commented 1 year ago

Now that #312 exists, I'd like to take this proposal one step further: I propose entirely removing the ability to run capnproto RPC without Python's asyncio loop. In my opinion, the existing functionality does not fit capnproto's philosophy, which is fundamentally based on a well-integrated event loop, and it leads to many bugs.

Take for example thread_server.py and thread_client.py:

When we remove this functionality, lots of other code can also be removed. So, I propose the following:

This would obviously be a large breaking change. But #310 is already breaking and I think we might as well double down. I don't personally have the cycles to ensure a lengthy deprecation phase. And I suspect nobody does...
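For a sense of what an asyncio-only surface might look like, here is a speculative sketch (names like AsyncIoStream and capnp.run are assumptions about the eventual API, not a settled design):

```python
import asyncio
import capnp

async def main():
    # Hypothetical: connect via a stream living on Python's own asyncio
    # loop instead of handing a raw fd to the KJ event loop.
    stream = await capnp.AsyncIoStream.create_connection(host="localhost", port=60000)
    client = capnp.TwoPartyClient(stream)
    # ... bootstrap a capability and make calls here ...

asyncio.run(capnp.run(main()))
```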

Thoughts?

kentonv commented 1 year ago

I don't know anything about Python asyncio but abstractly speaking this proposal sounds like the right thing.

haata commented 1 year ago

Yeah, this is a breaking change. But I agree with it. Asyncio was the conclusion I reached after lots of fighting with libcapnproto file descriptors.

If we're going to be incrementing to v2.0.0 anyway, this is the time to do it. Making pycapnp more Pythonic helps everyone in the long run.

The simplification is also a huge pro IMO.

LasseBlaauwbroek commented 1 year ago

In that case, I'd suggest that a v2.0.0 feature branch be made and that, after proper review, my open PRs are merged there.

LasseBlaauwbroek commented 1 year ago

With #315, I believe the only thing remaining in this issue is to clean up the _MessageReader and MessageWriter class hierarchy, which is involved in reading and writing messages from disk. Those classes currently still rely on naked file descriptors and need to be rewritten to use Python's own file objects.
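In the meantime, one way to route message IO through arbitrary Python file objects is to go through bytes (a sketch; it assumes an addressbook.capnp schema containing an AddressBook struct):

```python
import gzip
import capnp

addressbook = capnp.load("addressbook.capnp")  # assumed schema file

book = addressbook.AddressBook.new_message()

# Write through any Python file object (here gzip); no fileno() involved.
with gzip.open("book.capnp.gz", "wb") as f:
    f.write(book.to_bytes())

# Read back the same way.
with gzip.open("book.capnp.gz", "rb") as f:
    book2 = addressbook.AddressBook.from_bytes(f.read())
```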

fungs commented 5 months ago

> For example, gzip.open returns a file object with a fileno() function, but that returns the file descriptor of the underlying compressed file, which is unusable.

> Oh man. Yeah that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package but I suppose there's nothing anyone can do about that now.

There is a bug for this in the Python issue tracker: https://github.com/python/cpython/issues/100066. I hope it will go away, but a similar bug has existed since 2015.

fungs commented 5 months ago

I'm also looking forward to having a simple way to use (sync) file object wrappers to read and write messages. This is important for integration with other Python functionality for compression, encryption, databases, etc. My current, ugly workaround is to spawn a separate process that reads the source data stream and feeds it into a pipe or socket, which is then consumed in the parent process. UNIX pipes and sockets are low-level OS tools that provide a working fileno() method in Python. This approach seems to have issues when run in a thread instead of a process (#354).
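A stripped-down version of that workaround might look like this (a hypothetical helper; it assumes a Unix fork start method so the pipe's write end is inherited by the child):

```python
import gzip
import multiprocessing
import os

def _feed(path, write_fd):
    # Child process: decompress the source and stream it into the pipe.
    with gzip.open(path, "rb") as src, os.fdopen(write_fd, "wb") as sink:
        while chunk := src.read(64 * 1024):
            sink.write(chunk)

def open_via_pipe(path):
    read_fd, write_fd = os.pipe()
    multiprocessing.Process(target=_feed, args=(path, write_fd), daemon=True).start()
    os.close(write_fd)               # parent keeps only the read end
    return os.fdopen(read_fd, "rb")  # its fileno() is a genuine pipe fd
```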