Open LasseBlaauwbroek opened 1 year ago
This is @haata's call but FWIW this plan makes a lot of sense to me.
Yeah, I think this makes sense (it definitely would make things a lot more straightforward).
A potential downside is that everything might slow down, because read and write operations have to be routed through the Python interpreter. I haven't measured the impact of this. If this is a problem, I guess we can also keep the file descriptor API (with its known deficiencies). A downside of that is that the file descriptor API requires a fairly large amount of code to work (see asyncProvider.cpp in #310).
I haven't investigated these APIs specifically, but it's usually possible to call the CPython APIs directly. If done the right way, you shouldn't have to pay the Python interpreter tax (or we may need to add a Python dependency with a C version of that specific API).
Before you spend too much effort, it might be worth measuring the performance of read/write operations with some small POC code.
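A POC along those lines can be as small as the following stdlib-only sketch, which times reads routed through a Python file object against raw `os.read()` calls on a pipe fd. All names here are ours, and the numbers only give a rough sense of the relative overhead on one machine:

```python
import os
import time

def bench(read_once, n=10_000):
    """Time n invocations of read_once and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        read_once()
    return time.perf_counter() - start

r, w = os.pipe()
payload = b"x" * 64

def raw_read():
    # Write then read 64 bytes straight through syscalls on the fds.
    os.write(w, payload)
    os.read(r, 64)

# Unbuffered file object over a duplicate of the read end.
fobj = os.fdopen(os.dup(r), "rb", buffering=0)

def fileobj_read():
    # Same round trip, but the read goes through the io file object layer.
    os.write(w, payload)
    fobj.read(64)

t_raw = bench(raw_read)
t_obj = bench(fileobj_read)
print(f"raw fd: {t_raw:.4f}s  file object: {t_obj:.4f}s")
fobj.close()
os.close(r)
os.close(w)
```

This only measures the Python-side io layer, not the additional cost of C++ calling back into the interpreter, so it is a lower bound on the overhead being discussed.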
Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?
I haven't investigated these APIs specifically, but it's usually possible to call the CPython apis directly.
I'm not aware of any C-level or Cython APIs for either file objects or sockets. Everything goes through io and socket for sync communication, and transports and protocols or streams for async communication. None of those seem to expose a C-level API. But then, there is certainly no low-level API for the gzip and SSL wrappers.
Surely you can have an optimization where you detect if the stream is backed by a raw FD and, if so, use it raw, but if not, then use the streams that call back into Python?
It is certainly possible to just pass a raw fd (int) into the library, and use it if you need maximum performance. Whether or not we can auto-detect that a stream is backed by a raw FD is tricky. For example, gzip.open returns a file object with a fileno() function. But that returns the file descriptor of the underlying compressed file, which is unusable. So one would have to use a pretty conservative heuristic for such auto-detection. Even then, this would still cause issues in synchronous mode, where kj file reads cannot be interrupted, as well as the cross-platform issues.
Personally, I would be inclined to forego such heuristics unless the resulting performance is absolutely abysmal. Python is notoriously slow, and most Python developers prefer convenience over speed (to my dismay). The people that want maximum speed can still just pass in a raw fd int.
For example, gzip.open returns a file object with a fileno() function. But that returns the file descriptor of the underlying compressed file, which is unusable.
Oh man. Yeah, that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package, but I suppose there's nothing anyone can do about that now.
Now that #312 exists, I'd like to take this proposal one step further. I propose to entirely remove the ability to run capnproto RPC without Python's asyncio loop. In my opinion, the existing functionality does not work with capnproto's philosophy, which is fundamentally based on a well-integrated event loop, and leads to many bugs.
Take for example thread_server.py and thread_client.py:
- They crash with FATAL: exception not rethrown, which is difficult to impossible to fix AFAIK.
- Writing code with .then() and lambdas is awkward at best.

When we remove this functionality, lots of other code can also be removed. So, I propose the following:
- Remove _RemotePromise.wait() and related methods. Use await promise instead, in async code.
- Remove _RemotePromise.then(). The new functionality in #312 can be used.
- Remove the _Promise and _VoidPromise wrappers, as well as _PromiseFulfillerPair and join_promises and friends. They are no longer needed. The only reason to keep _RemotePromise around is for promise pipelining.
- TwoPartyClient and TwoPartyServer will only accept an _AsyncIoStream as their stream. No socket or raw fd.

This would obviously be a large breaking change. But #310 is already breaking, and I think we might as well double down. I don't personally have the cycles to ensure a lengthy deprecation phase. And I suspect nobody does...
Thoughts?
I don't know anything about Python asyncio but abstractly speaking this proposal sounds like the right thing.
Yeah, this is a breaking change. But I agree with it. Asyncio was the conclusion I reached after lots of fighting with libcapnproto file descriptors.
If we're going to be incrementing to v2.0.0 anyways, this is the time to do it. Making pycapnp more pythonic helps everyone in the long run.
The simplification is also a huge pro IMO.
In that case, I'd suggest that a v2.0.0 feature branch be made and that, after proper review, my open PRs are merged there.
With #315, I believe that the only thing remaining in this issue is to clean up the _MessageReader and MessageWriter class hierarchy, which is involved in reading and writing messages from disk. Those classes currently still rely on naked file descriptors and need to be rewritten to use Python's own file objects.
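What such a rewrite buys can be sketched without pycapnp at all: once readers and writers accept any Python file object, a length-prefixed message frames straight through e.g. a gzip wrapper, with no usable fd required. The framing functions below are hypothetical illustrations, not the real pycapnp message format:

```python
import gzip
import os
import struct
import tempfile

def write_message(f, body: bytes) -> None:
    # Toy framing: 4-byte little-endian length, then the body.
    f.write(struct.pack("<I", len(body)) + body)

def read_message(f) -> bytes:
    # Read the length prefix, then exactly that many body bytes.
    (length,) = struct.unpack("<I", f.read(4))
    return f.read(length)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

# The reader/writer never see a file descriptor, only file objects,
# so compression (or encryption, etc.) composes transparently.
with gzip.open(path, "wb") as f:
    write_message(f, b"capnp payload")
with gzip.open(path, "rb") as f:
    msg = read_message(f)
os.unlink(path)
print(msg)  # b'capnp payload'
```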
For example, gzip.open returns a file object with a fileno() function. But that returns the file descriptor of the underlying compressed file, which is unusable.
Oh man. Yeah, that makes fileno() essentially unusable. IMO that's an egregious bug in the gzip package, but I suppose there's nothing anyone can do about that now.
There is a bug for this in the Python issue tracker: https://github.com/python/cpython/issues/100066 . I hope it will go away, but a similar bug has existed since 2015.
I'm also looking forward to having a simple way to use (sync) file object wrappers to read and write messages. This is important for integration with other Python functionality for compression, encryption, databases, etc. My current, ugly workaround is to spawn a separate process that reads the source data stream and feeds it into a pipe or socket, which is then consumed in the parent process. UNIX pipes and sockets are low-level OS tools that provide a working fileno() method in Python. There seem to be issues with this approach in a thread instead of a process (#354).
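The shape of that workaround can be sketched with the stdlib only. A helper decompresses the source and pushes plain bytes through an os.pipe(), whose read end has a real, usable fileno() that could then be handed to pycapnp. For simplicity this sketch uses a thread as the feeder (the commenter's actual workaround uses a process, due to #354); all names are ours:

```python
import gzip
import os
import threading

# Compressed payload standing in for an on-disk data source (assumption).
payload = gzip.compress(b"capnp message bytes")

def feed(write_fd: int, blob: bytes) -> None:
    # Decompress in the helper and write the plain bytes into the pipe;
    # closing the write end signals EOF to the reader.
    with os.fdopen(write_fd, "wb") as sink:
        sink.write(gzip.decompress(blob))

read_fd, write_fd = os.pipe()
t = threading.Thread(target=feed, args=(write_fd, payload))
t.start()

# The parent consumes the pipe; read_fd is a genuine OS-level fd here.
with os.fdopen(read_fd, "rb") as source:
    data = source.read()
t.join()
print(data)  # b'capnp message bytes'
```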
Pycapnp mostly works by passing file descriptors into the capnproto C++ library. This causes some problems:
To remedy this, I propose that instead of passing file descriptors to capnproto, we create wrappers around Python's IO facilities:
- A kj::InputStream and kj::OutputStream that wrap a Python file object.
- A kj::AsyncInputStream and kj::AsyncOutputStream that wrap a StreamReader and StreamWriter.

With these facilities, all classes like TwoPartyClient and TwoPartyServer will be modified to receive one of these wrapper classes. That should resolve all of the aforementioned problems.
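For clarity, the proposed wrapper can be sketched in Python (the real implementation would be a C++ kj::InputStream subclass holding a reference to the Python file object; the class and method names below are hypothetical):

```python
import io

class PyFileInputStream:
    """Sketch of the proposed wrapper: forward stream reads to a
    Python file object, following kj-style read semantics."""

    def __init__(self, fileobj):
        self._f = fileobj

    def try_read(self, min_bytes: int, max_bytes: int) -> bytes:
        # kj semantics: return at least min_bytes unless EOF is hit,
        # and never more than max_bytes.
        buf = bytearray()
        while len(buf) < min_bytes:
            chunk = self._f.read(max_bytes - len(buf))
            if not chunk:
                break  # EOF
            buf += chunk
        return bytes(buf)

# Works with anything file-like: BytesIO, gzip/ssl wrappers, sockets...
stream = PyFileInputStream(io.BytesIO(b"abcdef"))
chunk = stream.try_read(4, 6)
print(chunk)  # b'abcdef'
```

Because the wrapper only assumes a read() method, the gzip and SSL file objects discussed above would work without exposing any file descriptor at all.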