Closed LasseBlaauwbroek closed 1 year ago
Very cool to see this, I haven't looked that closely but from the description this sounds like the right way to solve this problem. It's nice to see it is finally happening. I'm pretty overloaded so will have to mostly leave this review to @haata but let me know if there are any specific questions I can help answer.
Most of the code in capnp/helpers/asyncProvider.cpp was taken from @kentonv's nodejs implementation. So I'm hereby asking for permission to copy that. If that's okay, let me know what kind of attribution you might want.
The license is BSD 2-clause, which does not require permission: https://github.com/capnproto/node-capnp/blob/node14/LICENSE
I am not a lawyer and this is not legal advice, but what I typically do is put a comment in the code saying something like "derived from X, which has the following license:" then reproduce the copyright and license file.
Thanks @kentonv. The licensing is a bit tricky, because the file this is taken from lists MIT at the top, while the repository license specifies BSD: https://github.com/capnproto/node-capnp/blob/node10/src/node-capnp/capnp.cc
Oh interesting, apparently I changed the license in: https://github.com/capnproto/node-capnp/commit/9b3e84b11164fef2a5e3976c993e153a8a911777
But I forgot to update the LICENSE file.
The two licenses are functionally equivalent. I'm happy with you citing either one of them.
Nice! I'll try to take some time this week to play around with the code as well. Some things to watch out for
Okay, everything is basically functional now. The main problem I'm trying to deal with now is the clean shutdown of the event loop. This leads to some problems in clients running this interface: The client connects to a server and registers a StatusSubscriber
interface with the server. The server then keeps calling this interface even when the client is already done with its "main" actions. When I just exit the asyncio loop, things error out because the KJ loop is still running...
@kentonv could you suggest how to shut that down gracefully? I was looking at TwoPartyVatNetwork::shutdown()
but that is a private function...
All tests now pass on linux and macos. Windows still fails due to usage of unix-specific apis. But otherwise, this is probably ready for a round of review.
Yeah, it looks like something like this for Windows https://stackoverflow.com/questions/20543940/where-do-i-get-arpa-inet-h
@kentonv, I'm still having trouble cleanly shutting down the event loop. I cannot control when Python's event loop shuts down, and it makes sense to me to shut down the kj event loop as well at that point. Even if some promises are still running. Otherwise, errors will occur.
The problem here, is that after shutting down the loop, references to objects like TwoPartyClient
, DynamicCapability::Client
and other may still exist. I cannot control that. As a result, when the WaitScope
object I have allocated is destroyed before such objects are destroyed, I tend to get segmentation faults. It seems that the destructors of these objects assume that a running kj loop exists.
Is it just plain impossible to shut down the event loop while objects that are using the loop still exist?
Is it just plain impossible to shut down the event loop while objects that are using the loop still exist?
Yes, this leads to dangling pointers and use-after-free.
I haven't looked at your code so this may or may not make sense, but I imagine the solution would be something like:
@kentonv that makes sense. However:
[..] your EventPort implementation should respond by causing all current and future promises to reject with an exception
What is the best way to do this? As far as I can see, neither the EventPort nor the EventLoop can wholesale cancel all promises. Should I create a giant kj::Canceller
owned by the EventPort, and wrap every promise ever created into that canceller?
Sorry, to clarify, by "all current and future promises" I specifically meant promises created by the EventPort itself to represent fundamental events. Promises created by the app need not be canceled explicitly, as long as the EventLoop still exists.
@kentonv I just tried my best interpretation of your approach:
[..] your EventPort implementation should respond by causing all current and future promises to reject with an exception. [..] I specifically meant promises created by the EventPort itself to represent fundamental events.
I'm not exactly sure what you mean by 'fundamental events'. Are you talking about events in the Python loop? Those have already been cancelled during the closing of the event loop...
Basically, what I need to decide is how to respond to calls of EventPort::setRunnable()
and friends when no Pyhon loop exists anymore. I see three options:
setRunnable
is invoked by C++ code, which doesn't understand Python exceptions (what happens in practice is that a warning about the exception gets printed to stdout).setRunnable()
without scheduling any event. I'm not sure what this means for the objects that rely on the loop? Will this mean that the destructors of KJ I/O classes will potentially hang indefinitely?I'm not exactly sure what you mean by 'fundamental events'.
I mean the promises that the EventPort itself returns. As opposed to other promises created by the app derived from those. My point here is you do not need to cancel and destroy every existing promise, you just need everything which is specifically waiting on an event from the EventPort to receive an exception.
Presumably, once the Python event loop is gone, then the KJ event loop can only continue until it is empty. Once it's empty, then there's no way for it to become non-empty again, because no further events can ever be delivered... I think? I guess if Python code is still running it can still call pycapnp in a way that schedules new events, but maybe the answer is to have all the Python functions throw Python exceptions if the event loop no longer exists?
Will this mean that the destructors of KJ I/O classes will potentially hang indefinitely?
No, destructors never wait for the loop. They might queue an event, but they expect that event to run later, they don't wait for it.
I mean the promises that the EventPort itself returns.
Unfortunately, I still don't understand what that means. As far as I can tell, the EventPort
API never returns a kj::Promise
. Or any other promise for that matter. The only thing it does is schedule an event to be run in the Python loop...
Presumably, once the Python event loop is gone, then the KJ event loop can only continue until it is empty.
I think that if (1) the Python event loop is gone and (2) the KJ event loop is non-empty, the KJ event loop will never become empty again in a non-exceptional way. Once Python's loop is gone, I have no way of scheduling events through EventPort::setRunnable
anymore. So the only way for the loop to become empty is if all the promises are canceled. But I guess that's fine. When Python's asyncio loop stops, it cancels all coroutines that run in the background, so that should cancel most KJ promises. For anything else, we can expect the user to clean up after himself (or just exit the program).
I guess if Python code is still running it can still call pycapnp in a way that schedules new events, but maybe the answer is to have all the Python functions throw Python exceptions if the event loop no longer exists?
Yes, that is my plan.
No, destructors never wait for the loop.
Thanks, that's good to know.
As far as I can tell, the EventPort API never returns a kj::Promise. Or any other promise for that matter.
OK I see the confusion. kj::UnixEventPort
does sport an API for listening on various low-level events, e.g. waiting for a file descriptor to become readable or writable. Everything else (e.g. the AsyncIoProivder implementation) is built on top of that. It has to be that way because UnixEventPort
implements the listening, via epoll, so it has to know about all events it's listening for.
Your code here is a bit different, the AsyncIoProvider implementation uses the Python APIs directly to listen for events, so the EventPort
is sort of a side object that's just used to control the event loop.
Anyway, what I meant was, for example, the promise OwnedFileDescriptor::onReadable()
should be designed to reject with an exception if the Python event loop shuts down. Probably the easiest way to accomplish this is to have a kj::Canceler
that you use to wrap the returned promise, and cancel it at shutdown.
OTOH, you do NOT need to wrap the promise returned by, say, PyIoStream::tryRead()
. You can assume the exception from the lower-level onReadable()
promise will propagate up to cause this to reject as well.
Once Python's loop is gone, I have no way of scheduling events through EventPort::setRunnable anymore.
I see, for some reason I was imagining that the python event loop would be shut down as a result of a call made by the app during a KJ event, i.e. the KJ event loop is currently running and will continue to empty its queue.
But I realize now I have no idea when the python event loop shuts down. In fact it sounds to me like something that shouldn't really happen except at program exit? Anyway, I guess I don't know the right answer in that case.
Anyway, what I meant was, for example, the promise OwnedFileDescriptor::onReadable() should be designed to reject with an exception if the Python event loop shuts down.
Okay, yes, I get it now. That makes a lot of sense. Thanks.
Your code here is a bit different, the AsyncIoProvider implementation uses the Python APIs directly to listen for events
For the record: I'm not very happy with Python's API, because its cross-platform support is somewhat lacking. I'd be happier to reuse some KJ API that lets me poll for events. But as far as I can see, the code for that is not publicly exposed. And I'm not super keen on copying all that code over and maintaining it separately. Do you have a better solution for that?
But I realize now I have no idea when the python event loop shuts down
Python has a asyncio.run(my_coroutine)
function that starts up an event loop and shuts it down when the coroutine is done. This function is usually run top-level and only once. But you could potentially run it multiple times (which will start multiple subsequent event loops). But it's possible to leak async objects from asyncio.run
. This happens, for example, when you have some background task that pings a server every so often. If the user doesn't shut that task down properly, it might leak. But I think that with your comments I can solve that now.
I'd be happier to reuse some KJ API that lets me poll for events. But as far as I can see, the code for that is not publicly exposed.
This is hard because at the bottom of the stack, kj::UnixEventPort is waiting for all I/O events using a single unified poll() loop (or using epoll on Linux, kqueue on BSD, etc.), and you can't really have two poll() loops running in the same thread. The Python asyncio system presumably has its own poll loop, so you can't use UnixEventPort at the same time.
I tried to design the KJ APIs to allow building on top of some other system's event loop. This is exactly what you did in this PR.
It's possible there's a different level of abstraction that would have worked better here, like maybe if UnixEventPort's API were abstract, you could implement that directly on top of the Python event loop, and then you'd be able to reuse the AsyncIoProvider implementation that sits on UnixEventPort. I'm not really sure, though, because UnixEventPort is very much tied to file descriptors and unix-isms -- it makes no sense on Windows. But I'd expect many of these other event loops are designed to be somewhat cross-platform. AsyncIoProvider is the only abstraction I've come up with which avoids being platform-specific.
Okay, fair enough. I guess we'll just have to live with Python's API here. And if we ever do https://github.com/capnproto/pycapnp/issues/311, then the situation will improve because Python's higher level API's are more cross-platform.
Thanks!
I made a bunch more changes. Summarizing:
kj::Promise
is wrapped in Python classes, fixing a large amount of pre-existing memory leaks. See this commit message for further explanation: https://github.com/capnproto/pycapnp/pull/310/commits/0c013116c81e99aee589a055477f704875ab855ca_wait()
: https://github.com/capnproto/pycapnp/pull/310/commits/60491522f2518086d0be15ce509dd5c7d61932a5The only major thing that still remains to be done is porting asyncProvider.cpp
to Windows.
@kentonv, sorry for all the questions, but my systems programming skills are really being pushed to the limit here, especially on windows. I'm now trying to port to windows, where my strategy is basically to copy large parts of async-io-win32.c++
(very early attempt here). This seems somewhat trickier than for Unix though, because this file is much more intertwined with the Win32EventPort
. In particular, it uses Win32EventPort::observeIo(HANDLE)
and related operations.
Question:
Is it safe for me to create a Win32EventPort
and use it only to call observeIo
? (Handling of other signals would still be done through Python.)
If not, could you suggest a simple way to adapt the code in async-io-win32.c++
? My Windows programming knowledge is really close to zero.
Thinking about the cross-platform issue a bit more: I'm starting to think that we should just double down on my proposal in https://github.com/capnproto/pycapnp/issues/311 and entirely get rid of the use of raw file descriptors when the asyncio loop is active. That way we can entirely circumvent all this rather tricky, low-level system programming. It will probably be a little slower, but I think that if we implement it cleverly using buffered streaming protocols the performance should be acceptable.
That will allow us to get rid of all the new code in asyncProvider.cpp
, and solve all the cross-platform issues. Thoughts, @haata and @kentonv ?
No, you can't just use a Win32EventPort
in the same thread as the python event loop. The problem is that Win32EventPort
, like UnixEventPort
, assumes that whenever the event loop is empty, it'll call eventPort.wait()
to wait for the next event. But in your setup on top of the python event loop, eventPort.wait()
is never used, because Python's event loop is the one that does the waiting. So if you created a Win32EventPort
and tried to use it, it would never produce any events.
So, you will need to figure out how to wait on Win32 handles using the Python event loop APIs. Yeah, it does sound like #311 might be part of the answer here.
This was something that I also had difficulty with, the file descriptors just didn't work well when interacting with Python. Working with something Python supports directly (cross-platform) would make a lot of this easier to think about/debug for a python developer. Performance can always be tuned later.
I have now implemented AsyncIoStream
based on Python's Transport/Protocol functionality. It is quite tricky to do this wrapping, because propagating error conditions and cancellation between AsyncIoStream
and Protocol
is nontrivial. As such, there are probably some subtle bugs. Additionally, the Python implementation of transports/protocols is itself quite buggy and I had to work around three distinct bugs...
This PR is now fully cross-platform. As such, I believe that this is now a candidate for merging.
This PR introduces a number of backwards-incompatible changes:
capnp.remove_event_loop()
has been removed. It is no longer needed.TwoPartyServer.write()
and TwoPartyClient.write()
methods are now async
.address:port
string to TwoPartyClient()
and TwoPartyServer()
in asyncio mode. This would require the initializer to be marked async
, which is not possible. (And I don't personally like passing strings for addresses in there anyway, so I'd rather get rid of all string passing at some point).As such, I would propose that after this is merged, a 2.0 beta 1
release is made. This way, we can have backwards incompatible changes, and everything can be tested by a wider audience.
@haata Gentle reminder that this is ready for final review and merging.
Thanks for the reminder (somehow I'll find some time to do this by the end of the weekend).
@LasseBlaauwbroek can you summarize a changelog entry for me? You don't need to add it to the file (I can do that later), just add it as a comment here. After that I'll merge it in.
Thanks for merging, and sorry for the lack of reply. My attention was shifted away from pycapnp (but it will probably return at some point).
Now that this is merged, I rebased #312 #312. They should be ready for reviewing/merging.
Fix #256
This PR attempts to remove the slow and expensive polling behavior for asyncio in favor of proper linking of the KJ event loop to the asyncio event loop.
This is a highly experimental proof-of-concept. There are probably memory leaks all over the place; my knowledge of both the C++ side and the Cython side is limited. However, early feedback from @kentonv, @haata, and others would be highly appreciated. Also, don't hesitate to push any and all improvement to this branch (or another branch).
Current state: The example in
examples/async_client.py
is functional, but its counterpartexamples/async_server.py
is not because timers are not yet implemented. So to run this, the server needs to be started with a version of pycapnp without these changes.Most of the code in
capnp/helpers/asyncProvider.cpp
was taken from @kentonv's nodejs implementation. So I'm hereby asking for permission to copy that. If that's okay, let me know what kind of attribution you might want.