Closed · ZmnSCPxj closed this 4 years ago
I have been implementing this in my branch, but I am now considering some interesting questions.

First of all, `send_outreq` inherently assumes that while it is blocked waiting for a command it gives to `lightningd`, the `cmd` it was allocated from will not be `tal_free`d. This is true in the current each-command-is-a-single-thread model, since the only way to destroy the `cmd` would be either before `send_outreq` or in the callback given to `send_outreq`.
The assumption is inherent here because `send_outreq` constructs a `struct out_req` that is tal-allocated from the input `struct command` (which can be `NULL`), then inserts it into a global map of command IDs to `out_req` structures. But `send_outreq` does not install any destructors on the `struct out_req`. If the `struct command` is deleted while a `send_outreq` is pending, then when `lightningd` responds, the map of command IDs would return an already-freed `struct out_req`. This is not an issue now, since a `struct command` currently represents only a single "thread", and while that thread is blocked on a `send_outreq` the `struct command` cannot be freed.
(Inside `lightningd` itself the `struct command` could get freed by the connection to the client getting lost, but inside plugins this is not so, since only `lightningd` talks to them and they will get killed if `lightningd` dies anyway.)
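The shape of the hazard, in a rough fragment with invented field names (not the actual libplugin definitions):

```c
/* Rough fragment only; the field names are invented for illustration. */
struct out_req {
	u64 id;              /* key in the global id -> out_req map       */
	struct command *cmd; /* plugin-level command this hangs off       */
	struct command_result *(*cb)(struct command *cmd, const char *buf,
				     const jsmntok_t *result, void *arg);
	void *arg;
};

/* send_outreq() tal-allocates this off cmd and registers it in the map,
 * but installs no destructor.  If cmd were tal_free()d while the request
 * is pending, the map would keep a dangling pointer, and the response
 * handler would invoke a callback on freed memory. */
```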
So the easy way to implement this concurrency is to simply disallow access to the `struct command` inside sparks (i.e. spark callbacks do not receive a `struct command *` argument, and would be forced to pass in a `NULL` `struct command *` to `send_outreq`), in much the same way that we do not pass in a `struct command` to plugin timers.
This is the approach I initially planned, but I realized that denying access to the `struct command *` inside sparks means that, when a spark encounters an error from a command that has to succeed, it cannot use the `forward_error` convenience function to report failure immediately and end the plugin-level command. Instead, the spark has to store the error in the plugin-command-specific structure and then just complete itself, letting the "main" plugin-command processing (which has access to the `struct command *`) propagate the error.
But suppose instead that we could fail a command from any spark derived from that command. Then, if a spark fails the command (causing it to be deallocated):

* Other sparks of that command, whether blocked on a `send_outreq` or in `plugin_wait_sparks`, would simply disappear from execution.
* `send_outreq` commands blocking sparks of the plugin-level command would have their result (success or failure) simply ignored; maybe we can print it in the plugin log ("lightningd command completed after plugin-level command completed with result: %.*s").

In this view, the execution triggered by a plugin command arriving from `lightningd` at the plugin would simply be the "initial spark" of the command.
So ---

* Make `send_outreq` robust against the command it is allocated from being destroyed.
* Split the `struct out_req` into two structures: one kept in the command ID map, and one tal-allocated from the `struct command *` and containing the callbacks.
* When the `struct command *` is destroyed, it clears the pointer in the other structure (the one in the command ID map).
* When the response from `lightningd` arrives, we look up the structure in the command ID map. If its pointer to the other structure is clear, then that means the plugin-level command was completed before the `lightningd` command completed, so just log it and drop. If it is not clear, then we clear the pointer of the other structure, schedule it to be deleted later, then invoke the appropriate callback. In both cases we remove the structure from the command ID map and outright destroy it.

This allows a `send_outreq` to simply disappear out of existence if the `struct command *` used in it is failed or succeeded.
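A minimal sketch of that split, with invented names (`struct out_req_map_entry`, `struct out_req_cbs`, `out_req_map_get`/`out_req_map_del`) standing in for whatever the real code would use:

```c
/* Sketch only: all names here are invented for illustration. */

/* Lives in the command ID map; owned by the map, not by the command. */
struct out_req_map_entry {
	u64 id;
	struct out_req_cbs *cbs; /* NULLed once the command is gone */
};

/* Tal-allocated off the struct command; holds the callbacks. */
struct out_req_cbs {
	struct out_req_map_entry *entry;
	struct command *cmd;
	struct command_result *(*cb)(struct command *cmd, const char *buf,
				     const jsmntok_t *result, void *arg);
	void *arg;
};

/* Installed with tal_add_destructor() on the command-owned half: when
 * the command dies, just clear the map entry's pointer to us. */
static void destroy_out_req_cbs(struct out_req_cbs *cbs)
{
	if (cbs->entry)
		cbs->entry->cbs = NULL;
}

/* When lightningd's response with this id arrives: */
static void handle_ld_response(struct plugin *plugin, u64 id,
			       const char *buf, const jsmntok_t *result)
{
	struct out_req_map_entry *entry = out_req_map_get(plugin, id);

	if (entry->cbs) {
		/* Detach so the command being freed later is harmless.
		 * (The callback half could also be scheduled for
		 * deletion here.) */
		entry->cbs->entry = NULL;
		entry->cbs->cb(entry->cbs->cmd, buf, result,
			       entry->cbs->arg);
	} else {
		/* Plugin-level command already completed: just log the
		 * result and drop it. */
	}

	/* In both cases, remove from the map and destroy the entry. */
	out_req_map_del(plugin, entry);
	tal_free(entry);
}
```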
For `multifundchannel` specifically, the `connect`s in parallel would appreciate being able to just `forward_error` a failure to `connect` to the peer; other connection attempts would remain ongoing. The `fundchannel_start`s and `fundchannel_continue`s in parallel would not, because we have more complex cleanup, requiring that we `fundchannel_cancel` any peers that have successfully `fundchannel_start`ed at the time any other `fundchannel_start` or `fundchannel_continue` failed.
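For instance (a sketch with abbreviated signatures; `connect_spark`, `spark_complete` and the `struct multifund_dest` fields are assumptions, not actual code):

```c
/* connect can hand errors straight to forward_error: failing one connect
 * fails the whole multifundchannel, the other connect sparks disappear,
 * and the underlying lightningd connects just run on, results ignored. */
static struct command_result *connect_spark(struct command *cmd,
					    struct multifund_dest *dest)
{
	return send_outreq(cmd, "connect",
			   &connect_done, &forward_error, dest
			   /*, connect parameters */);
}

/* fundchannel_start cannot: its error path records the failure and merely
 * completes the spark, so the main flow can fundchannel_cancel the peers
 * that already started before failing the command. */
static struct command_result *fundchannel_start_failed(struct command *cmd,
						       const char *buf,
						       const jsmntok_t *error,
						       struct multifund_dest *dest)
{
	dest->error = json_strdup(dest, buf, error);
	return spark_complete(cmd); /* hypothetical: just end this spark */
}
```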
I've been refactoring libplugin to use ccan/io for plugin interaction with `lightningd`, and I'm now considering using `io_plan`s also for RPC, which I believe would help with concurrency.

How I'm thinking of this is pretty much like how `lightningd/plugin` talks to plugins:

* `send_outreq` `tal_expand`s a `js_arr` of streams to send and `io_wake`s the writer plan;
* an `io_plan *read_ld_response` reads the response's id and calls the appropriate callback.

Do you think we could "kill two birds with one stone" by just using ccan/io in libplugin?
In https://github.com/ZmnSCPxj/lightning/tree/multifundchannel I wrote concurrency "on top of" `plugin_timers`, with minimal changes to the basic `libplugin` itself: https://github.com/ZmnSCPxj/lightning/commit/8c8a610b1ae15cd2fa96229bb80296f1707c9029

So using `ccan/io` is not strictly needed here. It does seem to me that you could rewrite `libplugin` to use `ccan/io` and provide nearly the same interface to plugins, and that interface would just be used by my sparks system anyway, without changing its implementation.
Why not? I've mostly implemented it (https://github.com/darosior/lightning/commits/libplugin_io_plan) with no change to the interface apart from `plugin_log()`.

`send_outreq` is also part of this, but I have not tested concurrency. Anyway, it can be reset (or merged) to include your spark system.
Currently, libplugin inherently assumes that every plugin-level command will have a single "thread" of operations. For example, `pay` might do something like:
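(A hypothetical fragment; the callback names and the abbreviated `send_outreq` signature are illustrative, not the actual `pay` code.)

```c
/* One "thread" of operations: each lightningd command is issued from the
 * callback of the previous one, so only one request is ever outstanding. */
struct pay_state {
	u64 msat;	/* destination, route, payment_hash, etc. elided */
};

static struct command_result *waitsendpay_done(struct command *cmd,
					       const char *buf,
					       const jsmntok_t *result,
					       struct pay_state *p)
{
	/* Payment resolved: just pass lightningd's result through. */
	return forward_result(cmd, buf, result, p);
}

static struct command_result *sendpay_done(struct command *cmd,
					   const char *buf,
					   const jsmntok_t *result,
					   struct pay_state *p)
{
	return send_outreq(cmd, "waitsendpay",
			   &waitsendpay_done, &forward_error, p
			   /*, waitsendpay parameters */);
}

static struct command_result *getroute_done(struct command *cmd,
					    const char *buf,
					    const jsmntok_t *result,
					    struct pay_state *p)
{
	return send_outreq(cmd, "sendpay",
			   &sendpay_done, &forward_error, p
			   /*, sendpay parameters */);
}

static struct command_result *json_pay(struct command *cmd,
				       const char *buf,
				       const jsmntok_t *params)
{
	struct pay_state *p = tal(cmd, struct pay_state);
	/* (Parameter parsing elided.) */
	return send_outreq(cmd, "getroute",
			   &getroute_done, &forward_error, p
			   /*, getroute parameters */);
}
```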
While designing a C plugin for `multifundchannel` for #1936, however, I think it is best if we can support sending multiple commands in parallel, e.g.:
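(Again a hypothetical fragment; `parse_destinations`, `connect_done` and `connect_failed` are placeholders.)

```c
/* What we would like multifundchannel to be able to do. */
static struct command_result *json_multifundchannel(struct command *cmd,
						    const char *buf,
						    const jsmntok_t *params)
{
	/* (Parsing of the destinations array elided.) */
	struct multifund_dest *dests = parse_destinations(cmd, buf, params);

	/* Fire off one connect per destination without waiting for the
	 * previous one to finish... */
	for (size_t i = 0; i < tal_count(dests); i++)
		send_outreq(cmd, "connect",
			    &connect_done, &connect_failed, &dests[i]
			    /*, connect parameters */);

	/* ...then somehow resume only after *all* of them have answered,
	 * which the current one-command-at-a-time send_outreq cannot
	 * express. */
	return command_still_pending(cmd);
}
```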
Currently `send_outreq` is designed to just send commands one at a time, with processing of the plugin command suspended while the lower-level command is executed.

This is undesirable for `multifundchannel`, since we would end up waiting for one peer to finish some step before moving to the next peer, when in principle, since those are (presumably) separate peers, we should be able to at least start each phase in parallel with all of them. That way, the time of a phase is only the time of the slowest peer, instead of the sum of the times of all peers.

What I would like to have is something in `libplugin.h` that is very much like:
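(A hypothetical sketch of the kind of interface meant; the names and signatures are placeholders, not the actual proposal.)

```c
/* Hypothetical sparks API for libplugin.h. */
struct spark;

/* Start a concurrent strand ("spark") of the given plugin-level command.
 * The callback runs as if it were its own command thread. */
struct spark *plugin_start_spark(struct command *cmd,
				 struct command_result *(*cb)(struct command *cmd,
							      void *arg),
				 void *arg);

/* A spark callback signals that it has finished by returning this. */
struct command_result *spark_complete(struct command *cmd);

/* Suspend the caller until every spark in the array has completed, then
 * resume in the given callback. */
struct command_result *plugin_wait_sparks(struct command *cmd,
					  struct spark **sparks,
					  struct command_result *(*cb)(struct command *cmd,
								       void *arg),
					  void *arg);
```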
To some extent, the current `plugin_timer` has a lot of similarities to this, and with some modification to `plugin_timer` (specifically, allowing creation of a timer that takes an argument and a `struct command`) and access to the `struct command_result pending, complete;` objects, it might be possible to build the above sparks system on top of the plugin timers (by using timers with a `0` `timerel` as the spark, constructing a spark object that serves as a coordinator between the timer and the "main thread", and so on).

Then, for the `multifundchannel` case, during the `connect` phase we would start sparks for each peer listed, saving the `spark` for each, then wait for all the sparks to complete and check the results of the `connect` commands.

Of course, concurrency is much harder... sigh.
Thoughts? @rustyrussell @niftynei @cdecker ?