Closed — Sjors closed this issue 6 years ago
The way `lightning-cli` is implemented is as a simple JSON-RPC wrapper, meaning we cannot interact with the caller of `lightning-cli`. The only way to print something is to return a definite failure or a definite success.
What I intend to do is to create `startpay`, `statuspay`, and `waitpay`. `startpay` just starts the payment and returns immediately. `statuspay` polls what the payment is doing. `waitpay` stalls until the payment definitely succeeds or definitely fails.
> In particular I'd like to see some sort of table which lists the steps that are in progress, i.e. which nodes it's waiting on for what, etc.
The only node it can wait on is the first direct peer it initially sent to. The details of where the payment has reached, or whether some peer other than the first direct peer died or disconnected, are not known by the current node (and might not be known by any of the other nodes on the route either). In short: the only things we can report are that it succeeded, that it failed, or that we are still waiting for the payment to complete from the first direct peer, which is the only node it talks to.
Fleshing things out:
```sh
$ lightning-cli startpay ${BOLT11}
{ "pay_id": 123456 }
$ lightning-cli statuspay 123456
{ "completed": false, "faillog": [], "status": "Getting Route", "getroute_tries": 0, "sendpay_tries": 0 }
$ lightning-cli statuspay 123456
{ "completed": false, "faillog": [], "status": "Attempting payment", "getroute_tries": 1, "sendpay_tries": 0, "route": [{"id": "0211111111111", "channel": "111:11:1", "msatoshi": 1000, "delay": 42}] }
$ lightning-cli statuspay 123456
{ "completed": false, "faillog": ["WIRE_TEMPORARY_CHANNEL_FAILURE at channel 111:11:1"], "status": "Getting Route", "getroute_tries": 1, "sendpay_tries": 1 }
$ lightning-cli statuspay 123456
{ "completed": false, "faillog": ["WIRE_TEMPORARY_CHANNEL_FAILURE at channel 111:11:1"], "status": "Attempting payment", "getroute_tries": 2, "sendpay_tries": 1, "route": [{"id": "0211111111111", "channel": "222:22:1", "msatoshi": 1000, "delay": 24}] }
$ lightning-cli statuspay 123456
{ "completed": true, "succeeded": true, "faillog": ["WIRE_TEMPORARY_CHANNEL_FAILURE at channel 111:11:1"], "status": "Finished", "getroute_tries": 2, "sendpay_tries": 2, "route": [{"id": "0211111111111", "channel": "222:22:1", "msatoshi": 1000, "delay": 24}] }
# Needed to clean up memory used by the payment system.
# Waits if "completed" is false; returns immediately if "completed" is true.
# After it returns, the payment id will no longer be valid.
$ lightning-cli waitpay 123456
```
Perhaps also add a `trycancelpay` which attempts to cancel the payment (note that cancellation of the payment may fail if the payment has reached the destination).
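To make the proposed flow concrete, here is a minimal client-side sketch of how a caller would drive the three-call interface. This is a simulation only: `startpay`/`statuspay`/`waitpay` are the hypothetical RPC names from above, and `StubPayDaemon` stands in for `lightningd`, which is not part of any real API.

```python
import itertools

class StubPayDaemon:
    """In-memory stand-in for lightningd implementing the proposed calls."""
    def __init__(self):
        self._ids = itertools.count(123456)
        self._pays = {}

    def startpay(self, bolt11):
        # Starts the payment and returns immediately with a pay_id.
        pay_id = next(self._ids)
        self._pays[pay_id] = {"completed": False, "faillog": [],
                              "status": "Getting Route"}
        return {"pay_id": pay_id}

    def statuspay(self, pay_id):
        # Poll: report current state without blocking.
        return dict(self._pays[pay_id])

    def waitpay(self, pay_id):
        # The real call would block until completed; the stub just
        # returns the result and frees the pay_id.
        return self._pays.pop(pay_id)

    # Test hook: drive the payment to completion.
    def _finish(self, pay_id, succeeded):
        self._pays[pay_id].update(completed=True, succeeded=succeeded,
                                  status="Finished")

d = StubPayDaemon()
pay_id = d.startpay("lnbc1...")["pay_id"]
assert d.statuspay(pay_id)["completed"] is False
d._finish(pay_id, succeeded=True)
final = d.waitpay(pay_id)      # pay_id is no longer valid after this
assert final["succeeded"] is True
```

The point of the sketch is the lifecycle: `statuspay` is repeatable and side-effect-free, while `waitpay` is the single call that reclaims the payment's memory.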
@Sjors @practicalswift Thoughts?
Hmmm, thinking more, perhaps I should instead do the below:
- `startasyncpay bolt11` - Starts an asynchronous payment. Returns a `pay_id` immediately.
- `waitasyncpay pay_id` - Pops a message from the asynchronous payment message log, and returns it. If the payment has ended and we have popped the message indicating success/failure, this also cleans up the `pay_id`. Blocks if there is no message in the log.
- `abortasyncpay pay_id` - Attempts to stop a payment; the payment can still push through (for example if you cancelled while a payment is in-flight and it manages to reach the payee).

So basically each payment has a queue of messages it will dump to tell the story of what it is doing, and an external driving program performs the above in sequence.

Basically I am proposing to create a separate `lightning-pay` tool that will print the messages continuously from the payment as it proceeds.
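The driving program described above reduces to a blocking-pop loop over the payment's message queue. A minimal sketch, assuming the hypothetical `waitasyncpay` semantics (block when empty, final message carries a completed flag), with `queue.Queue` standing in for the daemon-side log:

```python
import queue
import threading

class AsyncPay:
    """Each payment owns a queue of log messages; waitasyncpay pops one,
    blocking when the queue is empty, mirroring the proposal above."""
    def __init__(self):
        self.msgs = queue.Queue()

    def push(self, text, completed=False):
        self.msgs.put({"message": text, "completed": completed})

    def waitasyncpay(self):
        return self.msgs.get()   # blocks if there is no message in the log

pay = AsyncPay()

def worker():
    # Simulated payment progress, ending with the success message.
    pay.push("Getting route")
    pay.push("Attempting payment")
    pay.push("Payment succeeded", completed=True)

t = threading.Thread(target=worker)
t.start()

# Driver loop: pop messages until the final success/failure entry.
log = []
while True:
    msg = pay.waitasyncpay()
    log.append(msg["message"])
    if msg["completed"]:
        break
t.join()
assert log == ["Getting route", "Attempting payment", "Payment succeeded"]
```

A `lightning-pay` tool would run exactly this loop, printing each message as it arrives.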
I'm not a huge fan of separate commands for sync and async payments. For one thing, you may not know that you want to debug a payment until it's started. I think it's better if `pay` waits for a few seconds and, if not complete, returns with "payment in progress, you can monitor progress using `status pay 123456`".
That seems reasonable to me.
Okay, fleshing it out some more.
I believe the below events are what are significant enough for `paystatus` (or whatever I end up calling it) to report:
One use case for this RPC I can imagine is a GUI that's trying to render progress to the user. That suggests payments should have states, and `paystatus` should return this state along with whatever additional information is useful.
Well, the only states are "looking for route", "attempting payment", "failed", or "succeeded"; we cannot get more detail. Attempting payment can take up to the lock time of the route, and we cannot get information from nodes on the route if those nodes decide to delay rather than reply as soon as they are able.
A good number of recent issues are mostly about payments not giving good feedback; they would be helped tremendously if we could provide information to the user instead of having them dig through logs. So putting this in the 0.6 milestone with high priority...
- `pay` - add a new `returnafter` setting (default 5), the number of seconds after which we will return to the user regardless of whether we succeeded or not. Add a new error to `pay`, `PAY_RETURN_TO_CALLER`, with fields `pay_id` and `log`, and message "Returning to caller after 5 seconds; monitor payment with `paystatus 42` or `waitpaystatus 42`".
- `paystatus` shows the most recent log entry as `status` and dumps the entire contents of the message queue as `log` (and clears the log), and returns immediately.
- `waitpaystatus` pops from the queue of messages, or waits if the message queue is empty.
- The final message carries a `"completed" : true` field. Once this message is popped from the log, the log is cleared from memory and the `pay_id` becomes invalid. This is used for definite failure or definite success of the payment.
- `paycancel` attempts to stop a payment; in any case it will prevent the payment from being retried the next time it fails to route.
- Each log entry has a `message`, a numeric `messagecode`, details that depend on `messagecode`, and the `completed` field above (which defaults to `false` unless this is the log message for definite failure or definite success). This is to support internationalized user interfaces.
- A `lightning-pay` command that uses the above interface and should be the preferred command-line interface for payment.

Plan:

- The `pay` command has a `paylog` object that stores the log. Use `tal_link` to store the latest entry in both the queue and as the `status` field (or just duplicate them). Store a `json_result` per entry?
- Add a `payalgos` object that the `lightningd` object must have. Add a timer to `pay` on starting. If the return-to-caller timer triggers, reparent the `pay` from the command to the `payalgos` object, then return with the new `PAY_RETURN_TO_CALLER` error.
- Implement `paystatus`, `waitpaystatus` and `paycancel`.
- Add a `lightning-pay` cli and put a big note in `README.md` to use that instead of `lightning-cli pay`.

Looks good to me. Some small tweak suggestions:
- Replace `waitpaystatus` with `paystatus 42 wait`.
- `paystatus` and `pay` should return a JSON object with a `state` field (in addition to or instead of returning error codes).
- `paystatus` could have an argument to only return new entries (or that might be the default).
- `pay :bolt11`, `pay status :id [wait]`, `pay cancel :id`?
- "failed" could also be split into "failed_to_find_route" and "failed_to_pay", though perhaps it shouldn't be split.
The only failure that can be differentiated is whether the payee was reached but refused to accept the payment, or we ran out of viable routes (i.e. we never managed to reach the payee). But I suppose the distinction is important.
I don't think a payment log should ever clear; if a payment succeeds but took 8 seconds there might still be interesting info in the log.
> `paystatus` could have an argument to only return new entries (or that might be the default)
I think the caller should have responsibility for the log, so it should be able to remember the log entries that have been dumped before, i.e. we always just return new entries and never store already-returned log entries.
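One way to return only new entries without the daemon discarding anything is for the caller to pass back an index of what it has already seen. This is a sketch of that idea, not any existing c-lightning API; `PayLog` and its methods are hypothetical:

```python
class PayLog:
    """Daemon-side log that never clears; callers fetch incrementally."""
    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)

    def paystatus(self, since=0):
        """Return entries the caller has not seen yet, plus the index
        to pass on the next call."""
        new = self.entries[since:]
        return {"log": new, "next": len(self.entries)}

log = PayLog()
log.append("Getting route")
log.append("Attempting payment")

first = log.paystatus()
assert first["log"] == ["Getting route", "Attempting payment"]

log.append("Payment succeeded")
second = log.paystatus(since=first["next"])
assert second["log"] == ["Payment succeeded"]   # only the new entry
```

This keeps the full log available server-side (useful for later debugging) while still giving each caller only the delta it asked for.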
> rather than multiple commands
I think multiple commands are superior to adding too many modes to a single command. In particular we should consider how to handle keyword vs positional arguments. So also separate `waitpaystatus` and `paystatus`.
Note that we are discussing and planning extensive changes to the JSON-RPC of the `pay` command. So we should probably invite @cdecker and @rustyrussell to this discussion (especially since I would very much like not to implement code they would reject anyway). I would also like to invite @bitonic-cjp as I believe https://github.com/bitonic-cjp/fireworks is very likely to be affected. Do you know anyone else building wallets (as opposed to payment processors like BtcPay and lightning-charge) on top of c-lightning?
> I think caller should have responsibility of the log, so it should be able to remember the log entries that have been dumped before
What about a user who sees an error, contacts some support team, is then asked to provide the logs? They might have already lost them at that point. Same for a misbehaving (or just slightly buggy) consumer of the RPC.
As a GUI developer, I like to have non-blocking calls, to keep my GUI responsive. At the very least there should be calls to start payments and to retrieve the current status. I guess these are already covered in the proposals in this discussion.
Finally, it'd also be nice to have an interface that automatically notifies me of any changes. I'm not sure if JSON-RPC is suitable for that. Addition of a notification interface is something that would be useful for many other things as well.
BTW, I'll probably also be developing some (proprietary?) web-based Lightning software for Bitonic; I think the same wish list applies there.
About the payment status, and logs: would you like to store that on disk (server-side, e.g. in C-Lightning's sqlite database)? I guess you should: it may not be recommended to shut down your node while a payment is in progress, but accidents do happen (power failure etc.), and the behavior of C-Lightning after a shutdown+restart should be the same as if the shutdown+restart didn't happen, as much as possible.
> As a GUI developer, I like to have non-blocking calls, to keep my GUI responsive.... Finally, it'd also be nice to have an interface that automatically notifies me of any changes.
The way it is implemented, it is a blocking call! Consider `waitanyinvoice`. You invoke this call, the call blocks, then when a change occurs (a new payment arrives) it returns. You process the payment and then call `waitanyinvoice` again.
Indeed, that is why I propose a `waitpaystatus` also --- it waits for changes in the pay status/log and then returns. If there is a change it has not reported yet, it returns immediately. The alternative is a non-waiting `paystatus` that gives you the set of changes it has seen since the last time you called it. It is also the reason why I want to clear the logs on the c-lightning side --- this is a useful technique to implement "getting the last set of changes since the caller called": you report everything now, then clear the log so you will not report it again next time you are called. But in a GUI situation, `paystatus` would have to be polled; if you poll too often you waste CPU getting back nothing new, and if you do not poll often enough then updates may not be noticed fast enough and users get inferior UX. Hence `waitpaystatus`.
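The usual way to combine a blocking `waitpaystatus`-style call with a responsive GUI is a dedicated poller thread that blocks on the RPC and hands updates to the GUI through a thread-safe mailbox. A minimal sketch, with `waitpaystatus_stub` standing in for the (hypothetical) blocking RPC:

```python
import queue
import threading
import time

updates = queue.Queue()          # GUI-side mailbox

def waitpaystatus_stub():
    """Stand-in for the blocking RPC: returns only after a change occurs."""
    time.sleep(0.01)             # simulate waiting for a status change
    return {"status": "Attempting payment"}

def poller():
    # This thread blocks on the RPC so the GUI thread never does.
    for _ in range(3):
        updates.put(waitpaystatus_stub())
    updates.put(None)            # sentinel: stream ended

t = threading.Thread(target=poller)
t.start()

seen = []
while (u := updates.get()) is not None:
    # A real GUI would drain the mailbox with a non-blocking get()
    # from its event loop instead of blocking here.
    seen.append(u["status"])
t.join()
assert seen == ["Attempting payment"] * 3
```

This gets event-driven behavior without any JSON-RPC notification support: the blocking call does the waiting, and only the mailbox crosses the thread boundary.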
Indeed it should be possible to make an async interface on the Python side to any blocking call (e.g. futures, which are used in `tests/test_lightningd.py`) to prevent hangs in your UI thread.
Now, to be fair, JSON-RPC allows setting up notifications: you make a call, then the server will send back 0 or more responses. But the c-lightning JSON-RPC will require even more rework to do so, and our included client interfaces `lightning-cli` and `pylightning` do not expect notifications and will also need rework (`lightning-cli` in particular will likely need special handling of `pay` in that situation). And clients using JSON-RPC will also require rework (but they would also require rework due to the proposed changes in the JSON-RPC).
What we have here is a juggling act of balancing some CLI and other client uses. It is also the reason why I propose deprecating `lightning-cli pay` for CLI use in favor of a specific `lightning-pay` CLI that uses the JSON-RPC to provide continuous feedback on the payment, as @Sjors originally wanted in this issue.
> About the payment status, and logs: would you like to store that on disk (server-side, e.g. in C-Lightning's sqlite database)?
I would like to defer that, although the principle behind it is sound. I still have some tweaks I want to apply to the payment algorithm, and that will require changes in what data I need to keep around.
I am also starting to wonder if I should just rip out `pay` entirely and put its functionality in a separate `lightning-pay` CLI. Then clients would have to use the low-level `getroute` and `sendpay`, or decide to invoke `lightning-pay`.
So, to be clear: the pay command is a convenience wrapper; it used to be a separate binary.
GUIs etc. will want an API like `getroute` / `sendpay` (which would just start the payment, not block like it does now) / `listpayments` to poll. `waitanypay` could be added later, but this would do for 0.6.
`listpayments` should also be useful for monitoring the `pay` command: if it isn't now, let's fix it!
If we keep `pay` it should, indeed, have a default timeout. After this timeout it should stop retrying; if after another second it hasn't terminated, it should return anyway. The format of the `pay` return should be the same as `listpayments <item>` (except probably not in array form).
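The retry-until-timeout behavior described here can be sketched as a small loop. This is a simulation of the control flow only, not the actual `pay` implementation; `try_route` and `send` are hypothetical callbacks standing in for `getroute` and a single `sendpay` attempt:

```python
import time

def pay_with_timeout(try_route, send, timeout=20.0, clock=time.monotonic):
    """Retry single payment attempts until one succeeds or the deadline
    passes; after the deadline, stop retrying and report the latest
    failure instead of looping forever."""
    deadline = clock() + timeout
    last_error = None
    while clock() < deadline:
        route = try_route()
        if route is None:
            # No more viable routes: definite failure.
            return {"completed": True, "succeeded": False,
                    "error": "no more routes", "last_error": last_error}
        ok, err = send(route)
        if ok:
            return {"completed": True, "succeeded": True}
        last_error = err   # e.g. WIRE_TEMPORARY_CHANNEL_FAILURE
    # Timed out: payment may still push through; return latest status.
    return {"completed": False, "last_error": last_error}

# Simulated attempts: the first route fails, the second succeeds.
routes = iter([["chan 111:11:1"], ["chan 222:22:1"]])
results = iter([(False, "WIRE_TEMPORARY_CHANNEL_FAILURE"), (True, None)])
out = pay_with_timeout(lambda: next(routes), lambda r: next(results))
assert out == {"completed": True, "succeeded": True}
```

Note the three distinct outcomes: definite success, definite failure (routes exhausted), and timeout with the payment possibly still in flight, which matches the reporting constraints discussed earlier.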
Also note: the terms `rhash` and `r` predate the spec, and are called `payment_hash` and `payment_preimage` respectively. `rhash` and `r` should not appear in the JSON API. We should fix that.
(With Schnorr and scriptless sigs the terms `hash` and `preimage` no longer apply either, but all the alternative names I can think of are too generic, so they'll probably stay.)
Hmm, so... put it back into a separate binary again, so it can provide "real-time" feedback that is not easy with our current JSON-RPC impl? Is that OK?
I propose that we also need methods for `getblockheight` and `waitblockheight`; individual `sendpay` attempts have a timeout (the starting delay of the first hop) measured in blocks, so a separate `lightning-pay` binary should be able to provide the number of blocks remaining before definite timeout-based failure.
Note that a seconds-based timeout would not reliably be able to cancel the payment anyway: if the channel gets dropped onchain, you can only reliably retrieve your released funds after the block-based timeout. This is already done automatically, but it may surprise users that the payment could still be claimed all the way up to the blockheight timeout.
(BTW, I have not delved deeply into `peer_htlcs`, but if we have an HTLC on-channel that has gone past the timelock, and the peer suddenly sends an `update_fulfill_htlc`, what do we do? I imagine we should drop onchain at that point and not fulfill the HTLC, since the timelock has gone past.)
For 0.6, let's stick with pay, with a timeout, and making sendpay non-blocking. Since the default pay timeout would be something like 30 seconds I expect, you won't need a blow-by-blow description anyway; if it fails you'll get the latest status.
getinfo has a blockheight field already.
You can't ever cancel payments, you can just stop retrying. Ideally, listpayments would give an indication as to the worst-case timeout.
Writing a 'pay' wrapper seems like a cool project for Someone Else, who can do it any time and independent of our release schedules, as long as we get the APIs right.
And yes: if an HTLC goes past timeout without getting removed by the peer, we drop to chain.
Okay, if we go that route, we need to add a `waitsendpay` command (even if it is only internal; but if our `pay` wrapper will use it then I think we should expose it too so that Someone Else can do the `pay` wrapper) that blocks until some `sendpay` attempt completes. And possibly a `waitblockheight` that waits for the block height to go past some specified height, for monitoring.
So we should add more info to `listpayments`, and possibly provide some kind of filtering on it so we can focus on only one payment. Do we need a log of routing failures, at least for each payment? Users think `getroute` returning a valid route means it can reach the payee, but this is not true; we may need to tell users that we tried those routes but some channel/node failed to propagate the payment.
Okay, so from @rustyrussell :

- Make `sendpay` asynchronous (return immediately). Add a `waitsendpay` that waits for a `sendpay` attempt to complete. `sendpay` continues to be a "low-level" interface that just initiates a single payment attempt and never retries.
- Add a timeout to `pay`, default 20 seconds (20 because that is how long temporary channel bans are). If a single `sendpay` fails, it retries, but only up to the timeout. When timing out, warn also that the payment could still push through after the timeout.
- Return the same information `listpayments` provides.

> Also note: the terms `rhash` and `r` predate the spec, and are called `payment_hash` and `payment_preimage` respectively. `rhash` and `r` should not appear in the JSON API. We should fix that.
How about a `dev-rhash` command? Or is that acceptable being a `dev-` command? Should I retain its output as `rhash` or...?
Came across this bit of code:

This means that if the payment fails while negotiating with channeld (most likely due to `WIRE_AMOUNT_BELOW_MINIMUM`) the payment disappears.
The issue is that for a nonblocking `sendpay` we need to keep track of payments that fail, including payments that fail while negotiating with channeld, so that a subsequent `waitsendpay` has something to wait on.
Our options are:

1. Change `sendpay` to wait for `channeld` to reply to the `channel_offer_htlc` command. This is not quite nonblocking, but catches more problems (invalid expiry, capacity exceeded, payment too small, too many htlcs) and at least retains the property that "early" failures like this do not get saved in the DB. The problem is that this complicates the interaction between `pay.c` and `peer_htlcs.c`. In particular, what happens if channeld dies after we send the `channel_offer_htlc` but before channeld is able to send the `channel_offer_htlc_reply`? We need to handle that case; I worry there are other cases I miss!
2. Keep the `assert(!payment)` (i.e. unstored payments cannot be updated), and perform a `wallet_payment_store` before `send_payment` returns without an error. But this mostly invalidates the point of having unstored payments: we store them as soon as we can.

If nobody objects, I shall go with option 2 above.
What is a nonstored payment?
I'd expect, if I perform a sendpay command and it returns without an exception, I have changed the (non-volatile) state of lightningd in such a way that, from now on, it tries to perform the given payment. From now on, I should be able to query the status of that payment.
If I receive an exception, the new state of lightningd might depend on the exception type, but generally speaking, the payment probably wasn't started and the state of lightningd wasn't changed.
If lightningd dies before I receive anything, I don't know whether sendpay was successful. After a restart of lightningd, I should be able to determine this by asking lightningd about the status of payments.
If lightningd doesn't die, but doesn't reply anything either (or if for some reason I can't determine whether lightningd has died), I should know what the maximum duration of the sendpay command is, so I can set a time-out on it and declare lightningd unresponsive from that point on(*).
BTW I think this is true for all commands I'd like to use: they should all have a quite short maximum duration, that only depends on things under local control (e.g. CPU, disk performance), and not on peer behavior. From the POV of the (GUI) application, it's OK if lightningd waits for channeld, as long as channeld doesn't wait for peers before it replies to lightningd. It's also OK if lightningd doesn't wait for channeld, to be robust against bugs in channeld; that's an internal design decision for lightningd to make, which doesn't need to impact the API.
(*) I might continue to check in the background to discover when lightningd becomes responsive again.
> What is a nonstored payment?
This concept was added so that we do not save payments until the HTLC is committed to by adding it to a commitment transaction on our side. Until the commitment is made, our offer to pay is not backed by any transaction that we can transmit onchain in case of bad behavior. So we should not save the payment on-disk, since the payment might not even be accepted by the first peer we are talking to. We only save the payment on-disk if the payment has been provisionally accepted as an HTLC by the peer providing a signature to the commitment transaction containing it.
(Payments go through 4 steps: (1) first it is an offer from the payer for an HTLC (not an HTLC yet), (2) then it is an HTLC on some commitment transaction, (3) then it is an offer from the payee to provide the hash preimage (with the HTLC still on the commitment transaction in case of trouble), (4) then it becomes an actual movement of funds from payer to payee (when the payer confirms the preimage is correct). Until step 2 we keep the payment unstored.)
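The four steps above can be written as an explicit progression; a toy sketch (the names are illustrative, not c-lightning's) showing the "stored only from step 2 onward" rule:

```python
# The four lifecycle steps of a payment, in order.
STEPS = ["offered", "htlc_committed", "preimage_offered", "funds_moved"]

class Payment:
    def __init__(self):
        self.step = 0            # starts as a bare offer

    def advance(self):
        self.step += 1

    @property
    def stored(self):
        # Persist only once the HTLC is on a commitment transaction
        # (step 2 onward); a bare offer is kept in memory only.
        return STEPS[self.step] != "offered"

p = Payment()
assert not p.stored              # step 1: offer, unstored
p.advance()
assert p.stored                  # step 2: HTLC committed, now stored
```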
The issue is the nonblocking `sendpay`. We can do the following:

- Return `'completed' : false` even if the payment is still unstored. Store the payment on failure. But if lightningd crashes before the payment can be committed to, the payment is forgotten on startup, as it was not committed to.
- Wait until the payment is committed before `sendpay` returns, and return with `'completed' : false` (the payment completes when the peer provides us the preimage or the peer returns a failure). But if the peer stalls, the `sendpay` is not nonblocking after all.

I suppose I shall have to ask @rustyrussell for advice on how to move forward with this. I am unsure what the `peer_htlcs` code assumes about payments that have been stored on-disk; if `peer_htlcs` assumes that payments on-disk are already committed to in some commitment transaction, then it would be unsafe to store them on-disk and the above two solutions are our only solutions, unless we audit `peer_htlcs` again to check this....
This part of handling payments has been getting a bit of churn already over the past two months (splitting `struct payment` from `struct invoice`, splitting `struct payment` from `struct pay_command`, removing `struct pay_command`, creating a new `struct pay` that is morally equivalent to `struct pay_command` except in `payalgo.c`, creating a concept of an "unstored" and "stored" `struct payment`, and now my conundrum about payments...), suggesting we have not yet come to a good conclusion on how to handle payments both on-disk and in-memory... sigh.
I guess this is also an opportunity to learn about the internal architecture of C-Lightning (or is this going too far off-topic?).
In Amiko Pay, I had an architecture where the only non-busy state is to wait for events (API calls, incoming messages or scheduled time-outs). Whenever an event happens, it is processed, resulting in a list of messages to be sent out and an update of the node state (maybe including a new time-out schedule). Then, the node state update is committed to disk, followed by sending out the messages. This way, being non-blocking is the only natural mode of operation, and if you wanted blocking, you'd implement it in a higher-level layer. Note that Amiko Pay wasn't perfect: for simplicity's sake it always replaced the entire state on-disk, which isn't the most efficient thing to do.
From this background it's strange to see that being non-blocking is apparently (sometimes) difficult in C-Lightning. I'd say that, once you initiate a payment by sending out the very first message in the sequence, you already have a state that's worth storing, and a state in which you can return a sendpay command.
Edit: that state would be something like "offered payment to <peer0>; no HTLC yet; not tried any other peers yet". It can later promote, for instance, to "accepted by <peer0>; offered HTLC" or "offered payment to <peer1>; no HTLC yet; rejected by <peer0>; not tried any other peers yet" or "rejected by all peers".
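The Amiko-Pay-style loop described above (process one event, commit the new state, only then send messages) can be sketched generically. This is an illustration of the pattern, not code from either project; the handler and its state strings follow the example states in the edit above:

```python
def run_event_loop(events, handle, commit, send):
    """Process events one at a time: compute the new state and outgoing
    messages, make the state durable first, and only then perform the
    sends, so a crash never loses acknowledged state."""
    for event in events:
        new_state, outgoing = handle(event)
        commit(new_state)        # durable first...
        for msg in outgoing:
            send(msg)            # ...then the side effects

disk, wire = [], []
run_event_loop(
    events=["start_payment"],
    handle=lambda e: ({"state": "offered to peer0; no HTLC yet"},
                      ["update_add_htlc -> peer0"]),
    commit=disk.append,
    send=wire.append,
)
assert disk == [{"state": "offered to peer0; no HTLC yet"}]
assert wire == ["update_add_htlc -> peer0"]
```

With this shape, non-blocking operation is the default, and any blocking convenience call is layered on top rather than built in.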
I think part of the problem here is that I and @rustyrussell have been pulling at `struct payment` in opposite directions for different purposes: I to get features needed by the RPC, Rusty to get features needed by the peer communications/HTLC. I am now hesitant to pull at the structure again; it might be better to sit down and redesign that part (modulo having to keep compatibility with the existing database!).
So the issue is not so much about nonblocking being difficult here, but rather, with a lack of familiarity with another part of the code that potentially interacts with code I am modifying. So code diving for now...
In any case maybe the "unstored" payment state is simply premature optimization; but I still have to check more deeply into the HTLC side to determine if I can safely store payments immediately or not. That was what was done before, and then "unstored" payments came in during a massive reorg/cleanup of HTLC handling, so maybe there is now something on the HTLC side that requires the database not to contain payments that do not have HTLCs committed to yet.
The past few days I'm having difficulty making payments, with the `pay` command just hanging there. Aborting it and trying again results in "Payment is still in progress". Log entries aren't super clear (to me) either. It would be nice to get more details.

One approach could be that the pay command has a `verbose` argument that shows what it's doing.

Another would be a `paymentlog` command, which prints whatever log information is available for a given payment.

In particular I'd like to see some sort of table which lists the steps that are in progress, i.e. which nodes it's waiting on for what, etc.