ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.82k stars 896 forks source link

Better feedback when pay takes long #1088

Closed Sjors closed 6 years ago

Sjors commented 6 years ago

The past few days I'm having difficulty making payments, with the pay command just hanging there. Aborting it and trying again results in "Payment is still in progress". Log entries aren't super clear (to me) either. It would be nice to get more details.

One approach could be that the pay command has a verbose argument that shows what it's doing.

Another would be a paymentlog command, which prints whatever log information is available for a given payment.

In particular I'd like to see some sort of table which lists the steps that are in progress, i.e. which nodes it's waiting on for what, etc.

ZmnSCPxj commented 6 years ago

The way lightning-cli is implemented is as a simple JSON-RPC wrapper, meaning we cannot interact with the caller of lightning-cli. The only way to print something is to return a definite failure or a definite success.

What I intend to do is to create startpay and statuspay and waitpay. startpay just starts the payment and returns immediately. statuspay polls what payment is doing. waitpay stalls until the payment definitely succeeds or definitely fails.

In particular I'd like to see some sort of table which lists the steps that are in progress, i.e. which nodes it's waiting on for what, etc.

The only node it can wait on is the first direct peer it initially sent to. The details of where the payment has reached, if some peer other than the first direct peer died or disconnected, is not known by the current node (and might not be known by any of the other nodes on the route either). In short: the only thing we can report is whether it succeeded, it failed, or we are still waiting for the payment to complete from the first direct peer, which is the only node it talks to.

ZmnSCPxj commented 6 years ago

Fleshing things out:

$ lightning-cli startpay ${BOLT11}
{ pay_id: 123456 }
$ lightning-cli statuspay 123456
{ completed: false, faillog: [], status: "Getting Route", getroute_tries: 0, sendpay_tries: 0 }
$ lightning-cli statuspay 123456
{ completed: false, faillog: [], status: "Attempting payment", getroute_tries: 1, sendpay_tries: 0, route: [{id: "0211111111111", channel: "111:11:1", msatoshi: 1000, delay: 42}] }
$ lightning-cli statuspay 123456
{ completed: false, faillog: ["WIRE_TEMPORARY_CHANNEL_FAILURE at channel 111:11:1"], status: "Getting Route", getroute_tries: 1, sendpay_tries: 1 }
$ lightning-cli statuspay 123456
{ completed: false, faillog: ["WIRE_TEMPORARY_CHANNEL_FAILURE at channel 111:11:1"], status: "Attempting payment", getroute_tries: 2, sendpay_tries: 1, route: [{id: "02111111111", channel: "222:22:1", msatoshi: 1000, delay: 24}] }
$ lightning-cli statuspay 123456
{ completed: true, succeeded: true, faillog: ["WIRE_TEMPORARY_CHANNEL_FAILURE at channel 111:11:1"], status: "Finished", getroute_tries: 2, sendpay_tries: 2, route: [{id: "0211111111", channel: "222:22:1", msatoshi: 1000, delay: 24}] }
# Needed to clean up memory used by payment system
# Waits if completed false; returns immediately if completed true.
# After it returns, the payment id will no longer be valid.
$ lightning-cli waitpay 123456

Perhaps also add a trycancelpay which attempts to cancel the payment (note that cancellation of payment may fail if the payment reached the destination).

@Sjors @practicalswift Thoughts?

ZmnSCPxj commented 6 years ago

Hmmm, thinking more, perhaps I should instead do the below:

  1. startasyncpay bolt11 - Starts an asynchronous payment. Returns a pay_id immediately.
  2. waitasyncpay pay_id - Pops a message from the asynchronous payment message log, and returns it. If the payment has ended and we have popped the message indicating success/failure, this also cleans up the pay_id. Blocks if no message in the log.
  3. abortasyncpay pay_id - Attempts to stop a payment; the payment can still push through (for example if you cancelled while a payment is in-flight and it manages to reach the payee).

So basically each payment has a queue of messages it will dump to tell the story of what it is doing, and an external driving program performs the above in sequence.

Basically I am proposing to create a separate lightning-pay tool that will print the messages continuously from the payment as it proceeds.

Sjors commented 6 years ago

I'm not a huge fan of separate commands for sync and async payments. For one thing, you may not know that you want to debug a payment until it's started. I think it's better if pay waits for a few seconds and, if not complete, returns with "payment in progress, you can monitor progress using status pay 123456"

ZmnSCPxj commented 6 years ago

That seems reasonable to me.

ZmnSCPxj commented 6 years ago

Okay, fleshing it out some more.

I believe the below events are what are significant enough for paystatus (or whatever I end up calling it) to report:

Sjors commented 6 years ago

One use case for this RPC I can imagine is a GUI that's trying to render progress to the user. That suggests payments should have states, and paystatus should return this state along with whatever additional information is useful.

ZmnSCPxj commented 6 years ago

Well, the only states are "looking for route", "attempting payment", "failed", or "succeeded"; we cannot get more detail. Attempting payment can take up to the lock time of the route, and we cannot get information from nodes on the route if those nodes decide to delay rather than reply as soon as they are able.

ZmnSCPxj commented 6 years ago

A good number of issues recently is mostly about payment not giving good feedback, and would be helped much more tremendously if we could provide information to the user instead of having them dig through logs. So putting this in 0.6 milestone and high priority...

ZmnSCPxj commented 6 years ago

Plan:

Sjors commented 6 years ago

Looks good to me. Some small tweak suggestions:

ZmnSCPxj commented 6 years ago

"failed" could also be split into "failed_to_find_route" and "failed_to_pay", though perhaps it shouldn't be split

The only failure that can be differentiated is whether the payee was reached, but refused to accept the payment, and if we have run out of viable routes (i.e. we never managed to reach the payee). But I suppose the distinction is important.

I don't think a payment log should ever clear; if a payment succeeds but took 8 seconds there might still be interesting info in the log. paystatus could have an argument to only return new entries (or that might be the default)

I think caller should have responsibility of the log, so it should be able to remember the log entries that have been dumped before, i.e. always just return new entries and never store already returned log entries.

rather than multiple commands

I think multiple commands are superior than adding too many modes to a single command. In particular we should consider how to handle keyword vs positional arguments. So also separate waitpaystatus and paystatus.


Note that we are discussing and planning extensive changes to the JSON-RPC of the pay command. So we should probably invite @cdecker and @rustyrussell to this discussion (especially since I would like very much to not implement code they would reject anyway). I would also like to invite @bitonic-cjp as I believe https://github.com/bitonic-cjp/fireworks is very likely to be affected. Do you know anyone else building wallets (as opposed to payment processors like BtcPay and lightning-charge) on top of c-lightning?

Sjors commented 6 years ago

I think caller should have responsibility of the log, so it should be able to remember the log entries that have been dumped before

What about a user who sees an error, contacts some support team, is then asked to provide the logs? They might have already lost them at that point. Same for a misbehaving (or just slightly buggy) consumer of the RPC.

bitonic-cjp commented 6 years ago

As a GUI developer, I like to have non-blocking calls, to keep my GUI responsive. At the very least there should be calls to start payments and to retrieve the current status. I guess these are already covered in the proposals in this discussion.

Finally, it'd also be nice to have an interface that automatically notifies me of any changes. I'm not sure if JSON-RPC is suitable for that. Addition of a notification interface is something that would be useful for many other things as well.

BTW, I'll probably also be developing some (proprietary?) web-based Lightning software for Bitonic; I think the same wish list applies there.

About the payment status, and logs: would you like to store that on disk (server-side, e.g. in C-Lightning's sqlite database)? I guess you should: it may not be recommended to shut down your node while a payment is in progress, but accidents do happen (power failure etc.), and the behavior of C-Lightning after a shutdown+restart should be the same as if the shutdown+restart didn't happen, as much as possible.

ZmnSCPxj commented 6 years ago

As a GUI developer, I like to have non-blocking calls, to keep my GUI responsive.... Finally, it'd also be nice to have an interface that automatically notifies me of any changes.

The way it is implemented, is as a blocking call! Consider waitanyinvoice. You invoke this call, the call blocks, then when a change occurs (a new payment arrives) it returns. You process the payment and then call waitanyinvoice again.

Indeed, that is why I propose a waitpaystatus also --- it waits for changes in the pay status/log and then returns. If there is a change it has not reported yet it returns immediately. The alternative is a non-waiting paystatus that gives you the set of changes it has seen since the last time you called it. It is also the reason why I want to clear the logs in the c-lightning side --- this is a useful technique to implement "getting the last set of changes since the caller called": you report everything now, then clear the log so you will not report them again next time you are called. But in a GUI situation, paystatus would have to be polled, and if you poll too often you are wasting CPU getting back nothing new and if you do not poll enough then updates may not be noticed fast enough and users get inferior UX, hence waitpaystatus.

Indeed it should be possible to make an async interface in the Python side to any blocking call (e.g. futures which is used in tests/test_lightningd.py) to prevent hangs in your UI thread.

Now, to be fair, JSON-RPC allows setting up notifications: you make a call, then the server will send back 0 or more responses. But the c-lightning JSON-RPC will require even more rework to do so, and our included client interfaces lightning-cli and pylightning do not expect notifications and will also need rework (lightning-cli in particular is likely will need special handling of pay in that situation). And clients using JSON-RPC will also require rework (but they would also require rework due to the proposed changes in JSON-RPC).

What we have here is a juggling act of balancing some CLI and other client uses. It is also the reason why I propose deprecating lightning-cli pay for CLI use in favor of a specific lightning-pay CLI that uses the JSON-RPC to provide continuous feedback on the payment as @Sjors originally wanted in this issue.

About the payment status, and logs: would you like to store that on disk (server-side, e.g. in C-Lightning's sqlite database)?

I would like to defer that, although the principle behind it is sound. I still have some tweaks I want to apply to the payment algorithm, and that will require changes in what data I need to keep around.

I am also starting to wonder if I should just rip out pay entirely and put its functionality in a separate lightning-pay CLI. Then clients would have to use the low-level getroute and sendpay, or decide to invoke lightning-pay.

rustyrussell commented 6 years ago

So, to be clear: the pay command is a convenience wrapper; it used to be a separate binary.

GUIs etc will want an API like getroute / sendpay (which would just start the payment, not block like it does now) / listpayments to poll. waitanypay could be added later, but this would do for 0.6.

listpayments should also be useful for monitoring the pay command: if it isn't now, let's fix it!

If we keep pay it should, indeed, have a default timeout. After this timeout it should stop retrying: if after another second it hasn't terminated, it should return anyway. The format of pay return should be the same as listpayments <item> (except probably not in array form).

Also note: the terms rhash and r predate the spec, and are called payment_hash and payment_preimage respectively. rhash and r should not appear in the JSON API. We should fi that.

(With schnorr and scriptless sigs the terms hash and preimage no longer apply either, but all the alternative names I can think of are too generic, so they'll probably stay).

ZmnSCPxj commented 6 years ago

Hmm, so... put it back into a separate binary again, so it can provide "real-time" feedback that is not easy with our current JSON-RPC impl? Is that OK?

I propose that we also need methods for getblockheight and waitblockheight; individual sendpay attempts have a timeout (the starting delay of the first hop) measured in blocks, so a separate lightning-pay binary should be able to provide the number of blocks remaining before definite timeout-based failure.

Note that a seconds-based timeout would not reliably be able to cancel payment anyway: if the channel gets dropped onchain, you can only reliably retrieve your released funds after the blocks timeout. This is already done automatically but it may surprise users that the payment could still be claimed all the way up to the blockheight timeout.

(BTW, I have not delved deeply into peer_htlcs, but if we have an HTLC on-channel that has gone past the timelock, and the peer suddenly sends an update_fulfill_htlc, what do we do? I imagine we should drop onchain at that point and not fulfill the HTLC, since the timelock has gone past.)

rustyrussell commented 6 years ago

For 0.6, let's stick with pay, with a timeout, and making sendpay non-blocking. Since the default pay timeout would be something like 30 seconds I expect, you won't need a blow-by-blow description anyway; if it fails you'll get the latest status.

getinfo has a blockheight field already.

You can't ever cancel payments, you can just stop retrying. Ideally, listpayments would give an indication as to the worst-case timeout.

Writing a 'pay' wrapper seems like a cool project for Someone Else, who can do it any time and independent of our release schedules, as long as we get the APIs right.

And yes: if an HTLC goes past timeout without getting removed by the peer, we drop to chain.

ZmnSCPxj commented 6 years ago

Okay, if we go that route, we need to add a waitsendpay command (even if it is only internal; but if our pay wrapper will use that then I think we should expose it too so that Someone Else can do the pay wrapper) that blocks until some sendpay. And possibly a waitblockheight that waits for the block height to go past some specified block height for monitoring.

So we should add more info in listpayments, possibly provide some kind of filtering on it so we can focus on only one payment. Do we need a log of routing failures at least for each payment? Users think getroute returning a valid route means it can reach the payee but this is not true; we may need to tell users that we tried those routes but some channel/node failed to propagate.

ZmnSCPxj commented 6 years ago

Okay, so from @rustyrussell :

  1. Make sendpay asynchronous (return immediately). Add a waitsendpay that waits for a sendpay attempt to complete. sendpay continues to be a "low-level" interface that just initiates a single payment attempt and never retries.
  2. Put a timeout to pay, default 20 seconds (20 because that is how long temporary channel bans are). If a single sendpay fails, it retries, but only up to the timeout. When timing out, warn also that the payment could still push through after the timeout.
  3. Save channels and nodes that get banned due to routing failure somewhere somehow. When to delete this information so it does not keep taking up space in memory/disk?
  4. Expand the information that listpayments provides.
ZmnSCPxj commented 6 years ago

Also note: the terms rhash and r predate the spec, and are called payment_hash and payment_preimage respectively. rhash and r should not appear in the JSON API. We should fi that.

How about dev-rhash command? Or is that acceptable being a dev- command? Should I retain its output as rhash or...?

ZmnSCPxj commented 6 years ago

Came across this bit of code:

https://github.com/ElementsProject/lightning/blob/dace9bfdcfd9bbd68a7661f0dfedf69cf42d7cba/wallet/wallet.c#L1541-L1547

This means that if the payment fails while negotiating with channeld (most likely due to WIRE_AMOUNT_BELOW_MINIMUM) the payment disappears.

The issue is that for a nonblocking sendpay we need to keep track of payments that fail, including payments that fail while negotiating with channeld, so that subsequent waitsendpay has something to wait on.

Our options are:

  1. Modify our sendpay to wait for channeld to reply from the channel_offer_htlc command. This is not quite nonblocking, but catches more problems (invalid expiry, capacity exceeded, payment too small, too many htlcs) and at least retains the property that "early" failures like this do not get saved on DB. The problem is that this complicates the interaction between pay.c and peer_htlcs.c. In particular, what happens if channeld dies after we send the channel_offer_htlc but before channeld is able to send the channel_offer_htlc_reply? Need to handle that case; I worry if there are other cases I miss!
  2. Remove the above and just do assert(!payment) i.e. unstored payments cannot be updated. And perform a wallet_payment_store before send_payment returns without an error. But this mostly invalidates the point of having unstored payments: we store them as soon as we can.

If nobody objects, I shall go with option 2 above.

bitonic-cjp commented 6 years ago

What is a nonstored payment?

I'd expect, if I perform a sendpay command and it returns without an exception, I have changed the (non-volatile) state of lightningd in such a way that, from now on, it tries to perform the given payment. From now on, I should be able to query the status of that payment.

If I receive an exception, the new state of lightningd might depend on the exception type, but generally speaking, the payment probably wasn't started and the state of lightningd wasn't changed.

If lightningd dies before I receive anything, I don't know whether sendpay was successful. After a restart of lightningd, I should be able to determine this by asking lightningd about the status of payments.

If lightningd doesn't die, but doesn't reply anything either (or if for some reason I can't determine whether lightningd has died), I should know what the maximum duration of the sendpay command is, so I can set a time-out on it and declare lightningd unresponsive from that point on(*).

BTW I think this is true for all commands I'd like to use: they should all have a quite short maximum duration, that only depends on things under local control (e.g. CPU, disk performance), and not on peer behavior. From the POV of the (GUI) application, it's OK if lightningd waits for channeld, as long as channeld doesn't wait for peers before it replies to lightningd. It's also OK if lightningd doesn't wait for channeld, to be robust against bugs in channeld; that's an internal design decision for lightningd to make, which doesn't need to impact the API.

(*) I might continue to check in the background to discover when lightningd becomes responsive again.

ZmnSCPxj commented 6 years ago

What is a nonstored payment?

This concept was added so that we do not save payments until the HTLC is committed to by adding it to a commitment transaction on our side. Until the commitment is made, our offer to pay is not backed by any transaction that we can transmit onchain in case of bad behavior. So we should not save the payment on-disk, since the payment might not even be accepted by the first peer we are talking to. We only save the payment on-disk if the payment has been provisionally accepted as an HTLC by the peer providing a signature to the commitment transaction containing it.

(payments go through 4 steps: (1) first it is an offer from the payer for an HTLC (not an HTLC yet), (2) then it is an HTLC on some commitment transaction, (3) then it is an offer from the payee to provide the hash preimage (with the HTLC still on the commitment transaction in case of trouble happening), (4) then it becomes an actual movement of funds from payer to payee (when the payer confirms the preimage is correct). until step 2 we keep the payment unstored.)

The issue is the nonblocking sendpay. We can do the following:

  1. Return immediately with 'completed' : false even if the payment is still unstored. Store the payment on failure. But if lightningd crashes before payment can be committed to, the payment is forgotten on startup as it is not committed to.
  2. Wait for payment to become committed, which stores the payment on-disk, before sendpay returns, and return with 'completed' : false (the payment completes when the peer provides us the preimage or the peer returns a failure). But if the peer stalls, the sendpay is not nonblocking after all.

I suppose I shall have to ask @rustyrussell for advice on how to move forward with this. I am unsure what the peer_htlcs code assumes about payments that have been stored on-disk; if peer_htlcs assumes that payments on-disk are already committed to in some commitment transaction then it would be unsafe to store them on-disk and the above two solutions are our only solutions, unless we audit peer_htlcs again to check this....

This part of handling payments has been getting a bit of churn already over the past two months (splitting struct payment from struct invoice, splitting struct payment from struct pay_command, removing struct pay_command, creating a new struct pay that is morally equivalent to struct pay_command except in payalgo.c, creating a concept of an "unstored' and "stored" struct payment, and now my conundrum about payments...) suggesting we have not yet come to some good conclusion on how to handle payments both on-disk and in-memory....... sigh.

bitonic-cjp commented 6 years ago

I guess this is also an opportunity to learn about the internal architecture of C-Lightning (or is this going too far off-topic?).

In Amiko Pay, I had an architecture where the only non-busy state is to wait for events (API calls, incoming messages or scheduled time-outs). Whenever an event happens, it is processed, resulting in a list of messages to be sent out and an update of the node state (maybe including a new time-out schedule). Then, the node state update is committed to disk, followed by sending out the messages. This way, being non-blocking is the only natural mode of operation, and if you wanted blocking, you'd implement it in a higher-level layer. Note that Amiko Pay wasn't perfect: for simplicity's sake it always replaced the entire state on-disk, which isn't the most efficient thing to do.

From this background it's strange to see that being non-blocking is apparently (sometimes) difficult in C-Lightning. I'd say that, once you initiate a payment by sending out the very first message in the sequence, you already have a state that's worth storing, and a state in which you can return a sendpay command.

Edit: that state would be something like "offered payment to <peer0>; no HTLC yet; not tried any other peers yet". It can later promote, for instance, to "accepted by <peer0>; offered HTLC" or "offered payment to <peer1>; no HTLC yet; rejected by <peer0>; not tried any other peers yet" or "rejected by all peers".

ZmnSCPxj commented 6 years ago

I think part of the problem here is that I and @rustyrussell have been pulling at struct payment in opposite directions for different purposes: I to get features needed by the RPC, Rusty to get features needed by the peer communications/HTLC. I am now hesitant to pull at the structure again; it might be better to sit down and redesign that part (modulo having to keep compatibility with existing database!).

So the issue is not so much about nonblocking being difficult here, but rather, with a lack of familiarity with another part of the code that potentially interacts with code I am modifying. So code diving for now...

In any case maybe the "unstored" payment state is simply premature optimization; but I still have to check more deeply into the HTLC side to determine if I can safely store payments immediately or not. That was what was done before, and then "unstored" payments came in during a massive reorg/cleanup of HTLC handling, so maybe there is now something on the HTLC side that requires the database not to contain payments that do not have HTLCs committed to yet.