ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.84k stars 901 forks source link

attempted large pay to test out mpp, got a backtrace #3846

Closed thestick613 closed 4 years ago

thestick613 commented 4 years ago
pay: FATAL SIGNAL 11 (version v0.8.2-455-gde096ee)
0x41548a send_backtrace
        common/daemon.c:38
0x415514 crashdump
        common/daemon.c:51
0x7f98626a24af ???
        ???:0
0x407781 payment_result_infer
        plugins/libplugin-pay.c:631
0x40b060 payment_waitsendpay_finished
        plugins/libplugin-pay.c:662
0x406839 handle_rpc_reply
        plugins/libplugin.c:539
0x406966 rpc_read_response_one
        plugins/libplugin.c:662
0x4069ee rpc_conn_read_response
        plugins/libplugin.c:681
0x4242ea next_plan
        ccan/ccan/io/io.c:59
0x4247aa do_plan
        ccan/ccan/io/io.c:407
0x4247d5 io_ready
        ccan/ccan/io/io.c:417
0x425f70 io_loop
        ccan/ccan/io/poll.c:445
0x407449 plugin_main
        plugins/libplugin.c:1284
0x404739 main
        plugins/pay.c:2010
0x7f986268d82f ???
        ???:0
0x402e08 ???
        ???:0
0xffffffffffffffff ???
        ???:0
2020-07-17T09:24:45.608Z INFO plugin-pay: Killing plugin: Plugin exited before completing handshake.

Now pay doesn't work anymore, i think i should restart lightningd

ZmnSCPxj commented 4 years ago

Yes, pay plugin dying means the pay command disappears. We should really implement #3213 such that an important plugin dying also causes lightningd to kill itself, and hopefully automated systems by the operator can restart lightningd.

Anyway.... This crashed: https://github.com/ElementsProject/lightning/blob/fe119fc8fd2724e0b1660b063233dfd74e27bc04/plugins/libplugin-pay.c#L631

The code inherently assumes that r is non-NULL. So let us check the only caller to the function payment_result_infer:https://github.com/ElementsProject/lightning/blob/fe119fc8fd2724e0b1660b063233dfd74e27bc04/plugins/libplugin-pay.c#L661-L662

The r argument is the second argument, which is p->result, which we just got from the tal_sendpay_result_from_json call.

So let us dig into tal_sendpay_result_from_json: https://github.com/ElementsProject/lightning/blob/fe119fc8fd2724e0b1660b063233dfd74e27bc04/plugins/libplugin-pay.c#L591-L592

Since we enabled a warning for unused labels, and upgraded all warnings as errors, the existence of a fail label implies the existence of a goto fail somewhere else, and tal_free always returns NULL. This implies that we are not correctly handling the case where an error in the JSON result exists.

@cdecker is it safe to return in payment_result_infer immediately if r is NULL?

cdecker commented 4 years ago

@thestick613 sorry for this, I pushed a small partial fix to #3839, that'll tell us why we were unable to parse the waitsendpay result. Could you update to that and try again?

cdecker commented 4 years ago

Also notice that you can just reload the plugin after a crash by calling lightning-cli plugin start /path/to/pay and it'll not require a restart (also works if you recompiled the plugin in-between) :-)

thestick613 commented 4 years ago

I have updated, and i think i got the same result. Will paste only a part of the log:

2020-07-18T06:14:52.100Z INFO plugin-pay: Split into 397 sub-payments due to initial size (4000000000msat > 10000000msat)
2020-07-18T06:14:58.887Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:58.892Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:58.892Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
...
2020-07-18T06:14:58.936Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:58.937Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:58.944Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:58.947Z UNUSUAL plugin-pay: Could not update the channel hint for 630000x2222x0/1. Could be a concurrent `getroute` call.
2020-07-18T06:14:58.953Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:59.080Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:14:59.080Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
...
2020-07-18T06:15:00.818Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:15:00.818Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (WIRE_TEMPORARY_CHANNEL_FAILURE: Too many HTLCs)
2020-07-18T06:15:01.670Z INFO XXX-chan#74914: htlc 32 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:01.705Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)
2020-07-18T06:15:01.896Z INFO XXX-chan#74914: htlc 46 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:01.898Z INFO XXX-chan#74914: htlc 43 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:01.899Z INFO XXX-chan#74914: htlc 34 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
...
2020-07-18T06:15:01.941Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)
2020-07-18T06:15:01.951Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)
2020-07-18T06:15:01.952Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)
2020-07-18T06:15:01.953Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)
2020-07-18T06:15:01.953Z INFO plugin-pay: failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)
...
2020-07-18T06:15:06.539Z INFO yyy-chan#29416: htlc 233 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:07.150Z INFO yyy-chan#29416: htlc 234 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:07.694Z INFO yyy-chan#29416: htlc 235 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:07.970Z INFO yyy-chan#29416: htlc 236 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
pay: FATAL SIGNAL 11 (version v0.8.2-460-g055cfd1)
0x41548a send_backtrace
        common/daemon.c:38
0x415514 crashdump
        common/daemon.c:51
0x7fb3ad0784af ???
        ???:0
0x407781 payment_result_infer
        plugins/libplugin-pay.c:631
0x40b060 payment_waitsendpay_finished
        plugins/libplugin-pay.c:662
0x406839 handle_rpc_reply
        plugins/libplugin.c:539
0x406966 rpc_read_response_one
        plugins/libplugin.c:662
0x4069ee rpc_conn_read_response
        plugins/libplugin.c:681
0x424339 next_plan
        ccan/ccan/io/io.c:59
0x4247f9 do_plan
        ccan/ccan/io/io.c:407
0x424824 io_ready
        ccan/ccan/io/io.c:417
0x425fbf io_loop
        ccan/ccan/io/poll.c:445
0x407449 plugin_main
        plugins/libplugin.c:1284
0x404739 main
        plugins/pay.c:2010
0x7fb3ad06382f ???
        ???:0
0x402e08 ???
        ???:0
0xffffffffffffffff ???
        ???:0
2020-07-18T06:15:08.870Z INFO plugin-pay: Killing plugin: Plugin exited before completing handshake.
2020-07-18T06:15:09.043Z INFO yyy-chan#29416: htlc 237 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:09.206Z INFO yyy-chan#29416: htlc 238 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:09.859Z INFO yyy-chan#29416: htlc 239 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-07-18T06:15:09.859Z INFO yyy-chan#29416: htlc 242 failed from 0th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
thestick613 commented 4 years ago

Oh wait, did i need to manually merge https://github.com/ElementsProject/lightning/pull/3839 ?

cdecker commented 4 years ago

I'll merge it now, since we found another one (and the PR title was wrong)