ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.84k stars 901 forks source link

Taken pointer in lightning_gossipd triggered by sendinvoiceless plugin #3757

Closed ghost closed 4 years ago

ghost commented 4 years ago

Issue and Steps to Reproduce

EDIT: showing the contents of the output file instead of the link:

2020-05-31T19:40:02.694Z INFO plugin-sendinvoiceless.py:     - 02ad6fb8d693dc1e4569bcedefadf5f72a931ae027dc0f0c544b34c1c6f3b9a02b   623064x1516x0  1004msat
2020-05-31T19:40:02.694Z INFO plugin-sendinvoiceless.py:     - 0231eccc6510eb2e1c97c8a190d6ea096784aa7c358355442055aac8b20654f932   615503x2435x0  1002msat
2020-05-31T19:40:02.694Z INFO plugin-sendinvoiceless.py:     - 02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6   587187x1846x1  1000msat
2020-05-31T19:40:02.696Z INFO lightningd: Sending 39999046msat over 6 hops to deliver 1000msat
2020-05-31T19:40:31.300Z INFO 0231eccc6510eb2e1c97c8a190d6ea096784aa7c358355442055aac8b20654f932-chan#16782: htlc 1346 failed from 2th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-05-31T19:40:32.614Z INFO plugin-sendinvoiceless.py: RpcError: RPC call failed: method: waitsendpay, payload: {'payment_hash': '23b2a502c626dc752f91a42a8aa10ae1be905e8e2a8d55a49bbb0cb030be7841', 'timeout': 6}, error: {'code': 204, 'message': 'failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)', 'data': {'id': 190, 'payment_hash': '23b2a502c626dc752f91a42a8aa10ae1be905e8e2a8d55a49bbb0cb030be7841', 'destination': '02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6', 'msatoshi': 1000, 'amount_msat': 1000msat, 'msatoshi_sent': 39999046, 'amount_sent_msat': 39999046msat, 'created_at': 1590954025, 'status': 'pending', 'erring_index': 3, 'failcode': 4103, 'failcodename': 'WIRE_TEMPORARY_CHANNEL_FAILURE', 'erring_node': '021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e', 'erring_channel': '623064x1516x0', 'erring_direction': 0, 'raw_message': '10070000'}}
lightning_gossipd: Outstanding taken pointers: (nil)
lightningd: gossipd failed (exit status 1), exiting.
pay: Lost connection to the RPC socket.
autoclean: Lost connection to the RPC socket.

EDIT: lightning_gossipd: Outstanding taken pointers: (nil) doesn't sound good to me ;-)

getinfo output

"version": "v0.8.2-178-g8d98b99"

cdecker commented 4 years ago

This is very strange indeed. The code that is reporting an error is this:

https://github.com/ElementsProject/lightning/blob/8d98b99bde55ad0844761b584311543c9051b6f7/common/daemon.c#L81-L92

(permalink to the commit you reported, but it is unchanged since then)

The code tests whether we have any taken pointers, assigning to t, checks whether t is NULL, and if not it aborts with an error message. From the error message it seems that the pointer is NULL / (nil), so it should not have triggered the abort, but it does.

I have no idea how that could happen. Paging Dr @rustyrussell :-)

ZmnSCPxj commented 4 years ago

It could be taken_any returns a non-NULL pointer to the string "(nil)". So possibly we may need to look inside at taken_any as well.

ZmnSCPxj commented 4 years ago

Looking at taken_any, it does a sprintf(buffer, "%p", takenarr[0]). So it could indeed end up pointing to a string "(nil)", if takenarr[0] is NULL, num_taken is not 0, and there is no labelarr. This implies somebody did a take on a NULL, I think.

ghost commented 4 years ago

Haha, indeed:

#include <stdio.h>

char* taken_any();

int main()
{
  const char *t;

  t = taken_any();

  if (t)
    printf("Outstanding taken pointers: %s\n", t);

  return 0;
}

char* taken_any() {

  static char pointer_buf[32];
  static char* takenarr;

  takenarr=NULL;

  sprintf(pointer_buf, "%p", takenarr);

  return pointer_buf;

}

prints: Outstanding taken pointers: (nil)

ghost commented 4 years ago

By the way, here is another one from some time ago which maybe gives a little more information after the "(nil)".

EDIT: showing the contents of the output file instead of the link:

2020-06-15T04:12:26.389Z INFO plugin-sendinvoiceless.py:     - 02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6   613420x1938x0  1000msat
2020-06-15T04:12:26.390Z INFO lightningd: Sending 40009003msat over 6 hops to deliver 1000msat
2020-06-15T04:12:33.386Z INFO 032679fec1213e5b0a23e066c019d7b991b95c6e4d28806b9ebd1362f9e32775cf-chan#46115: htlc 31 failed from 2th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-06-15T04:12:33.734Z INFO plugin-sendinvoiceless.py: RpcError: RPC call failed: method: waitsendpay, payload: {'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'timeout': 46}, error: {'code': 204, 'message': 'failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)', 'data': {'id': 245, 'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'destination': '02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6', 'msatoshi': 1000, 'amount_msat': 1000msat, 'msatoshi_sent': 40009003, 'amount_sent_msat': 40009003msat, 'created_at': 1592194346, 'status': 'pending', 'erring_index': 3, 'failcode': 4103, 'failcodename': 'WIRE_TEMPORARY_CHANNEL_FAILURE', 'erring_node': '021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e', 'erring_channel': '623356x854x1', 'erring_direction': 0, 'raw_message': '10070000'}}
lightning_gossipd: Outstanding taken pointers: (nil)
2020-06-15T04:12:33.841Z INFO 03e691f81f08c56fa876cc4ef5c9e8b727bd682cf35605be25d48607a802526053-channeld-chan#48650: Peer connection lost
2020-06-15T04:12:33.842Z INFO 03e691f81f08c56fa876cc4ef5c9e8b727bd682cf35605be25d48607a802526053-chan#48650: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2020-06-15T04:12:33.842Z INFO 021d2436cab847373a4212bf6d754ead5304f5d0791479643893a837b295f3441c-channeld-chan#30848: Peer connection lost
2020-06-15T04:12:33.843Z INFO 021d2436cab847373a4212bf6d754ead5304f5d0791479643893a837b295f3441c-chan#30848: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2020-06-15T04:12:33.843Z INFO 03c4ffc2c7396f9df945f7411c34eb3ad98fcc8429afe79458a7a682ff6cf06859-channeld-chan#48578: Peer connection lost
2020-06-15T04:12:33.843Z INFO 03c4ffc2c7396f9df945f7411c34eb3ad98fcc8429afe79458a7a682ff6cf06859-chan#48578: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
lightningd: gossipd failed (exit status 1), exiting.
keysend: Lost connection to the RPC socket.
autoclean: Lost connection to the RPC socket.
fundchannel: Lost connection to the RPC socket.
pay: Lost connection to the RPC socket.
ZmnSCPxj commented 4 years ago

A part that might be the culprit:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/gossipd/gossipd.c#L1499

The channel_update is loaded from this:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/gossipd/gossipd.c#L1481-L1497

Those fromwire_ functions will load NULL into channel_update if the length of the channel update is 0 (which should not happen, but shrug it comes from remote node, which might be buggy or whatev).

The interesting bit is your logs:

2020-06-15T04:12:33.734Z INFO plugin-sendinvoiceless.py: RpcError: RPC call failed: method: waitsendpay, payload: {'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'timeout': 46}, error: {'code': 204, 'message': 'failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)', 'data': {'id': 245, 'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'destination': '02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6', 'msatoshi': 1000, 'amount_msat': 1000msat, 'msatoshi_sent': 40009003, 'amount_sent_msat': 40009003msat, 'created_at': 1592194346, 'status': 'pending', 'erring_index': 3, 'failcode': 4103, 'failcodename': 'WIRE_TEMPORARY_CHANNEL_FAILURE', 'erring_node': '021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e', 'erring_channel': '623356x854x1', 'erring_direction': 0, 'raw_message': '10070000'}}

See the raw_message bit? It says 10070000. That is hex: two bytes of message ID (0x1007) followed by two 0 bytes of length --- precisely what would cause fromwire_temporary_channel_failure to set channel_update to NULL.

Note in particular 0x1007 is:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/wire/extracted_onion_wire_csv#L21-L23

UPDATE is defined here:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/wire/onion_defs.h#L17

Thus WIRE_TEMPORARY_CHANNEL_FAILURE is 0x1007 (UPDATE|7).

We see as well that the bytes after the message code is the length of the channel update, which the remote node gave as 0, which causes the generated fromwire_ function to set channel_update to NULL, whih gets passed into take and possibly utterly destroys the take system (maybe).

So what we can do is try to add this code:

if (!channel_update)
        return NULL;

before:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/gossipd/gossipd.c#L1499

And try to see if sendinvoiceless triggers the bug again or not.

ZmnSCPxj commented 4 years ago

This seems further corroborated by inspecting patch_channel_update:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/gossipd/gossipd.c#L1450-L1466

If channel_update is a taken pointer to NULL, then it goes into the else branch, which just calls into tal_dup_talarr:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/gossipd/gossipd.c#L1463-L1465

The implementation of tal_dup_talarr is:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/common/utils.c#L145-L150

The culprit is this:

https://github.com/ElementsProject/lightning/blob/5db69f1b414f04c204f086424c96d10c20a8be39/common/utils.c#L147-L148

If the input pointer is NULL we return NULL as well, BUT if it is a taken NULL pointer, we do not pass it to the taken function, which is needed in order to clear the taken-ness of NULL.

So a better fix would be:

void *tal_dup_talarr_(const tal_t *ctx, const tal_t *src, const char *label)   
{       
        if (!src) {
                (void) taken(src);
                return NULL; 
        }
        return tal_dup_(ctx, src, 1, tal_bytelen(src), 0, label);              
}

This makes tal_dup_talarr correctly take NULL inputs.

ghost commented 4 years ago

@ZmnSCPxj thanks for all the explanations, really appreciated. Of course I don't get all the details but the general idea is clear: a take of a NULL pointer should be handled correctly. It will be hard to test whether this solves my issue because I have the impression the problematic node (021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e) has solved his issue. He wins my trading game (too :-)) often and gets his reward via sendinvoiceless but lately all payments go well. So probably we were "lucky" to see this bug. Anyway, on logical grounds the commit seems to solve the issue and IMHO the issue can therefore be closed.

ZmnSCPxj commented 4 years ago

Hmm. Well it "should" be possible to write a plugin used in testing that will force an onion failure of WIRE_TEMPORARY_CHANNEL_FAILURE with a length-0 channel_update in htlc_accepted hook, which should tickle this bug as well. Let me try to see if I can tickle that bug this way.

ZmnSCPxj commented 4 years ago

Got a similar message with the proposed test. https://travis-ci.org/github/ZmnSCPxj/lightning/jobs/700837757#L2572

ghost commented 4 years ago

I found it in the log looking for "(nil)". Nice to know the bug can be reproduced (and it's not "just me" :-)). When I restarted after the crash, c-lightning rebuild the gossip_store and moved the old one to gossip_store.corrupt. Can you see how the two issues are related? To check whether they are really related, I looked again at the dates of the original files:

-rw-rw-r-- 1 user user 668185 mei 31 21:41 out.txt_crash
-rw------- 1 user user 23807951 mei 31 21:43 /home/user/.lightning/bitcoin/gossip_store.corrupt

As you can see, I restarted c-lightning a few minutes after the crash and indeed it then moved the gossip_store to gossip_store.corrupt. But if I remember correctly sometimes after the "(nil)" crash the gossip_store was still fine.

ZmnSCPxj commented 4 years ago

I think we have the conservative policy that any crash of lightning_gossipd might corrupt the gossip store, so we assume the gossip store is corrupted. It might not necessarily be corrupted. Though do note that the code I pointed out is what extracts a new channel_update from a payment failure, and we generally store channel_updates into the store, so the abort at that point might also be what lead to corrupted gossip stores (e.g. we queue up some updates into the gossip store, then we crash while processing one of them because of this "taken NULL" bug, preventing us from properly writing out the correct data to the gossip store). Not certain; speculation.