Closed ghost closed 4 years ago
This is very strange indeed. The code that is reporting the error is this: (permalink to the commit you reported, but it is unchanged since then)

The code tests whether we have any taken pointers, assigning the result to `t`, checks whether `t` is `NULL`, and if not it aborts with an error message. From the error message it seems that the pointer is `NULL` / `(nil)`, so it should not have triggered the abort, but it does. I have no idea how that could happen. Paging Dr @rustyrussell :-)
It could be that `taken_any` returns a non-`NULL` pointer to the string `"(nil)"`. So possibly we need to look inside `taken_any` as well.
Looking at `taken_any`, it does a `sprintf(buffer, "%p", takenarr[0])`. So it could indeed end up pointing to the string `"(nil)"` if `takenarr[0]` is `NULL`, `num_taken` is not 0, and there is no `labelarr`. This implies somebody did a `take` on a `NULL`, I think.
Haha, indeed:
```c
#include <stdio.h>

char *taken_any(void);

int main(void)
{
	const char *t;

	t = taken_any();
	if (t)
		printf("Outstanding taken pointers: %s\n", t);
	return 0;
}

/* Mimics the real taken_any(): formats the first taken pointer with %p. */
char *taken_any(void)
{
	static char pointer_buf[32];
	static char *takenarr;

	takenarr = NULL;
	/* glibc prints a NULL pointer as "(nil)", so this returns a
	 * non-NULL pointer to the string "(nil)". */
	sprintf(pointer_buf, "%p", takenarr);
	return pointer_buf;
}
```
prints: `Outstanding taken pointers: (nil)`
By the way, here is another one from some time ago which maybe gives a little more information after the "(nil)".
EDIT: showing the contents of the output file instead of the link:
```
2020-06-15T04:12:26.389Z INFO plugin-sendinvoiceless.py: - 02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6 613420x1938x0 1000msat
2020-06-15T04:12:26.390Z INFO lightningd: Sending 40009003msat over 6 hops to deliver 1000msat
2020-06-15T04:12:33.386Z INFO 032679fec1213e5b0a23e066c019d7b991b95c6e4d28806b9ebd1362f9e32775cf-chan#46115: htlc 31 failed from 2th node with code 0x1007 (WIRE_TEMPORARY_CHANNEL_FAILURE)
2020-06-15T04:12:33.734Z INFO plugin-sendinvoiceless.py: RpcError: RPC call failed: method: waitsendpay, payload: {'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'timeout': 46}, error: {'code': 204, 'message': 'failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)', 'data': {'id': 245, 'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'destination': '02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6', 'msatoshi': 1000, 'amount_msat': 1000msat, 'msatoshi_sent': 40009003, 'amount_sent_msat': 40009003msat, 'created_at': 1592194346, 'status': 'pending', 'erring_index': 3, 'failcode': 4103, 'failcodename': 'WIRE_TEMPORARY_CHANNEL_FAILURE', 'erring_node': '021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e', 'erring_channel': '623356x854x1', 'erring_direction': 0, 'raw_message': '10070000'}}
lightning_gossipd: Outstanding taken pointers: (nil)
2020-06-15T04:12:33.841Z INFO 03e691f81f08c56fa876cc4ef5c9e8b727bd682cf35605be25d48607a802526053-channeld-chan#48650: Peer connection lost
2020-06-15T04:12:33.842Z INFO 03e691f81f08c56fa876cc4ef5c9e8b727bd682cf35605be25d48607a802526053-chan#48650: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2020-06-15T04:12:33.842Z INFO 021d2436cab847373a4212bf6d754ead5304f5d0791479643893a837b295f3441c-channeld-chan#30848: Peer connection lost
2020-06-15T04:12:33.843Z INFO 021d2436cab847373a4212bf6d754ead5304f5d0791479643893a837b295f3441c-chan#30848: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2020-06-15T04:12:33.843Z INFO 03c4ffc2c7396f9df945f7411c34eb3ad98fcc8429afe79458a7a682ff6cf06859-channeld-chan#48578: Peer connection lost
2020-06-15T04:12:33.843Z INFO 03c4ffc2c7396f9df945f7411c34eb3ad98fcc8429afe79458a7a682ff6cf06859-chan#48578: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
lightningd: gossipd failed (exit status 1), exiting.
keysend: Lost connection to the RPC socket.
autoclean: Lost connection to the RPC socket.
fundchannel: Lost connection to the RPC socket.
pay: Lost connection to the RPC socket.
```
A part that might be the culprit: the `channel_update` is loaded from this:

Those `fromwire_` functions will load `NULL` into `channel_update` if the length of the channel update is 0 (which should not happen, but shrug, it comes from the remote node, which might be buggy or whatever).
The interesting bit is your logs:

```
2020-06-15T04:12:33.734Z INFO plugin-sendinvoiceless.py: RpcError: RPC call failed: method: waitsendpay, payload: {'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'timeout': 46}, error: {'code': 204, 'message': 'failed: WIRE_TEMPORARY_CHANNEL_FAILURE (reply from remote)', 'data': {'id': 245, 'payment_hash': '271163d996bb2fda734a70ea2f5ee08a5ed89992289f81b2bd93a5dfabb8a2e2', 'destination': '02888244029c5909593038ab19f269947c720de3423e491791b46c7c92f76279b6', 'msatoshi': 1000, 'amount_msat': 1000msat, 'msatoshi_sent': 40009003, 'amount_sent_msat': 40009003msat, 'created_at': 1592194346, 'status': 'pending', 'erring_index': 3, 'failcode': 4103, 'failcodename': 'WIRE_TEMPORARY_CHANNEL_FAILURE', 'erring_node': '021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e', 'erring_channel': '623356x854x1', 'erring_direction': 0, 'raw_message': '10070000'}}
```
See the `raw_message` bit? It says `10070000`. That is hex: two bytes of message ID (`0x1007`) followed by two zero bytes of length, which is precisely what would cause `fromwire_temporary_channel_failure` to set `channel_update` to `NULL`.
Note in particular that 0x1007 is:

`UPDATE` is defined here:

Thus `WIRE_TEMPORARY_CHANNEL_FAILURE` is `0x1007` (`UPDATE|7`).
We see as well that the bytes after the message code are the length of the channel update, which the remote node gave as 0. That causes the generated `fromwire_` function to set `channel_update` to `NULL`, which gets passed into `take` and possibly utterly destroys the `take` system (maybe).
So what we can do is try to add this code:

```c
if (!channel_update)
	return NULL;
```

before:

And try to see if `sendinvoiceless` triggers the bug again or not.
This seems further corroborated by inspecting `patch_channel_update`:

If `channel_update` is a `take`n pointer to `NULL`, then it goes into the `else` branch, which just calls into `tal_dup_talarr`:

The implementation of `tal_dup_talarr` is:

The culprit is this: if the input pointer is `NULL` we return `NULL` as well, BUT if it is a `take`n `NULL` pointer we never pass it to the `taken` function, which is needed in order to clear the `take`n-ness of `NULL`.
So a better fix would be:

```c
void *tal_dup_talarr_(const tal_t *ctx, const tal_t *src, const char *label)
{
	if (!src) {
		/* Clear the taken-ness of a taken NULL. */
		(void) taken(src);
		return NULL;
	}
	return tal_dup_(ctx, src, 1, tal_bytelen(src), 0, label);
}
```

This makes `tal_dup_talarr` correctly handle taken `NULL` inputs.
@ZmnSCPxj thanks for all the explanations, really appreciated. Of course I don't get all the details, but the general idea is clear: a `take` of a `NULL` pointer should be handled correctly. It will be hard to test whether this solves my issue, because I have the impression the problematic node (021f7430cce1b1ec01675bd575c0785ee2150d1d3167588b7e52a8050d2df2d28e) has solved his issue. He often wins my trading game (too :-)) and gets his reward via `sendinvoiceless`, but lately all payments go well. So we were probably "lucky" to see this bug. Anyway, on logical grounds the commit seems to solve the issue, and IMHO the issue can therefore be closed.
Hmm. Well, it "should" be possible to write a plugin, used in testing, that forces an onion failure of `WIRE_TEMPORARY_CHANNEL_FAILURE` with a length-0 `channel_update` in the `htlc_accepted` hook, which should tickle this bug as well. Let me see if I can tickle the bug this way.
Got a similar message with the proposed test. https://travis-ci.org/github/ZmnSCPxj/lightning/jobs/700837757#L2572
I found it in the log by looking for "(nil)". Nice to know the bug can be reproduced (and it's not "just me" :-)). When I restarted after the crash, c-lightning rebuilt the `gossip_store` and moved the old one to `gossip_store.corrupt`. Can you see how the two issues are related? To check whether they really are, I looked again at the dates of the original files:

```
-rw-rw-r-- 1 user user   668185 mei 31 21:41 out.txt_crash
-rw------- 1 user user 23807951 mei 31 21:43 /home/user/.lightning/bitcoin/gossip_store.corrupt
```

As you can see, I restarted c-lightning a few minutes after the crash, and indeed it then moved the `gossip_store` to `gossip_store.corrupt`. But if I remember correctly, sometimes after the "(nil)" crash the `gossip_store` was still fine.
I think we have the conservative policy that any crash of `lightning_gossipd` might corrupt the gossip store, so we assume it is corrupted; it might not actually be. Though do note that the code I pointed out is what extracts a new `channel_update` from a payment failure, and we generally store `channel_update`s into the store, so the abort at that point might also be what led to corrupted gossip stores (e.g. we queue up some updates for the gossip store, then crash while processing one of them because of this "taken NULL" bug, preventing us from properly writing out the correct data). Not certain; speculation.
Issue and Steps to Reproduce

EDIT: showing the contents of the output file instead of the link.

EDIT:

```
lightning_gossipd: Outstanding taken pointers: (nil)
```

doesn't sound good to me ;-)

getinfo output:

```
"version": "v0.8.2-178-g8d98b99"
```