ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Testnet channel 1451585x82x0 crashes routing #3517

Closed s-tikhomirov closed 4 years ago

s-tikhomirov commented 4 years ago

Issue and Steps to Reproduce

I understand that this is quite an edge case; still, I'd like my lightningd not to crash.

I'm calling getroute from a plugin to find a route between src and dst. For my purposes, I put many restrictions on getroute so that it returns the specific channel between the two nodes. Among other calls, I call getroute where the very channel between src and dst is excluded and maxhops = 1. I expect it to either return with "No route found", or return a parallel channel between src and dst if one exists.
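
For reference, the same request can be issued outside a plugin with pyln-client's LightningRpc. This is only a sketch: the socket path is assumed from the lightning-dir shown in the getinfo output below, and the amount is the one reported as crashing.

from pyln.client import LightningRpc, RpcError

# Sketch only: issue the getroute request described above directly over the
# RPC socket (path assumed from lightning-dir in the getinfo output below).
rpc = LightningRpc("/home/sergei/.lightning/testnet/lightning-rpc")

scid = "1451585x82x0"
chan = rpc.listchannels(scid)["channels"][0]
exclude = [scid + "/0", scid + "/1"]  # exclude both directions of the channel itself

try:
    # With maxhops=1, only a parallel channel between the two nodes could
    # satisfy this; "Could not find a route" is the expected outcome, but the
    # log below shows gossipd crashing on the equivalent request instead.
    rpc.getroute(chan["source"], 4294840000, riskfactor=1,
                 fromid=chan["destination"], exclude=exclude, maxhops=1)
except RpcError as e:
    print(e)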

For one particular channel on Bitcoin testnet, I run into problems:

2020-02-13T12:28:22.131Z **BROKEN** gossipd: FATAL SIGNAL 6 (version v0.8.0-40-g899f5de)
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: common/daemon.c:46 (send_backtrace) 0x55fdc9118576
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: common/daemon.c:54 (crashdump) 0x55fdc91185c1
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f82920a5f1f
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f82920a5e97
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f82920a7800
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f8292097399
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f8292097411
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: gossipd/routing.c:1223 (find_shorter_route) 0x55fdc911166c
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: gossipd/routing.c:1353 (find_route) 0x55fdc911207e
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: gossipd/routing.c:2687 (get_route) 0x55fdc91134b8
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:919 (getroute_req) 0x55fdc910957f
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:1552 (recv_req) 0x55fdc910b30d
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: common/daemon_conn.c:31 (handle_read) 0x55fdc91186b0
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x55fdc9136ba1
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x55fdc9137090
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x55fdc91370bd
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: ccan/ccan/io/poll.c:445 (io_loop) 0x55fdc9138989
2020-02-13T12:28:22.132Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:1666 (main) 0x55fdc910b600
2020-02-13T12:28:22.132Z INFO 0303ba0fed62039df562a77bfcec887a9aa0767ff6519f6e2baa1309210544cd3d-channeld-chan#7: Peer connection lost
2020-02-13T12:28:22.133Z INFO 0303ba0fed62039df562a77bfcec887a9aa0767ff6519f6e2baa1309210544cd3d-chan#7: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2020-02-13T12:28:22.133Z INFO 0277622bf4c497475960bf91bd3c673a4cb4e9b589cebfde9700c197b3989cc1b8-channeld-chan#8: Peer connection lost
2020-02-13T12:28:22.133Z INFO 0277622bf4c497475960bf91bd3c673a4cb4e9b589cebfde9700c197b3989cc1b8-chan#8: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f8292088b96
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x55fdc9105299
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0xffffffffffffffff
2020-02-13T12:28:22.133Z **BROKEN** gossipd: FATAL SIGNAL (version v0.8.0-40-g899f5de)
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: common/daemon.c:46 (send_backtrace) 0x55fdc9118576
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: common/status.c:206 (status_failed) 0x55fdc911ba28
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: common/subdaemon.c:25 (status_backtrace_exit) 0x55fdc911bb20
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: common/daemon.c:57 (crashdump) 0x55fdc91185c7
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f82920a5f1f
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f82920a5e97
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f82920a7800
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f8292097399
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f8292097411
2020-02-13T12:28:22.133Z **BROKEN** gossipd: backtrace: gossipd/routing.c:1223 (find_shorter_route) 0x55fdc911166c
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: gossipd/routing.c:1353 (find_route) 0x55fdc911207e
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: gossipd/routing.c:2687 (get_route) 0x55fdc91134b8
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:919 (getroute_req) 0x55fdc910957f
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:1552 (recv_req) 0x55fdc910b30d
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: common/daemon_conn.c:31 (handle_read) 0x55fdc91186b0
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x55fdc9136ba1
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x55fdc9137090
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x55fdc91370bd
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: ccan/ccan/io/poll.c:445 (io_loop) 0x55fdc9138989
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: gossipd/gossipd.c:1666 (main) 0x55fdc910b600
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x7f8292088b96
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0x55fdc9105299
2020-02-13T12:28:22.134Z **BROKEN** gossipd: backtrace: (null):0 ((null)) 0xffffffffffffffff
2020-02-13T12:28:22.134Z **BROKEN** gossipd: STATUS_FAIL_INTERNAL_ERROR: FATAL SIGNAL

I isolated the problem; this is the function which, when called from a plugin, produces the above effect (with a smaller amount it's OK):

@plugin.method("test")
def test_crash(plugin, channel_id="1451585x82x0"):
    channel_pair = plugin.rpc.listchannels(channel_id)["channels"]
    src = channel_pair[0]["source"]
    dst = channel_pair[0]["destination"]
    exclude = [channel_id+"/0", channel_id+"/1"]
    amount =  4294840000 # crashes in /1
    #amount =  4294830000 # OK
    print("Finding route in /0")
    try:
        route = plugin.rpc.getroute(dst, amount, riskfactor=1, 
            fromid=src, exclude=exclude, maxhops=1)["route"]
    except RpcError as e:
        print(e)
        pass
    print("Finding route in /1")
    try:
        route = plugin.rpc.getroute(src, amount, riskfactor=1, 
            fromid=dst, exclude=exclude, maxhops=1)["route"]
    except RpcError as e:
        print(e)
        pass
    return

Note that my node is connected to 1451585x82x0's source (0277622bf4c497475960bf91bd3c673a4cb4e9b589cebfde9700c197b3989cc1b8) but not to destination (03367c6e59a9924d2eb188ce43413dd6c76db102c3c0516d51ccea324b490e3bfa). I'm also connected to 0303ba0fed62039df562a77bfcec887a9aa0767ff6519f6e2baa1309210544cd3d (its ID also appears in the crash trace).

getinfo output

{
   "id": "026e7cb4237b4042db75a991af29c2d7704780c8e7fbebe5fdbad8f38e958dfba5",
   "alias": "clightningM",
   "color": "026e7c",
   "num_peers": 5,
   "num_pending_channels": 0,
   "num_active_channels": 3,
   "num_inactive_channels": 3,
   "address": [],
   "binding": [
      {
         "type": "ipv4",
         "address": "127.0.0.1",
         "port": 9737
      }
   ],
   "version": "v0.8.0-40-g899f5de",
   "blockheight": 1665789,
   "network": "testnet",
   "msatoshi_fees_collected": 0,
   "fees_collected_msat": "0msat",
   "lightning-dir": "/home/sergei/.lightning/testnet"
}
m-schmoock commented 4 years ago

the very channel between src and dst is excluded, and maxhops = 1. I expect it to either return with "No route found", or return a parallel channel between src and dst if one exists.

Hmmm, is your case assuming that there can be two or more channels between two c-lightning instances?

For a lot of reasons, this is currently not supported (and likely never will be), though the daemon should not crash in any case.

s-tikhomirov commented 4 years ago

assuming that there can be two or more channels between two c-lightning instances

I'm assuming there can be two or more channels between two LN instances in general. I know c-lightning doesn't support this, but other implementations do (and the BOLTs allow it). Moreover, looking at network explorers, I see that this is not uncommon.

m-schmoock commented 4 years ago

I'm trying to build a dedicated test case that triggers this, no luck yet...

m-schmoock commented 4 years ago

Can you do a quick retry of your case on the latest master sources?

m-schmoock commented 4 years ago

I'm trying something like this:

# Meant for the c-lightning pytest suite: node_factory and bitcoind are suite
# fixtures, and these imports match what the suite's test modules use.
import pytest
from pyln.client import Millisatoshi, RpcError


def test_find_parallel_route(node_factory, bitcoind):
    # SETUP
    #  l1 --> l2 <--+
    #  |            |
    #  +---> l3  ---+
    l1, l2, l3 = node_factory.get_nodes(3)
    l1.connect(l2)
    l1.connect(l3)
    l3.connect(l2)
    nodes = [l1, l2, l3]
    scid12 = l1.fund_channel(l2, 10**6)
    scid13 = l1.fund_channel(l3, 10**6)
    scid32 = l3.fund_channel(l2, 10**6)
    scids = [scid12, scid13, scid32]
    l1id = l1.rpc.getinfo()["id"]
    l2id = l2.rpc.getinfo()["id"]
    l3id = l3.rpc.getinfo()["id"]

    # wait for each other's gossip
    bitcoind.generate_block(6)
    for n in nodes:
        for scid in scids:
            n.wait_channel_active(scid)

    # bring the channels to some balance, so routing has more options
    l1.pay(l2, Millisatoshi(10**9) / 2)
    l1.pay(l3, Millisatoshi(10**9) / 2)
    l3.pay(l2, Millisatoshi(10**9) / 2)

    amount = Millisatoshi(10**9) / 4

    # now try to find a one-hop alternative route from
    # l1 to l2 or vice versa on both nodes.
    with pytest.raises(RpcError, match=r'Could not find a route'):
        r = l1.rpc.getroute(
            l2id,
            amount,
            riskfactor=1,
            fromid=l1id,
            maxhops=1,
            exclude=[scid12 + '/0', scid12 + '/1'])

    with pytest.raises(RpcError, match=r'Could not find a route'):
        r = l1.rpc.getroute(
            l1id,
            amount,
            riskfactor=1,
            fromid=l2id,
            maxhops=1,
            exclude=[scid12 + '/0', scid12 + '/1'])

    with pytest.raises(RpcError, match=r'Could not find a route'):
        r = l2.rpc.getroute(
            l2id,
            amount,
            riskfactor=1,
            fromid=l1id,
            maxhops=1,
            exclude=[scid12 + '/0', scid12 + '/1'])

    with pytest.raises(RpcError, match=r'Could not find a route'):
        r = l2.rpc.getroute(
            l1id,
            amount,
            riskfactor=1,
            fromid=l2id,
            maxhops=1,
            exclude=[scid12 + '/0', scid12 + '/1'])
m-schmoock commented 4 years ago

Maybe it's related to the channel's capacity; as you pointed out, using an amount near the total capacity (4294840000msat) causes it for testnet channel 1451585x82x0, which has a capacity of 5M sat.

m-schmoock commented 4 years ago

Nah, I can't reproduce this with c-lightning using such a test case, for the very reason that it does not support multiple channels between peers. I need to construct the test in a way that gossip sees parallel channels somewhere else.

s-tikhomirov commented 4 years ago

Adding a log with log-level=DEBUG:

logTest.txt

a quick retry of your case on the latest master sources

I'm a bit reluctant, as I worry about breaking my current setup... Would I be able to roll back easily?

s-tikhomirov commented 4 years ago

for the very reason it does not support multiple channels between peers

But as far as I see on 1ML, this is the only channel between those peers...

Maybe related to the channels capacity

But it's not that close to the limit: it's still 700+k sats short of the total capacity. It is close to the max HTLC amount of 4294967295msat, that's true.
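
A quick check of the numbers (a sketch; the 5M sat capacity figure is taken from the comment above):

# Compare the crashing amount with the channel capacity (5M sat, per the
# comment above) and with the maximum HTLC amount of 2**32 - 1 msat.
capacity_msat = 5_000_000 * 1_000   # 5,000,000,000 msat
amount_msat = 4_294_840_000         # amount that triggers the crash
max_htlc_msat = 2**32 - 1           # 4,294,967,295 msat

print(capacity_msat - amount_msat)  # 705160000 msat, i.e. 700+k sats of headroom
print(max_htlc_msat - amount_msat)  # only 127295 msat below the max HTLC amount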

m-schmoock commented 4 years ago

Yup, it's just one channel between those particular nodes. The assertion that fires is:

gossipd/routing.c +1223

    SUPERVERBOSE("Running shortest path from %s -> %s",
             type_to_string(tmpctx, struct node_id, &dst->id),
             type_to_string(tmpctx, struct node_id, &src->id));
    dijkstra(rstate, dst, NULL, riskfactor, 1, fuzz, base_seed,
         unvisited, shortest_cost_function);
    dijkstra_cleanup(unvisited);

    /* This must succeed, since we found a route before */
    short_route = build_route(ctx, rstate, dst, src, me, riskfactor, 1,
                  fuzz, base_seed, fee);
    assert(short_route);
rustyrussell commented 4 years ago

Reproduced here:

$ lightning-cli getroute id=0277622bf4c497475960bf91bd3c673a4cb4e9b589cebfde9700c197b3989cc1b8 fromid=03367c6e59a9924d2eb188ce43413dd6c76db102c3c0516d51ccea324b490e3bfa exclude='[1451585x82x0/0, 1451585x82x0/1]' msatoshi=4294840000 maxhops=1 riskfactor=1
lightning-cli: reading response: socket closed
s-tikhomirov commented 4 years ago

Found another case: 1665803x253x1/0 with amount 988781200msat (988781100 is OK): logTest.txt.