[TRIAGE] New failures following migration to BOLT11 style payment requests

cdecker commented 6 years ago

Ok, just ran the direct and forwarded payments which were migrated, and the results are rather disappointing. The following is a list of failures linked to the appropriate logs. I haven't yet done a root analysis on too many of them but I thought I'd share the logs early on in the hope that we can ping down the root causes together and spin off specific issues for the various causes.

Ping @Roasbeef, @rustyrussell, @sstone , @cdecker, and @pm47

cdecker commented 6 years ago

The following versions were used (master at the time of the test):

ACINQ/eclair@d0e33f23e9ecb2752a9f35886e29982db8515fc6
ElementsProject/lightning@9fd81ab06daacdeeac859e414d46a6ed8f5159c6
lightningnetwork/lnd@1b69940a848152def1264629d369b3733a2d2319

With the test version 65f8689470124d0575efd5218f7ef60be18ca2ee

Roasbeef commented 6 years ago

Haven't yet dug into the logs yet, but @sstone was by the other day and we did a bit of testing. Initially direct payment wasn't working, but we realized that it was due to eclair not passing through the custom cltv-final-expiry-delta (or w/e we're calling it now lol) when crafting the route for the HTLC. Currently, lnd always sets a custom value for this field. After Fabrice modified his local branch to respect the value, direct payments worked properly. I don't think he's pushed out the fix to their current master branch yet though.

The first failing case illustrates this (eclair -> lnd):

DEBUG:root:lnd: 2017-11-07 16:25:55.160 [ERR] HSWC: Onion payload of incoming htlc(a3991585731deba0ec4021f3105c53bd42d171bf931c207eac71fbec443c4e9f) has incorrect time-lock: expected 643, got 507

sstone commented 6 years ago

Yes I'll push a fix which solves some of these issues but not all of them

Roasbeef commented 6 years ago

For the failing lnd -> c-lightning, it looks like y'all rejected the HTLC lnd sent due to a timelock too near. Looking at the logs, I see we sent an HTLC with a timelock of 561:

DEBUG:root:lnd:  Expiry: (uint32) 561,

When sending the payment, lnd thinks the current height is 555:

DEBUG:root:lnd: 2017-11-07 16:30:04.734 [INF] CRTR: Block 3454f76d97afb6acea61bf20356f9942719b0cc5e774fbb489d8e2fb3b925e37 (height=555) closed 0 channels

While c-lightning (or is that just bitcoind) receives a block at height 556:

DEBUG:root:bitcoind: 2017-11-07 15:30:07 UpdateTip: new best=385973a9d0be93a31d4f7ed5df657d35e42f7bf97b01c688fe2f62718f5fd0cf height=556 version=0x20000000 log2_work=10.121534 tx=587 date='2017-11-07 15:30:07' progress=1.000000 cache=0.0MiB(111txo)

Decoding the payment request on my side I see it sets a final-cltv of 6:

$ ~/go/src/github.com/lightningnetwork/lnd/cmd/lncli# lncli decodepayreq lntb10u1pdqr5tlpp5xk7pv08pym0edqjws5nrzlrx3pplj5duhfzxwsnjs6kk3ylq034sdqjv3jhxcmjd9c8g6t0dccqpx97zj5aepwcmkv5hct6dlgah4g4q53mccgps8r7zkfafuxfdc88qqm6q4tlqjtdyqnu4wwfuswnnjezya8hxpp7qwrxdmgmj99vkxvhspag8d8e
{
    "destination": "02d54ce4d554a22504df955722538fe85a368ec5fc0333ea837a0eb350935b8e9c",
    "payment_hash": "35bc163ce126df96824e8526317c668843f951bcba4467427286ad6893e07c6b",
    "num_satoshis": "1000",
    "timestamp": "1510068607",
    "expiry": "3600",
    "description": "description",
    "description_hash": "",
    "fallback_addr": "",
    "cltv_expiry": "6"
}

So from lnd's PoV, the timelock is set is correct as: 555 + 6 = 561. So I think the issue is just a lack of synchronization between mining the blocks and waiting for both nodes to detect the same block height. We don't currently add any additional delay, yet (for shadow routes, as if we did this "tightness" wouldn't matter).

Roasbeef commented 6 years ago

Looking at test_forwarded_payment[lightning_eclair_lnd], it looks like c-lightning is unable to find a path to lnd once the dust has settled. Looking at the logs, I see lnd (eventually) send out the full announcement messages:

DEBUG:root:lnd: 2017-11-07 16:55:23.525 [INF] DISC: Fully valid channel proof for short_chan_id=1006053139480576 constructed, adding to next ann batch
DEBUG:root:lnd: 2017-11-07 16:55:23.783 [DBG] PEER: Sending ChannelAnnouncement(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1006053139480576) to 127.0.0.1:41664
DEBUG:root:lnd: 2017-11-07 16:55:23.784 [DBG] PEER: Sending ChannelUpdate(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1006053139480576, update_time=2017-11-07 16:55:05 +0100 CET) to 127.0.0.1:41664

It also then gets the chan ann messages from eclair:

DEBUG:root:lnd: 2017-11-07 16:55:25.568 [DBG] PEER: Received ChannelUpdate(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1004953627852800, update_time=2017-11-07 16:55:23 +0100 CET) from 127.0.0.1:41664
DEBUG:root:lnd: 2017-11-07 16:55:25.555 [DBG] PEER: Received ChannelAnnouncement(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1004953627852800) from 127.0.0.1:41664

After that, several blocks are mined but c-lightning is unable to find a path:

DEBUG:root:Received response for pay call: {'error': 'Could not find a route', 'jsonrpc': '2.0', 'id': 0}

One thing to note is that lnd will immediately announce after the number of confirmations has been reached, rather than also waiting for at least 6 blocks to elapse. This should be fixed later this week. Another thing I just noticed is that we don't currently send out a NodeAnnouncement as well when initially announcing the channel, perhaps that's causing some issues?

Roasbeef commented 6 years ago

Looking at test_forwarded_payment[lightning_lnd_lightning], it looks like we only detect that we can propagate all the announcements for a single channel:

DEBUG:root:lnd: 2017-11-07 17:04:32.112 [INF] DISC: Fully valid channel proof for short_chan_id=1177576953413632 constructed, adding to next ann batch
DEBUG:root:lnd: 2017-11-07 17:04:32.286 [INF] DISC: Broadcasting batch of 2 new announcements
DEBUG:root:lnd: 2017-11-07 17:04:32.286 [DBG] SRVR: Broadcasting 2 messages
DEBUG:root:lnd: 2017-11-07 17:04:32.286 [DBG] PEER: Sending ChannelAnnouncement(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1177576953413632) to 127.0.0.1:56780
DEBUG:root:lnd: 2017-11-07 17:04:32.286 [DBG] PEER: Sending ChannelAnnouncement(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1177576953413632) to 127.0.0.1:16333
DEBUG:root:lnd: 2017-11-07 17:04:32.286 [TRC] PEER: writeMessage to 127.0.0.1:56780: (*lnwire.ChannelAnnouncement)(0xc420164700)({

DEBUG:root:lnd: 2017-11-07 17:04:32.286 [DBG] PEER: Sending ChannelUpdate(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=1177576953413632, update_time=2017-11-07 17:04:16 +0100 CET) to 127.0.0.1:56780

As a result, we never send the announcement messages for the second channel. This seems related to #5, will look into it.

Roasbeef commented 6 years ago

Tracking down test_forwarded_payment[lightning_eclair_lnd] a bit more, we only get one AnnounceSignatures message, which is the channel that ends up being fully propagated. As a result, from our PoV we can announce the second channel yet.

DEBUG:root:lnd: 2017-11-07 17:04:32.086 [DBG] PEER: Received AnnounceSignatures(chan_id=34ff3d5a97d9cdcfb065bb849fba12976c84c3366edecdc39216919667004169, short_chan_id=1177576953413632) from 127.0.0.1:56780

I also only see a single instance of:

DEBUG:root:lightningd: lightning_channeld(12163): TRACE: peer_out WIRE_ANNOUNCEMENT_SIGNATURES

Which I'm guessing means c-lightning is about to ann the channel.

cdecker commented 6 years ago

Wow, that's good work guys :-)

For the failing lnd -> c-lightning, it looks like y'all rejected the HTLC lnd sent due to a timelock too near. Looking at the logs, I see we sent an HTLC with a timelock of 561 So from lnd'sPoV, the timelock is set is correct as:555 + 6 = 561`. So I think the issue is just a lack of synchronization between mining the blocks and waiting for both nodes to detect the same block height. We don't currently add any additional delay, yet (for shadow routes, as if we did this "tightness" wouldn't matter).

Ok, will implement a sync barrier to sync btcd, bitcoind, and nodes to the same height.

After that, several blocks are mined but c-lightning is unable to find a path:

Could that be a gossip propagation delay? Maybe I should check that we can actually find a route before attempting to send the actual payment.

One thing to note is that lnd will immediately announce after the number of confirmations has been reached, rather than also waiting for at least 6 blocks to elapse. This should be fixed later this week. Another thing I just noticed is that we don't currently send out a NodeAnnouncement as well when initially announcing the channel, perhaps that's causing some issues?

I think we announce immediately after exchanging the sigs and reaching the depth of 6, though we might want to announce sooner for local operations. It shouldn't really matter, we simply wait for both conditions to be true.

Tracking down test_forwarded_payment[lightning_eclair_lnd] a bit more, we only get one AnnounceSignatures message, which is the channel that ends up being fully propagated.

I assume that is the eclair -> lnd link, since you shouldn't see the raw signatures for the other link. Indeed the other link is exchanged correctly:

DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_out WIRE_ANNOUNCEMENT_SIGNATURES
DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_in WIRE_ANNOUNCEMENT_SIGNATURES
DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_out WIRE_CHANNEL_ANNOUNCEMENT
DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_out WIRE_CHANNEL_UPDATE
DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_out WIRE_NODE_ANNOUNCEMENT

DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_in WIRE_CHANNEL_ANNOUNCEMENT
DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_in WIRE_CHANNEL_UPDATE
DEBUG:root:lightningd: lightning_channeld(21914): TRACE: peer_in WIRE_NODE_ANNOUNCEMENT

So it seems once again that we simply do not wait for gossip to propagate. I'll add a sync point for gossip to settle as well.

sstone commented 6 years ago

can you please try with branch usecltvexpiry ? this and the sync you mention above (sometime we seem to be one block behind lightnind) should take care of most of these issues.

cdecker commented 6 years ago

For the failing lnd -> c-lightning, it looks like y'all rejected the HTLC lnd sent due to a timelock too near. Looking at the logs, I see we sent an HTLC with a timelock of 561 So from lnd'sPoV, the timelock is set is correct as:555 + 6 = 561`. So I think the issue is just a lack of synchronization between mining the blocks and waiting for both nodes to detect the same block height. We don't currently add any additional delay, yet (for shadow routes, as if we did this "tightness" wouldn't matter).

After implementing the sync barrier this turned out to be a red herring. We simply disagreed on the semantics of the min_final_cltv_expiry, and c-lightning was over eager in dropping incoming HTLCs because they had the exact delta on the expiry (see lightningnetwork/lightning-rfc#268 for details). Fixing this in c-lightning results in test number 2 working, and number 1 should be working with eclair's usecltvexpiry branch.

Retesting the payment tests now, keep your fingers crossed that this solves some of the issues.

cdecker commented 6 years ago

Ok, I can confirm that the direct payment issues are now resolved (however the test framework now requires channel announcements which kills the test_direct_payment[lnd_eclair] case again.

All failing tests currently complain about some channels not syncing. I think we can assume that most of this is due to #5, i.e., lnd not setting the announce_channel flag in the open_channel message.

The tests involving eclair are rather flaky, and I have no idea why, maybe @sstone can help me find a procedure that reliably sets up channels (funding multiple channels during without confirming them first seems to be problematic for example).

The good news is that simply jerry-rigging c-lightning to send announcements no matter the channel_flag and the usecltvexpiry branch for eclair increased success rate from 46/66 to 54/66, so we're getting closer :+1:

sstone commented 6 years ago

Yes we don't handle opening multiple channels in parallel very well and it becomes worse when multiple nodes are sharing the same bitcoind daemon(having one bitcoind per node, all connected together would improve things)

Roasbeef commented 6 years ago

Just pushed a commit to lnd which should address #5: https://github.com/lightningnetwork/lnd/compare/1b69940a8481...3764e5c65e90

sstone commented 6 years ago

Commit https://github.com/ACINQ/eclair/commit/340a16fa78718e9e1148a7c5224485c6125e4a3e should take care of the errors you see when funding multiple channels in parallel (should fix all failing tests more than 1 eclair node)

cdecker commented 6 years ago

Ok, currently testing all new versions on master, now with serial opening of channels. Will report new tests, and close this one for now.

cdecker / lightning-integration

[TRIAGE] New failures following migration to BOLT11 style payment requests #7